JoyAI-VL-Interaction: Real-Time Vision-Language Interaction Intelligence Paper • 2606.14777 • Published 13 days ago • 197
UniDDT: Unifying Multimodal Understanding and Generation with Decoupled Diffusion Transformer Paper • 2606.16255 • Published 8 days ago • 14
Qwen-RobotWorld Technical Report: Unifying Embodied World Modeling through Language-Conditioned Video Generation Paper • 2606.17030 • Published 8 days ago • 28
Beyond Scalar Rewards by Internalizing Reasoning into Score Distributions Paper • 2606.09076 • Published 15 days ago • 61
Flow-DPPO: Divergence Proximal Policy Optimization for Flow Matching Models Paper • 2606.11025 • Published 14 days ago • 41
SANA-Streaming: Real-time Streaming Video Editing with Hybrid Diffusion Transformer Paper • 2605.30409 • Published 26 days ago • 41
Representation Forcing for Bottleneck-Free Unified Multimodal Models Paper • 2605.31604 • Published 25 days ago • 61
Qwen-VLA: Unifying Vision-Language-Action Modeling across Tasks, Environments, and Robot Embodiments Paper • 2605.30280 • Published 26 days ago • 146
Geo-Align: Video Generation Alignment via Metric Geometry Reward Paper • 2605.23903 • Published May 22 • 10
Lens: Rethinking Training Efficiency for Foundational Text-to-Image Models Paper • 2605.21573 • Published May 20 • 111
SANA-WM: Efficient Minute-Scale World Modeling with Hybrid Linear Diffusion Transformer Paper • 2605.15178 • Published May 14 • 91
Continuous-Time Distribution Matching for Few-Step Diffusion Distillation Paper • 2605.06376 • Published May 7 • 27
D-OPSD: On-Policy Self-Distillation for Continuously Tuning Step-Distilled Diffusion Models Paper • 2605.05204 • Published May 6 • 28
Lightning Unified Video Editing via In-Context Sparse Attention Paper • 2605.04569 • Published May 6 • 18