Abstract
Vision-Language-Action (VLA) models excel in static manipulation but struggle in dynamic environments with moving targets. This performance gap primarily stems from a scarcity of dynamic manipulation datasets and the reliance of mainstream VLAs on single-frame observations, restricting their spatiotemporal reasoning capabilities. To address this, we introduce DOMINO, a large-scale dataset and benchmark for generalizable dynamic manipulation, featuring 35 tasks with hierarchical complexities, over 110K expert trajectories, and a multi-dimensional evaluation suite. Through comprehensive experiments, we systematically evaluate existing VLAs on dynamic tasks, explore effective training strategies for dynamic awareness, and validate the generalizability of dynamic data. Furthermore, we propose PUMA, a dynamics-aware VLA architecture. By integrating scene-centric historical optical flow and specialized world queries to implicitly forecast object-centric future states, PUMA couples history-aware perception with short-horizon prediction. Results demonstrate that PUMA achieves state-of-the-art performance, yielding a 6.3% absolute improvement in success rate over baselines. Moreover, we show that training on dynamic data fosters robust spatiotemporal representations that transfer to static tasks. All code and data are available at https://github.com/H-EmbodVis/DOMINO.
Community
Current Vision-Language-Action (VLA) models are great at static tasks but fail drastically when targets are moving. For example, the $\pi_0$ model sees its success rate plummet from 44.8% in static settings to just 7.5% in dynamic ones.
Our Solutions:
DOMINO Benchmark: We built a large-scale benchmark featuring 35 tasks with hierarchical dynamic complexities and over 110K expert trajectories.
PUMA Architecture: We propose a new dynamics-aware model that integrates scene-centric historical optical flow and specialized world queries to implicitly forecast object-centric future states (a toy sketch of this design follows the list below).
Key Results: PUMA achieves state-of-the-art performance, yielding a 6.3% absolute improvement in success rate over baselines. Additionally, we show that training on dynamic data fosters robust spatiotemporal representations that seamlessly transfer back to static tasks.
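To make the PUMA description concrete, here is a minimal PyTorch sketch of how a dynamics-aware head could combine the two ingredients above: a scene-centric optical-flow stream fused with the backbone's vision-language tokens, plus learnable "world query" tokens that attend to that history-aware context as an implicit short-horizon forecast. All module names, token sizes, and the single-step flow input are our own assumptions for illustration, not the released implementation (see the repo linked in the abstract for that).

```python
import torch
import torch.nn as nn

class DynamicsAwareHead(nn.Module):
    """Hypothetical sketch of a PUMA-style dynamics-aware policy head.

    1. Scene-centric historical optical flow is patchified into tokens
       and fused with the VLA backbone's vision-language tokens.
    2. Learnable "world queries" cross-attend to the fused context; an
       auxiliary training loss (not shown) could supervise them to
       predict future object states, yielding an implicit forecast.
    """

    def __init__(self, dim=512, n_world_queries=8, n_heads=8, action_dim=7):
        super().__init__()
        # 2-channel flow maps are patchified like the RGB frame.
        self.flow_proj = nn.Conv2d(2, dim, kernel_size=16, stride=16)
        self.world_queries = nn.Parameter(torch.randn(n_world_queries, dim))
        self.cross_attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.action_head = nn.Sequential(
            nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, action_dim)
        )

    def forward(self, vl_tokens, flow):
        # vl_tokens: (B, N, dim) tokens from the VLA backbone
        # flow:      (B, 2, H, W) flow between previous and current frame
        flow_tokens = self.flow_proj(flow).flatten(2).transpose(1, 2)
        context = torch.cat([vl_tokens, flow_tokens], dim=1)

        # World queries read out a history-aware, future-oriented summary.
        q = self.world_queries.unsqueeze(0).expand(context.size(0), -1, -1)
        future_repr, _ = self.cross_attn(q, context, context)

        # Pool the forecast tokens and decode an action.
        return self.action_head(future_repr.mean(dim=1))

# Smoke test with dummy inputs.
head = DynamicsAwareHead()
action = head(torch.randn(2, 196, 512), torch.randn(2, 2, 224, 224))
print(action.shape)  # torch.Size([2, 7])
```

The design choice worth noting is that the flow stream is scene-centric (computed over the whole frame, so the model sees both target and distractor motion), while the world queries are the only object-centric, future-facing component; keeping them separate is what lets a single architecture couple history-aware perception with short-horizon prediction.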
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- DynamicVLA: A Vision-Language-Action Model for Dynamic Object Manipulation (2026)
- StemVLA: An Open-Source Vision-Language-Action Model with Future 3D Spatial Geometry Knowledge and 4D Historical Representation (2026)
- FUTURE-VLA: Forecasting Unified Trajectories Under Real-time Execution (2026)
- AIR-VLA: Vision-Language-Action Systems for Aerial Manipulation (2026)
- DiT4DiT: Jointly Modeling Video Dynamics and Actions for Generalizable Robot Control (2026)
- LiLo-VLA: Compositional Long-Horizon Manipulation via Linked Object-Centric Policies (2026)
- RADAR: Benchmarking Vision-Language-Action Generalization via Real-World Dynamics, Spatial-Physical Intelligence, and Autonomous Evaluation (2026)