World Pilot: Steering Vision-Language-Action Models with World-Action Priors
Abstract
World Pilot enhances Vision-Language-Action models by incorporating dynamic scene evolution and trajectory priors from a World-Action Model, achieving superior performance in zero-shot out-of-distribution manipulation tasks.
Vision-Language-Action (VLA) models inherit semantic grounding from large-scale pretraining and perform competently across in-distribution manipulation tasks. This grounding, however, is built on static image-text pairs, whereas manipulation is a continuous, contact-rich process whose dynamics this pretraining cannot capture. We present World Pilot, a VLA framework that augments the policy with priors from a World-Action Model (WAM), routed into the decision chain through two complementary pathways. Latent Steering conditions the perception layer on a scene-evolution latent, and Action Steering supplies an anticipated trajectory as a motion prior to the action generator. Together the two priors equip the VLA with an anticipated view of the scene and a trajectory-level motion hint alongside its semantic conditioning, and the scene-evolution prior remains effective even when supplied by a video-pretrained world model that has not been action-post-trained. World Pilot attains a state-of-the-art Total success rate of 84.7% on the LIBERO-Plus zero-shot OOD benchmark and the highest success rate on every real-robot setting across four manipulation tasks, with the largest margins under shifts in viewpoint, geometry, deformable state, and pose. Project Website: https://world-pilot.github.io/
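The two pathways described above can be illustrated with a toy sketch. It is not the authors' implementation: the abstract does not say how either prior is injected, so the gated addition used for Latent Steering, the convex blend used for Action Steering, and every name here (`WamPriors`, `latent_steering`, `toy_action_generator`, `steered_step`, `gate`, `prior_weight`) are assumptions made purely for illustration.

```python
from dataclasses import dataclass
from typing import List

Vector = List[float]


@dataclass
class WamPriors:
    """Hypothetical container for the two priors a World-Action Model supplies."""
    scene_latent: Vector       # scene-evolution latent, used by Latent Steering
    trajectory: List[Vector]   # anticipated trajectory, used by Action Steering


def latent_steering(visual_features: Vector, scene_latent: Vector,
                    gate: float = 0.5) -> Vector:
    """Condition perception features on the scene-evolution latent.

    Assumption: a gated element-wise addition stands in for the real mechanism.
    """
    if len(visual_features) != len(scene_latent):
        raise ValueError("features and latent must have the same length")
    return [f + gate * z for f, z in zip(visual_features, scene_latent)]


def toy_action_generator(features: Vector, prior_trajectory: List[Vector],
                         prior_weight: float = 0.5) -> List[Vector]:
    """Stand-in action generator that receives the trajectory as a motion prior.

    Assumption: each action is a convex blend of a feature-derived command
    (here simply the feature mean) and the matching prior waypoint.
    """
    command = sum(features) / len(features)
    return [
        [(1.0 - prior_weight) * command + prior_weight * p for p in waypoint]
        for waypoint in prior_trajectory
    ]


def steered_step(visual_features: Vector, priors: WamPriors,
                 gate: float = 0.5, prior_weight: float = 0.5) -> List[Vector]:
    """Run one decision step with both pathways active."""
    steered = latent_steering(visual_features, priors.scene_latent, gate)
    return toy_action_generator(steered, priors.trajectory, prior_weight)


if __name__ == "__main__":
    priors = WamPriors(scene_latent=[2.0, 2.0],
                       trajectory=[[1.0, 1.0], [5.0, 5.0]])
    print(steered_step([1.0, 3.0], priors))
```

Setting `gate` or `prior_weight` to zero switches off the corresponding pathway in this toy, which is one simple way to probe each prior's contribution separately.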
Community
Cool Embodied Manipulation framework
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- PriorVLA: Prior-Preserving Adaptation for Vision-Language-Action Models (2026)
- Being-H0.7: A Latent World-Action Model from Egocentric Videos (2026)
- AHA-WAM: Asynchronous Horizon-Adaptive World-Action Modeling with Observation-Guided Context Routing (2026)
- EvoScene-VLA: Evolving Scene Beliefs Inside the Action Decoder for Chunked Robot Control (2026)
- RepWAM: World Action Modeling with Representation Visual-Action Tokenizers (2026)
- OASIS: Observation-Action Space Alignment via SE(3) Trajectory Prediction for Robotic Manipulation (2026)
- MotionWAM: Towards Foundation World Action Models for Real-Time Humanoid Loco-Manipulation (2026)
Datasets citing this paper 1
Chedan86/WorldPilot-LIBERO-precompute