# LAP-3B

**Language-Action Pre-Training Enables Zero-Shot Cross-Embodiment Transfer**
🌐 Website: https://lap-vla.github.io/
📄 Paper: https://arxiv.org/abs/2602.10556
💻 Code: https://github.com/lihzha/lap
## Download

You can download the LAP checkpoint directly from the Hugging Face Hub.

### Using the Hugging Face CLI (recommended)

Install the Hugging Face Hub CLI:

```bash
pip install -U huggingface_hub
```

Download the checkpoint to the expected directory:

```bash
hf download lihzha/LAP-3B --local-dir ./checkpoint/lap
```

After downloading, the checkpoint will be located at `./checkpoint/lap`, which matches the default path expected by the LAP codebase.
### Alternative: Python API

You can also download the checkpoint programmatically:

```python
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="lihzha/LAP-3B",
    local_dir="./checkpoint/lap",
)
```
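After downloading with either method, a quick sanity check can confirm that the files landed where the codebase expects them. The helper below is a hypothetical convenience function, not part of the LAP codebase:

```python
from pathlib import Path


def checkpoint_ready(local_dir: str = "./checkpoint/lap") -> bool:
    """Return True if the checkpoint directory exists and contains files."""
    p = Path(local_dir)
    return p.is_dir() and any(p.iterdir())


# Example usage before launching evaluation or fine-tuning:
if not checkpoint_ready():
    print("Checkpoint missing; run `hf download lihzha/LAP-3B --local-dir ./checkpoint/lap` first.")
```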
## Model Summary

LAP-3B is a Vision-Language-Action (VLA) model trained with Language-Action Pre-Training (LAP). LAP represents robot actions as language-actions, which lets the model preserve the semantic reasoning capabilities of large vision-language models while learning robot control. This design enables strong zero-shot transfer across robot embodiments: the same model generalizes to different robots without architecture changes or embodiment-specific fine-tuning.

For model and training details, please refer to our paper.
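To make the "language-action" idea concrete, here is a minimal sketch of serializing a continuous end-effector command as text that a language model can consume. The field names and number formatting are illustrative assumptions for this sketch; LAP's actual action-to-text scheme is described in the paper and may differ:

```python
def action_to_language(action, precision=2):
    """Serialize a 7-DoF end-effector action as a text string.

    Illustrative sketch only: the field names and formatting here are
    assumptions, not LAP's actual language-action representation.
    """
    names = ["dx", "dy", "dz", "droll", "dpitch", "dyaw", "gripper"]
    return " ".join(f"{n}={v:.{precision}f}" for n, v in zip(names, action))


# Example: a small translation with a slight yaw and the gripper closing.
print(action_to_language([0.05, -0.02, 0.00, 0.0, 0.0, 0.1, 1.0]))
# → dx=0.05 dy=-0.02 dz=0.00 droll=0.00 dpitch=0.00 dyaw=0.10 gripper=1.00
```

Because actions share the model's text interface, the same output head can drive robots with different kinematics, which is what makes embodiment-agnostic transfer possible in principle.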
## Key Capabilities

LAP-3B demonstrates strong cross-embodiment generalization across multiple real robot platforms.

Highlights from the paper:

- Over 50% average zero-shot success rate on unseen robots
- ~2× improvement over prior VLA models on cross-embodiment benchmarks
- Successful deployment on multiple robot platforms, including:
  - Franka Panda
  - Kinova
  - YAM
  - DROID

Supported manipulation tasks include:

- pick and place
- object sorting
- container placement
- towel manipulation
## Limitations

While LAP-3B demonstrates strong cross-embodiment transfer, several limitations remain:

- Current experiments focus primarily on single-arm manipulation.
- Performance may degrade in settings involving highly dexterous manipulation or rich visual distractors.

Future work includes extending LAP to more complex embodiments and tasks, including:

- bimanual robots
- dexterous hands
- mobile manipulation systems
## Intended Use

LAP-3B is intended for research on:

- robot learning
- vision-language-action models
- cross-embodiment policy learning
- manipulation policies

The model is not intended for safety-critical deployment without additional validation.
## Citation

If you use LAP-3B in your research, please cite:

```bibtex
@article{zha2026lap,
  title={LAP: Language-Action Pre-Training Enables Zero-Shot Cross-Embodiment Transfer},
  author={Zha, Lihan and Hancock, Asher and Zhang, Mingtong and Yin, Tenny and Huang, Yixuan and Shah, Dhruv and Ren, Allen Z. and Majumdar, Anirudha},
  journal={arXiv preprint arXiv:2602.10556},
  year={2026}
}
```