LAP-3B

Language-Action Pre-Training Enables Zero-Shot Cross-Embodiment Transfer

🌐 Website: https://lap-vla.github.io/
📄 Paper: https://arxiv.org/abs/2602.10556
💻 Code: https://github.com/lihzha/lap

Download

You can download the LAP checkpoint directly from the Hugging Face Hub.

Using the Hugging Face CLI (recommended)

Install the Hugging Face Hub CLI:

pip install -U huggingface_hub

Download the checkpoint to the expected directory:

hf download lihzha/LAP-3B --local-dir ./checkpoint/lap

After downloading, the checkpoint will be located at:

./checkpoint/lap

This matches the default path expected by the LAP codebase.
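As a quick sanity check after downloading, you can count the files under the checkpoint directory. This helper is purely illustrative and is not part of the LAP codebase:

```python
from pathlib import Path

def count_checkpoint_files(ckpt_dir="./checkpoint/lap"):
    """Return the number of files under the checkpoint directory,
    or -1 if the directory does not exist."""
    ckpt = Path(ckpt_dir)
    if not ckpt.is_dir():
        return -1
    return sum(1 for p in ckpt.rglob("*") if p.is_file())

n = count_checkpoint_files()
if n < 0:
    print("Checkpoint directory not found; re-run the download.")
else:
    print(f"Found {n} files under ./checkpoint/lap")
```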

Alternative: Python API

You can also download the checkpoint programmatically:

from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="lihzha/LAP-3B",
    local_dir="./checkpoint/lap"
)

Model Summary

LAP-3B is a Vision-Language-Action (VLA) model trained with Language-Action Pre-Training (LAP). LAP represents robot actions in language form ("language-actions"), which lets the model preserve the semantic reasoning capabilities of large vision-language models while it learns robot control.

This design enables strong zero-shot transfer across robot embodiments: the same model generalizes to different robots without architecture changes or embodiment-specific fine-tuning.
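To make the "actions as language" idea concrete, the sketch below discretizes a continuous action vector into text-like tokens that a language model could predict as ordinary text, then decodes them back. This is an illustrative toy only; the token format, bin count, and normalization here are assumptions, not the actual LAP tokenization (see the paper for details):

```python
import numpy as np

def action_to_language(action, low=-1.0, high=1.0, bins=256):
    """Map each action dimension to an integer bin and render it
    as a sequence of placeholder tokens like <act_140>."""
    clipped = np.clip(action, low, high)
    ids = np.round((clipped - low) / (high - low) * (bins - 1)).astype(int)
    return " ".join(f"<act_{i}>" for i in ids)

def language_to_action(text, low=-1.0, high=1.0, bins=256):
    """Invert the mapping: parse bin tokens back to continuous values."""
    ids = np.array([int(tok[5:-1]) for tok in text.split()])
    return low + ids / (bins - 1) * (high - low)

# Round-trip a 7-DoF action through the language representation.
action = np.array([0.1, -0.5, 0.0, 0.9, -1.0, 0.3, 1.0])
text = action_to_language(action)
recovered = language_to_action(text)
print(text)
print(recovered)
```

The round-trip error is bounded by half the bin width, which is why coarse discretization can still support precise control when the bin count is large enough.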

For model and training details, please refer to our paper.


Key Capabilities

LAP-3B demonstrates strong cross-embodiment generalization across multiple real robot platforms.

Highlights from the paper:

  • >50% average zero-shot success rate on unseen robots
  • ~2× improvement over prior VLA models on cross-embodiment benchmarks
  • Successful deployment on multiple robot platforms, including:
    • Franka Panda
    • Kinova
    • YAM
    • DROID

Supported manipulation tasks include:

  • pick and place
  • object sorting
  • container placement
  • towel manipulation

Limitations

While LAP-3B demonstrates strong cross-embodiment transfer, several limitations remain:

  • Current experiments primarily focus on single-arm manipulation.
  • Performance may degrade in settings involving highly dexterous manipulation or rich visual distractors.

Future work includes extending LAP to more complex embodiments and tasks, including:

  • bimanual robots
  • dexterous hands
  • mobile manipulation systems

Intended Use

LAP-3B is intended for:

  • research in robot learning
  • vision-language-action models
  • cross-embodiment policy learning
  • manipulation policy research

The model is not intended for safety-critical deployments without additional validation.

Citation

If you use LAP-3B in your research, please cite:

@article{zha2026lap,
  title={LAP: Language-Action Pre-Training Enables Zero-Shot Cross-Embodiment Transfer},
  author={Zha, Lihan and Hancock, Asher and Zhang, Mingtong and Yin, Tenny and Huang, Yixuan and Shah, Dhruv and Ren, Allen Z. and Majumdar, Anirudha},
  journal={arXiv preprint arXiv:2602.10556},
  year={2026}
}