# X-VLA: Soft-Prompted Transformer as Scalable Cross-Embodiment Vision-Language-Action Model
This repository hosts checkpoints and code for **X-VLA: Soft-Prompted Transformer as a Scalable Cross-Embodiment Vision-Language-Action Model**. Paper: https://arxiv.org/pdf/2510.10274
## Repository structure

```text
checkpoints/
  color_object/
    ckpt-30000/
      model.safetensors        # fine-tuned weights (step 30000)
      config.json
      tokenizer.json
      tokenizer_config.json
      vocab.json
      merges.txt
      preprocessor_config.json
      special_tokens_map.json
      state.json
models/                        # model architecture (Florence2 + X-VLA)
  configuration_florence2.py
  configuration_xvla.py
  modeling_florence2.py
  modeling_xvla.py
  processing_xvla.py
  action_hub.py
  transformer.py
deploy/X-VLA-Pt/               # base pretrained model config & code
evaluation/                    # eval clients for CALVIN, LIBERO, Simpler, etc.
slurm_scripts/                 # SLURM fine-tuning scripts for all benchmark splits
train.py                       # full training entry point
peft_train.py                  # LoRA / PEFT fine-tuning entry point
deploy.py                      # inference server launcher
requirements.txt
```
## Installation

```bash
git clone https://huggingface.co/yqi19/xvla
cd xvla
pip install -r requirements.txt
```
The checkpoint is already in this repo at `checkpoints/color_object/ckpt-30000/`. To download it programmatically:

```python
from huggingface_hub import snapshot_download

snapshot_download(repo_id="yqi19/xvla", local_dir="./xvla")
```
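If you only need the fine-tuned checkpoint rather than the whole repo, `snapshot_download` also accepts an `allow_patterns` filter. The pattern below is an assumption based on the repository layout shown above:

```python
from huggingface_hub import snapshot_download

# Fetch only the color_object checkpoint (pattern assumed from the repo layout;
# adjust if the layout changes).
snapshot_download(
    repo_id="yqi19/xvla",
    local_dir="./xvla",
    allow_patterns=["checkpoints/color_object/ckpt-30000/*"],
)
```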
## Inference

Load the model and processor with `trust_remote_code`, then start the bundled inference server:

```python
from transformers import AutoModel, AutoProcessor

model = AutoModel.from_pretrained(
    "checkpoints/color_object/ckpt-30000",
    trust_remote_code=True,
)
processor = AutoProcessor.from_pretrained(
    "checkpoints/color_object/ckpt-30000",
    trust_remote_code=True,
)

# Start the built-in inference server (provided by the repo's remote code).
model.run(processor, host="0.0.0.0", port=8000)
```
The inference endpoint is then available at `POST http://localhost:8000/act`.
Query the server with `json_numpy`-encoded observations:

```python
import requests
import numpy as np
import json_numpy

server_url = "http://localhost:8000/act"

proprio = np.zeros(7, dtype=np.float32)          # joint / EE state
image = np.zeros((256, 256, 3), dtype=np.uint8)  # RGB observation

payload = {
    "proprio": json_numpy.dumps(proprio),
    "language_instruction": "Pick up the red block and place it on the green object",
    "image0": json_numpy.dumps(image),
    "domain_id": 0,  # domain id used during training
    "steps": 10,     # diffusion denoising steps
}

response = requests.post(server_url, json=payload, timeout=10)
actions = np.array(response.json()["action"], dtype=np.float32)
print(f"Predicted actions shape: {actions.shape}")  # e.g. (30, 20)
```
## Action format

Each predicted action step is a 20-dimensional vector:

| Component | Dims | Description |
|---|---|---|
| EE position | 3 | xyz translation |
| EE rotation | 6 | 6D rotation representation |
| Gripper | 1 | binary open/close |
| Padding | 10 | zeros (unused for single-arm setups) |
| **Total** | **20** | per action step |
To post-process the rotation, convert a predicted step back to xyz Euler angles and a binary gripper command:

```python
import numpy as np

from datasets.utils import rotate6d_to_xyz  # repo utility

# action_pred: one 20-D action step; only the first 10 dims are used (single arm)
action_final = np.concatenate([
    action_pred[:3],                                   # EE position (xyz)
    rotate6d_to_xyz(action_pred[3:9]),                 # 6D rotation -> Euler xyz
    np.array([1.0 if action_pred[9] > 0.5 else 0.0]),  # gripper open/close
])
```
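For reference, `rotate6d_to_xyz` presumably follows the standard 6D rotation parameterization (Zhou et al., 2019): the two 3-vectors are Gram-Schmidt orthonormalized into the first two columns of a rotation matrix, the third column is recovered by a cross product, and the matrix is converted to Euler angles. A minimal NumPy sketch of that convention; the repo's own axis ordering may differ:

```python
import numpy as np

def rotate6d_to_matrix(r6):
    """Gram-Schmidt the two 3-vectors of a 6D rotation into a 3x3 matrix."""
    a1, a2 = r6[:3], r6[3:6]
    b1 = a1 / np.linalg.norm(a1)
    b2 = a2 - np.dot(b1, a2) * b1
    b2 = b2 / np.linalg.norm(b2)
    b3 = np.cross(b1, b2)
    return np.stack([b1, b2, b3], axis=-1)  # columns are the basis vectors

def matrix_to_xyz_euler(R):
    """Roll-pitch-yaw extraction, assuming R = Rz @ Ry @ Rx (gimbal lock ignored)."""
    x = np.arctan2(R[2, 1], R[2, 2])
    y = np.arcsin(np.clip(-R[2, 0], -1.0, 1.0))
    z = np.arctan2(R[1, 0], R[0, 0])
    return np.array([x, y, z])

R = rotate6d_to_matrix(np.array([1, 0, 0, 0, 1, 0], dtype=np.float64))
print(matrix_to_xyz_euler(R))  # identity rotation -> [0, 0, 0]
```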
## Fine-tuning

To fine-tune from the released checkpoint with `accelerate`:

```bash
accelerate launch \
    --mixed_precision bf16 \
    train.py \
    --models checkpoints/color_object/ckpt-30000 \
    --train_metas_path /path/to/meta_files.json \
    --learning_rate 1e-4 \
    --learning_coef 0.1 \
    --iters 50000 \
    --freeze_steps 1000 \
    --warmup_steps 2000
```

See `finetune_readme.md` for the full data preparation guide.
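For parameter-efficient fine-tuning, `peft_train.py` is the LoRA / PEFT entry point. A sketch assuming it accepts the same core flags as `train.py`; check the script's argument parser for the actual LoRA-specific options:

```bash
accelerate launch \
    --mixed_precision bf16 \
    peft_train.py \
    --models checkpoints/color_object/ckpt-30000 \
    --train_metas_path /path/to/meta_files.json \
    --learning_rate 1e-4 \
    --iters 50000
```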
## Citation

```bibtex
@article{zheng2025x,
  title   = {X-VLA: Soft-Prompted Transformer as Scalable Cross-Embodiment Vision-Language-Action Model},
  author  = {Zheng, Jinliang and Li, Jianxiong and Wang, Zhihao and others},
  journal = {arXiv preprint arXiv:2510.10274},
  year    = {2025}
}
```