# X-VLA: Soft-Prompted Transformer as Scalable Cross-Embodiment Vision-Language-Action Model
This repository hosts checkpoints and code for **X-VLA: Soft-Prompted Transformer as a Scalable Cross-Embodiment Vision-Language-Action Model**. Paper: https://arxiv.org/pdf/2510.10274
## Repository structure

```text
checkpoints/
  color_object/
    ckpt-30000/
      model.safetensors        # fine-tuned weights (step 30000)
      config.json
      tokenizer.json
      tokenizer_config.json
      vocab.json
      merges.txt
      preprocessor_config.json
      special_tokens_map.json
      state.json
models/                        # model architecture (Florence2 + X-VLA)
  configuration_florence2.py
  configuration_xvla.py
  modeling_florence2.py
  modeling_xvla.py
  processing_xvla.py
  action_hub.py
  transformer.py
deploy/X-VLA-Pt/               # base pretrained model config & code
evaluation/                    # eval clients for CALVIN, LIBERO, Simpler, etc.
slurm_scripts/                 # SLURM fine-tuning scripts for all benchmark splits
train.py                       # full training entry point
peft_train.py                  # LoRA / PEFT fine-tuning entry point
deploy.py                      # inference server launcher
requirements.txt
```
## Installation

```bash
git clone https://huggingface.co/yqi19/xvla
cd xvla
pip install -r requirements.txt
```
The checkpoint is already in this repo at `checkpoints/color_object/ckpt-30000/`. To download it programmatically:

```python
from huggingface_hub import snapshot_download

snapshot_download(repo_id="yqi19/xvla", local_dir="./xvla")
```
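If you only need the fine-tuned checkpoint rather than the whole repo, `snapshot_download` also accepts an `allow_patterns` filter. The pattern below is an assumption based on the repository layout shown above:

```python
from huggingface_hub import snapshot_download

# Fetch only the color_object checkpoint (pattern assumed from the repo layout;
# adjust if the layout changes).
snapshot_download(
    repo_id="yqi19/xvla",
    local_dir="./xvla",
    allow_patterns=["checkpoints/color_object/ckpt-30000/*"],
)
```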
## Inference

Load the model and processor with `trust_remote_code`, then start the bundled inference server:

```python
from transformers import AutoModel, AutoProcessor

model = AutoModel.from_pretrained(
    "checkpoints/color_object/ckpt-30000",
    trust_remote_code=True,
)
processor = AutoProcessor.from_pretrained(
    "checkpoints/color_object/ckpt-30000",
    trust_remote_code=True,
)

# Start the built-in inference server (provided by the repo's remote code).
model.run(processor, host="0.0.0.0", port=8000)
```
The inference endpoint is then available at `POST http://localhost:8000/act`.
Query the server with `json_numpy`-encoded observations:

```python
import requests
import numpy as np
import json_numpy

server_url = "http://localhost:8000/act"

proprio = np.zeros(7, dtype=np.float32)          # joint / EE state
image = np.zeros((256, 256, 3), dtype=np.uint8)  # RGB observation

payload = {
    "proprio": json_numpy.dumps(proprio),
    "language_instruction": "Pick up the red block and place it on the green object",
    "image0": json_numpy.dumps(image),
    "domain_id": 0,  # domain id used during training
    "steps": 10,     # diffusion denoising steps
}

response = requests.post(server_url, json=payload, timeout=10)
actions = np.array(response.json()["action"], dtype=np.float32)
print(f"Predicted actions shape: {actions.shape}")  # e.g. (30, 20)
```
## Action format

Each predicted action step is a 20-dimensional vector:

| Component | Dims | Description |
|---|---|---|
| EE position | 3 | xyz translation |
| EE rotation | 6 | 6D rotation representation |
| Gripper | 1 | binary open/close |
| Padding | 10 | zeros (unused for single-arm setups) |
| **Total** | **20** | per action step |
To post-process the rotation, convert a predicted step back to xyz Euler angles and a binary gripper command:

```python
import numpy as np

from datasets.utils import rotate6d_to_xyz  # repo utility

# action_pred: one 20-D action step; only the first 10 dims are used (single arm)
action_final = np.concatenate([
    action_pred[:3],                                   # EE position (xyz)
    rotate6d_to_xyz(action_pred[3:9]),                 # 6D rotation -> Euler xyz
    np.array([1.0 if action_pred[9] > 0.5 else 0.0]),  # gripper open/close
])
```
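For reference, `rotate6d_to_xyz` presumably follows the standard 6D rotation parameterization (Zhou et al., 2019): the two 3-vectors are Gram-Schmidt orthonormalized into the first two columns of a rotation matrix, the third column is recovered by a cross product, and the matrix is converted to Euler angles. A minimal NumPy sketch of that convention; the repo's own axis ordering may differ:

```python
import numpy as np

def rotate6d_to_matrix(r6):
    """Gram-Schmidt the two 3-vectors of a 6D rotation into a 3x3 matrix."""
    a1, a2 = r6[:3], r6[3:6]
    b1 = a1 / np.linalg.norm(a1)
    b2 = a2 - np.dot(b1, a2) * b1
    b2 = b2 / np.linalg.norm(b2)
    b3 = np.cross(b1, b2)
    return np.stack([b1, b2, b3], axis=-1)  # columns are the basis vectors

def matrix_to_xyz_euler(R):
    """Roll-pitch-yaw extraction, assuming R = Rz @ Ry @ Rx (gimbal lock ignored)."""
    x = np.arctan2(R[2, 1], R[2, 2])
    y = np.arcsin(np.clip(-R[2, 0], -1.0, 1.0))
    z = np.arctan2(R[1, 0], R[0, 0])
    return np.array([x, y, z])

R = rotate6d_to_matrix(np.array([1, 0, 0, 0, 1, 0], dtype=np.float64))
print(matrix_to_xyz_euler(R))  # identity rotation -> [0, 0, 0]
```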
## Fine-tuning

To fine-tune from the released checkpoint with `accelerate`:

```bash
accelerate launch \
    --mixed_precision bf16 \
    train.py \
    --models checkpoints/color_object/ckpt-30000 \
    --train_metas_path /path/to/meta_files.json \
    --learning_rate 1e-4 \
    --learning_coef 0.1 \
    --iters 50000 \
    --freeze_steps 1000 \
    --warmup_steps 2000
```

See `finetune_readme.md` for the full data preparation guide.
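For parameter-efficient fine-tuning, `peft_train.py` is the LoRA / PEFT entry point. A sketch assuming it accepts the same core flags as `train.py`; check the script's argument parser for the actual LoRA-specific options:

```bash
accelerate launch \
    --mixed_precision bf16 \
    peft_train.py \
    --models checkpoints/color_object/ckpt-30000 \
    --train_metas_path /path/to/meta_files.json \
    --learning_rate 1e-4 \
    --iters 50000
```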
## Citation

```bibtex
@article{zheng2025x,
  title   = {X-VLA: Soft-Prompted Transformer as Scalable Cross-Embodiment Vision-Language-Action Model},
  author  = {Zheng, Jinliang and Li, Jianxiong and Wang, Zhihao and others},
  journal = {arXiv preprint arXiv:2510.10274},
  year    = {2025}
}
```