PiZero FM VLA model card

This model was developed by INSAIT and KU Leuven.

Code and model weights are provided under the Gemma Terms of Use.

This repo provides a fully Transformers-compatible export for the flow-matching (FM) policy.

Use with 🤗 Transformers

This export ships native Transformers AutoConfig/AutoModel/AutoProcessor wrappers, loaded with trust_remote_code=True as in the example below. It does not require an external databib installation.

Important FM behavior

FM inference is chunk-based: one call returns a full chunk of actions rather than a single action.
Unlike the autoregressive (AR) models, FM does not use reset_test_time_cache / refresh_test_time_vlm / next_test_time_action.
Use generate_action_chunk(...) (or a plain forward call) to get a full action chunk.
Call processor.preprocess_inputs(...) before inference and processor.postprocess_actions(...) after inference.
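
The chunk-based behavior above is often paired with a receding-horizon loop: request a chunk, execute some of its actions, then request a new chunk from the latest observation. The following is a minimal sketch of that pattern, not part of this repo. The names run_chunked_control, plan_chunk, execute and steps_per_chunk are hypothetical; in practice plan_chunk would wrap preprocess_inputs, generate_action_chunk and postprocess_actions, and here it is replaced by a stub so the sketch runs without the model.

```python
from typing import Callable, List, Tuple

Action = Tuple[float, ...]


def run_chunked_control(
    plan_chunk: Callable[[int], List[Action]],
    execute: Callable[[Action], None],
    total_steps: int,
    steps_per_chunk: int,
) -> int:
    """Execute total_steps actions, replanning after steps_per_chunk actions.

    plan_chunk(step) is a hypothetical callable that returns one action chunk
    for the observation at `step`. Returns the number of policy calls made.
    """
    if steps_per_chunk < 1:
        raise ValueError("steps_per_chunk must be at least 1")
    policy_calls = 0
    step = 0
    while step < total_steps:
        chunk = plan_chunk(step)  # one policy call yields a whole chunk
        policy_calls += 1
        if not chunk:
            raise RuntimeError("policy returned an empty chunk")
        # Execute only a prefix of the chunk, then replan.
        for action in chunk[:steps_per_chunk]:
            if step >= total_steps:
                break
            execute(action)
            step += 1
    return policy_calls


# Stub policy (assumption): returns 4 one-dimensional actions per call.
def fake_plan_chunk(step: int) -> List[Action]:
    return [(float(step + i),) for i in range(4)]


executed: List[Action] = []
policy_calls = run_chunked_control(
    fake_plan_chunk, executed.append, total_steps=10, steps_per_chunk=3
)
print(policy_calls, len(executed))
```

Executing fewer actions than the chunk contains trades extra policy calls for fresher observations; how many steps to execute per chunk for this model is not stated in this card.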

Example usage

import numpy as np
import torch
from PIL import Image
from transformers import AutoModel, AutoProcessor

model_id = "you2who/arboreal-green-raven"
device = "cuda"

# trust_remote_code=True is needed because the wrappers ship with the repo.
model = AutoModel.from_pretrained(model_id, trust_remote_code=True).to(device).eval()
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

image = Image.open("path/to/main_image.png").convert("RGB")

# Step 1: processor input preprocessing (placeholder robot state for one sample).
batch = processor.preprocess_inputs(
    chat=["pick up the cup", ""],
    images={"main": [image]},
    ee_pose_translation=np.zeros((1, 1, 3), dtype=np.float32),
    ee_pose_rotation=np.array([[[0.0, 0.0, 0.0, 1.0]]], dtype=np.float32),
    gripper=np.zeros((1, 1), dtype=np.float32),
    joints=np.zeros((1, 1, 7), dtype=np.float32),
    dataset_name=np.array(["bridge"]),
    inference_mode=True,
)

# Step 2: generate a full action chunk in a single call.
with torch.inference_mode():
    output = model.generate_action_chunk(
        input_ids=batch["input_ids"].to(device),
        # Reduce the processor's mask over dim 1 before passing it on.
        attention_mask=batch["attn_mask"].to(device).any(dim=1),
        images={k: v.to(device) for k, v in batch["images"].items()},
        ee_pose_translation=batch["ee_pose_translation"].to(device),
        ee_pose_rotation=batch["ee_pose_rotation"].to(device),
        # Add a trailing dimension to the gripper state.
        gripper=batch["gripper"].unsqueeze(-1).to(device),
        joints=batch["joints"].to(device),
        control_tokens_ids=batch["control_tokens_ids"],
    )

# Step 3: processor output postprocessing into a control plan.
control_plan = processor.postprocess_actions(
    model_output=output,
    dataset_name=np.array(["bridge"]),
)

print(control_plan.translation_m.shape, control_plan.rotmat.shape, control_plan.gripper_prob.shape)
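
The control plan exposes translation_m, rotmat and gripper_prob. As a hedged sketch of how a chunk might be consumed step by step, the snippet below iterates over one batch element and thresholds the gripper probability. The array layouts (batch, horizon, 3), (batch, horizon, 3, 3) and (batch, horizon) are assumptions, as are the 0.5 threshold, the iter_commands helper and the FakeControlPlan stand-in; check the shapes printed by the example above before relying on any of them.

```python
from dataclasses import dataclass

import numpy as np


@dataclass
class FakeControlPlan:
    """Hypothetical stand-in for the object returned by postprocess_actions."""

    translation_m: np.ndarray  # assumed shape: (batch, horizon, 3)
    rotmat: np.ndarray  # assumed shape: (batch, horizon, 3, 3)
    gripper_prob: np.ndarray  # assumed shape: (batch, horizon) or (batch, horizon, 1)


def iter_commands(plan, batch_index=0, threshold=0.5):
    """Yield (translation, rotation_matrix, gripper_above_threshold) per step."""
    translations = np.asarray(plan.translation_m)[batch_index]
    rotations = np.asarray(plan.rotmat)[batch_index]
    # Flatten so both (horizon,) and (horizon, 1) layouts are handled.
    gripper = np.asarray(plan.gripper_prob)[batch_index].reshape(-1)
    for translation, rotation, prob in zip(translations, rotations, gripper):
        yield translation, rotation, bool(prob > threshold)


# Synthetic plan with a horizon of 4 steps (values are illustrative only).
plan = FakeControlPlan(
    translation_m=np.zeros((1, 4, 3), dtype=np.float32),
    rotmat=np.tile(np.eye(3, dtype=np.float32), (1, 4, 1, 1)),
    gripper_prob=np.array([[0.1, 0.4, 0.6, 0.9]], dtype=np.float32),
)

commands = list(iter_commands(plan))
flags = [flag for _, _, flag in commands]
print(flags)
```

Whether a high gripper_prob means open or closed is not stated in this card, so the flag is left as a neutral "above threshold" value.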

Summary

Model type: Vision-Language-Action with flow-matching chunk generation
Inference API: generate_action_chunk + processor postprocessing
License: Gemma Terms of Use
Model size: 3B params
Tensor type: F32 (Safetensors)
