PiZero FM VLA model card

This model was developed by INSAIT and KU Leuven.

Code and model weights are provided under the Gemma license.

This repo provides a fully Transformers-compatible export for the flow-matching (FM) policy.

Use with 🤗 Transformers

This export uses the native Transformers AutoConfig, AutoModel, and AutoProcessor wrappers. It does not require an external databib installation.

Important FM behavior

  • FM inference is chunk-based.
  • Unlike autoregressive (AR) models, FM does not use reset_test_time_cache / refresh_test_time_vlm / next_test_time_action.
  • Use generate_action_chunk(...) (or plain forward) to get a full action chunk.
  • Run the processor's preprocess_inputs(...) before inference and its postprocess_actions(...) after inference.
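
Because each call returns a whole chunk, a controller typically executes some of the chunk and then re-plans from a fresh observation. The following is a minimal sketch of such a loop, not part of this repo: run_chunked_control, get_observation, predict_chunk, apply_action, and execute_per_chunk are hypothetical names, and predict_chunk stands in for the preprocess, generate_action_chunk, postprocess sequence described above.

```python
from typing import Any, Callable, List, Sequence


def run_chunked_control(
    get_observation: Callable[[], Any],
    predict_chunk: Callable[[Any], Sequence[Any]],
    apply_action: Callable[[Any], None],
    total_steps: int,
    execute_per_chunk: int,
) -> int:
    """Run a re-planning loop and return how many chunks were predicted.

    All callables are hypothetical stand-ins supplied by the caller.
    """
    if execute_per_chunk < 1:
        raise ValueError("execute_per_chunk must be at least 1")
    steps_done = 0
    chunks_predicted = 0
    while steps_done < total_steps:
        # One policy call yields a whole chunk of future actions.
        chunk = predict_chunk(get_observation())
        chunks_predicted += 1
        if len(chunk) == 0:
            raise ValueError("predict_chunk returned an empty chunk")
        # Execute only a prefix of the chunk, then re-plan.
        budget = min(execute_per_chunk, len(chunk), total_steps - steps_done)
        for action in chunk[:budget]:
            apply_action(action)
            steps_done += 1
    return chunks_predicted


# Demo with a fake policy: the "observation" is the number of actions
# executed so far, and each chunk holds four consecutive integers.
executed: List[int] = []
calls = run_chunked_control(
    get_observation=lambda: len(executed),
    predict_chunk=lambda obs: [obs + i for i in range(4)],
    apply_action=executed.append,
    total_steps=10,
    execute_per_chunk=3,
)
```

How many actions of a chunk to execute before re-planning is a deployment choice that this card does not specify; the value 3 above is only for the demo.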

Example usage

import numpy as np
import torch
from PIL import Image
from transformers import AutoModel, AutoProcessor

model_id = "you2who/arboreal-green-raven"

# trust_remote_code=True lets Transformers run the custom classes shipped with the repo.
model = AutoModel.from_pretrained(model_id, trust_remote_code=True).to("cuda").eval()
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

image = Image.open("path/to/main_image.png").convert("RGB")

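# Build model inputs from the instruction, the camera image, and the robot state.
# The zero-valued state arrays below are placeholders; pass real robot readings.
batch = processor.preprocess_inputs(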
    chat=["pick up the cup", ""],
    images={"main": [image]},
    ee_pose_translation=np.zeros((1, 1, 3), dtype=np.float32),
    ee_pose_rotation=np.array([[[0.0, 0.0, 0.0, 1.0]]], dtype=np.float32),
    gripper=np.zeros((1, 1), dtype=np.float32),
    joints=np.zeros((1, 1, 7), dtype=np.float32),
    dataset_name=np.array(["bridge"]),
    inference_mode=True,
)

# Generate a full action chunk in one call, with gradient tracking disabled.
with torch.inference_mode():
    output = model.generate_action_chunk(
        input_ids=batch["input_ids"].to("cuda"),
        attention_mask=batch["attn_mask"].to("cuda").any(dim=1),
        images={k: v.to("cuda") for k, v in batch["images"].items()},
        ee_pose_translation=batch["ee_pose_translation"].to("cuda"),
        ee_pose_rotation=batch["ee_pose_rotation"].to("cuda"),
        gripper=batch["gripper"].unsqueeze(-1).to("cuda"),
        joints=batch["joints"].to("cuda"),
        control_tokens_ids=batch["control_tokens_ids"],
    )

# Postprocess the raw model output, using the same dataset_name as in preprocessing.
control_plan = processor.postprocess_actions(
    model_output=output,
    dataset_name=np.array(["bridge"]),
)

print(control_plan.translation_m.shape, control_plan.rotmat.shape, control_plan.gripper_prob.shape)
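
A robot driver usually needs a discrete gripper command rather than a probability. The sketch below thresholds a flat sequence of per-step values. It is an illustration under assumptions: gripper_flags and the 0.5 threshold are hypothetical, control_plan.gripper_prob is assumed to hold values in [0, 1], and this card does not state whether a high value means open or closed.

```python
from typing import Iterable, List


def gripper_flags(probs: Iterable[float], threshold: float = 0.5) -> List[bool]:
    """Return True for each step whose value is at or above the threshold.

    Hypothetical helper; the meaning of True (open vs. closed) depends on
    the convention of the model output, which is not specified here.
    """
    if not 0.0 <= threshold <= 1.0:
        raise ValueError("threshold must lie in [0, 1]")
    return [float(p) >= threshold for p in probs]


# Stand-in values; with a real output one might pass something like
# control_plan.gripper_prob.reshape(-1).tolist(), assuming it is array-like.
flags = gripper_flags([0.1, 0.5, 0.93])
```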

Summary

  • Model type: Vision-Language-Action with flow-matching chunk generation
  • Inference API: generate_action_chunk + processor postprocessing
  • License: Gemma Terms of Use
  • Weights: Safetensors, 3B params, F32 tensors