Hybrid ACT+Diffusion: ALOHA Single-Arm (Left), 40k steps

Custom HybridACTDiffusion policy: an ACT visual encoder (ResNet18 + 4-layer Transformer, mean-pooled) feeds a Diffusion U-Net decoder (FiLM conditioning, DDPM training, DDIM 10-step inference). There is no VAE: diffusion handles multimodal action distributions directly.
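The noise schedule and the 10-step inference grid can be sketched in a few lines. This is a minimal illustration, not the policy's code: it assumes the usual cosine formulation behind the squaredcos_cap_v2 name listed in the training config, a beta cap of 0.999, and evenly strided DDIM timesteps. The function names are hypothetical.

```python
import math

NUM_TRAIN_TIMESTEPS = 100  # from the model card
NUM_INFERENCE_STEPS = 10   # from the model card
MAX_BETA = 0.999           # assumption: cap used by the common cosine schedule


def alpha_bar(t: float) -> float:
    """Cosine cumulative signal level at continuous time t in [0, 1]."""
    return math.cos((t + 0.008) / 1.008 * math.pi / 2) ** 2


def cosine_betas(num_steps: int, max_beta: float = MAX_BETA) -> list:
    """Per-step noise variances derived from ratios of alpha_bar, capped at max_beta."""
    betas = []
    for i in range(num_steps):
        t1, t2 = i / num_steps, (i + 1) / num_steps
        betas.append(min(1.0 - alpha_bar(t2) / alpha_bar(t1), max_beta))
    return betas


def ddim_timesteps(num_train: int, num_inference: int) -> list:
    """Evenly strided timesteps, highest noise first (assumed spacing)."""
    stride = num_train // num_inference
    return [i * stride for i in reversed(range(num_inference))]


betas = cosine_betas(NUM_TRAIN_TIMESTEPS)
timesteps = ddim_timesteps(NUM_TRAIN_TIMESTEPS, NUM_INFERENCE_STEPS)
print(timesteps)  # [90, 80, 70, 60, 50, 40, 30, 20, 10, 0]
```

Training samples any of the 100 timesteps, while inference visits only the 10 strided ones, which is where the 10x reduction in denoising passes comes from.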

This is the 40k-step retrain (workstream S004). It matches the step count of S003, the shipped ACT-40k baseline, so the two architectures can be compared directly. For the initial 13.4k-step baseline, see JHeisler/aloha_solo_left_act_diffusion.

Architecture

Images (cam_high, cam_left_wrist) + State (dim=9)
     │
     ▼
ACT Encoder (ResNet18 → 4-layer Transformer) → mean-pool → (B, 512) global cond vector
     │
     ▼
Diffusion U-Net (DiffusionConditionalUnet1d, FiLM modulation, down_dims=(256,512))
     │  DDPM training (100 timesteps) / DDIM 10-step inference
     ▼
Action chunks (chunk_size=100, action_dim=9)
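The two conditioning steps in the diagram, mean-pooling the encoder tokens and FiLM modulation, can be illustrated with toy sizes in plain Python. This is a sketch of the general technique, not the code in DiffusionConditionalUnet1d: the helper names (mean_pool, linear, film) and the tiny dimensions are hypothetical, and the real model uses a 512-dim conditioning vector with learned weights.

```python
def mean_pool(tokens):
    """Average a list of token vectors (T x D) into one D-dim vector."""
    n = len(tokens)
    return [sum(col) / n for col in zip(*tokens)]


def linear(x, weight, bias):
    """y = W x + b, with the weight stored as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weight, bias)]


def film(features, cond, weight, bias):
    """FiLM: per-channel scale and shift predicted from the conditioning vector.

    features: C x L feature map, cond: D-dim vector,
    weight: 2C x D projection, bias: 2C offsets.
    """
    channels = len(features)
    params = linear(cond, weight, bias)
    scale, shift = params[:channels], params[channels:]
    return [[scale[c] * v + shift[c] for v in features[c]]
            for c in range(channels)]


# Toy example: 2 encoder tokens of width 2, 2 feature channels of length 3.
tokens = [[1.0, 3.0], [3.0, 5.0]]
cond = mean_pool(tokens)                      # [2.0, 4.0]
features = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
weight = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0], [0.0, 0.0]]
bias = [0.0, 0.0, 0.5, -0.5]
modulated = film(features, cond, weight, bias)
print(modulated)  # [[2.5, 4.5, 6.5], [15.5, 19.5, 23.5]]
```

Because the conditioning vector is global, every position along the action chunk in a given channel receives the same scale and shift.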

Training Config

| Field | Value |
|---|---|
| Architecture | HybridACTDiffusion (ACT encoder + Diffusion U-Net); see lerobot/common/policies/hybrid_act_diffusion/ |
| Dataset | JHeisler/aloha_solo_left_4_6_26: 50 episodes, 29,785 samples, 30 fps |
| State / action dim | 9 / 9 |
| Cameras | cam_high, cam_left_wrist (3×480×640 each) |
| Steps | 40,000 |
| Batch size | 28 (adaptive DOE winner; 6.8% higher throughput than bs=24, at 91.3 samples/s) |
| Learning rate | 3.5e-5 (linear-scaled from bs=24's 3e-5) |
| Total samples seen | 1.12M (about 37.6 epochs over the dataset) |
| AMP | enabled |
| torch.compile | enabled |
| Save freq | every 10,000 steps (10k / 20k / 30k / 40k checkpoints) |
| Diffusion scheduler | DDPM training (100 timesteps, squaredcos_cap_v2), DDIM at inference (10 steps) |
| Final loss (DDPM noise-pred MSE) | 0.003–0.007 |
| Final grad norm | ~0.10–0.18 |
| Wall clock | ~3h 53min on RTX A4500 |
| LeRobot pin | 96c7052777aca85d4e55dfba8f81586103ba8f61 (with custom hybrid_act_diffusion policy added) |
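The derived figures in the config follow from simple arithmetic, sketched here with the values above. The variable names are illustrative, and the average throughput assumes the approximate wall-clock figure covers the whole run.

```python
BATCH_SIZE = 28
REF_BATCH_SIZE = 24
REF_LR = 3e-5
STEPS = 40_000
DATASET_SAMPLES = 29_785
WALL_CLOCK_S = 3 * 3600 + 53 * 60  # ~3h 53min, approximate

# Linear scaling rule: learning rate grows in proportion to batch size.
scaled_lr = REF_LR * BATCH_SIZE / REF_BATCH_SIZE
total_samples = STEPS * BATCH_SIZE
epochs = total_samples / DATASET_SAMPLES
avg_throughput = total_samples / WALL_CLOCK_S

print(f"{scaled_lr:.2e}")       # 3.50e-05
print(total_samples)            # 1120000
print(round(epochs, 1))         # 37.6
print(round(avg_throughput, 1)) # 80.1
```

The whole-run average (about 80 samples/s) comes out below the 91.3 samples/s quoted for the batch-size sweep. The card does not say why; one plausible reading is that the wall-clock time includes overhead outside the steady-state training loop.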

Project Lineage

| Workstream | Model | Steps | Samples seen | HF repo |
|---|---|---|---|---|
| S001 | ACT | 13,400 | 640K | act_left |
| S002 | Hybrid ACT+Diffusion | 13,400 | 321K | act_diffusion |
| S003 | ACT (shipped) | 40,000 | 1.92M | act_left_40k |
| S004 | Hybrid ACT+Diffusion | 40,000 | 1.12M | this repo |

S003 vs S004 is the apples-to-apples architectural comparison: same dataset and same step count, with ACT's VAE decoder versus the diffusion decoder. Samples seen still differ between the two runs (1.92M vs 1.12M).
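The lineage figures show how closely the runs are matched on data exposure. The sketch below assumes samples seen equals steps times batch size, so the batch sizes it prints are implied values, not figures stated for S001 to S003.

```python
# (steps, samples seen) per workstream, taken from the lineage table.
lineage = {
    "S001": (13_400, 640_000),
    "S002": (13_400, 321_000),
    "S003": (40_000, 1_920_000),
    "S004": (40_000, 1_120_000),
}

# Implied batch size under the assumption samples = steps * batch_size.
implied_batch = {name: round(samples / steps)
                 for name, (steps, samples) in lineage.items()}
print(implied_batch)  # {'S001': 48, 'S002': 24, 'S003': 48, 'S004': 28}

# How many more samples the ACT baseline saw at the same step count.
sample_ratio = lineage["S003"][1] / lineage["S004"][1]
print(round(sample_ratio, 2))  # 1.71
```

At equal step counts, the ACT baseline therefore saw roughly 1.7 times as many samples as this model, which is worth keeping in mind when reading the comparison.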

Notes on loss comparability

DDPM noise-prediction MSE (this model) and ACT's L1+KL combination (S001/S003) are different loss surfaces, so absolute loss values are NOT directly comparable across architectures. The right comparison is offline action L1 on held-out episodes or real-robot rollout success rate.
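An offline action L1 metric of the kind suggested here could look like the sketch below. It is not the project's evaluation code: mean_action_l1 is a hypothetical helper, and it assumes predicted and ground-truth action chunks are available as nested lists of shape (timesteps, action_dim) in the same units.

```python
def mean_action_l1(predicted, target):
    """Mean absolute error over every timestep and action dimension of a chunk."""
    if len(predicted) != len(target):
        raise ValueError("chunks must have the same number of timesteps")
    total, count = 0.0, 0
    for pred_step, true_step in zip(predicted, target):
        if len(pred_step) != len(true_step):
            raise ValueError("action dimensions must match")
        for p, t in zip(pred_step, true_step):
            total += abs(p - t)
            count += 1
    return total / count


# Toy chunk: 2 timesteps, 2 action dims (the real model uses 100 x 9).
predicted = [[0.0, 1.0], [2.0, 3.0]]
target = [[0.5, 1.0], [2.0, 2.0]]
score = mean_action_l1(predicted, target)
print(score)  # 0.375
```

Because both architectures output actions in the same space, this number is comparable across S003 and S004 in a way the training losses are not.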

Usage

# Requires lerobot pinned to 96c7052 with hybrid_act_diffusion policy package added
from lerobot.common.policies.hybrid_act_diffusion.modeling_hybrid_act_diffusion import HybridACTDiffusionPolicy
policy = HybridACTDiffusionPolicy.from_pretrained("JHeisler/aloha_solo_left_act_diffusion_40k")

Citation / Course

EN.525.681 school project, JHU Whiting School of Engineering. Team: Jake Heisler, Laura Kroening, Purushottam Shukla.

Code reference: HuggingFace LeRobot at commit 96c7052 with custom hybrid policy package.

Model size: 47.2M params (Safetensors, F32)