
Motif-Video 2B

A micro-budget text-to-video diffusion transformer from Motif Technologies

πŸ“‘ Technical Report (coming soon)  |  πŸ€— Hugging Face  |  🌐 Project Page (not updated)


πŸ”₯ News

  • [2026-04-14] We release Motif-Video 2B, our 2B-parameter text-to-video and image-to-video diffusion transformer, together with the full technical report.

πŸ“– Introduction

Training strong video generation models usually requires massive datasets, large parameter counts, and substantial compute. Motif-Video 2B asks whether competitive text-to-video quality is reachable at a much smaller budget β€” fewer than 10M training clips and under 100,000 H200 GPU hours β€” and shows that the answer is yes, provided the model design explicitly separates objectives that scaling would otherwise leave entangled.

Our central observation is that prompt alignment, temporal consistency, and fine-detail recovery interfere with one another when handled through the same pathway. Motif-Video 2B addresses this objective interference architecturally rather than relying on scale alone, through two contributions:

  • Shared Cross-Attention. A residual cross-attention mechanism that reuses self-attention K/V weights to stabilize text–video alignment under long-context token sparsity, where standard joint attention dilutes text influence as the video token sequence grows.
  • Three-stage DDT-style backbone. 12 dual-stream + 16 single-stream + 8 DDT decoder layers, separating early modality fusion, joint representation learning, and high-frequency detail reconstruction into dedicated components. Per-block attention analysis shows that the DDT decoder spontaneously develops inter-frame attention structure absent from the encoder layers.

These are paired with a micro-budget training recipe combining TREAD token routing and early-phase REPA with a frozen V-JEPA teacher β€” to our knowledge, the first time this combination has been applied to text-to-video training.
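The routing idea behind TREAD can be illustrated with a short sketch. This is a schematic of the mechanism only, under our own simplified naming, not the released training code: during training, a random subset of tokens bypasses a span of transformer blocks and is scattered back afterward, which is where the per-step FLOP savings come from.

```python
import torch

def tread_route(tokens: torch.Tensor, blocks, keep_ratio: float = 0.5):
    """Schematic TREAD-style token routing (illustrative, not the real recipe).

    A random subset of tokens is processed by `blocks`; the rest bypass them
    entirely and are merged back afterward, reducing per-step FLOPs.
    """
    B, N, D = tokens.shape
    n_keep = max(1, int(N * keep_ratio))
    # Random permutation per batch element; the first n_keep indices are routed.
    idx = torch.argsort(torch.rand(B, N, device=tokens.device), dim=1)
    keep_idx = idx[:, :n_keep]
    routed = torch.gather(tokens, 1, keep_idx.unsqueeze(-1).expand(-1, -1, D))
    for block in blocks:
        routed = block(routed)
    # Bypassed tokens pass through unchanged; routed tokens are scattered back.
    out = tokens.clone()
    out.scatter_(1, keep_idx.unsqueeze(-1).expand(-1, -1, D), routed)
    return out

x = torch.randn(2, 16, 8)
blocks = [torch.nn.Linear(8, 8) for _ in range(2)]  # stand-ins for DiT blocks
y = tread_route(x, blocks, keep_ratio=0.5)
```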

On VBench, Motif-Video 2B reaches 83.76%, the highest Total Score among open-source models we evaluate, surpassing Wan2.1-14B with 7Γ— fewer parameters and roughly an order of magnitude less training data.

Motif-Video 2B architecture


✨ Highlights

  • Two tasks, one set of weights. A single checkpoint handles both text-to-video (T2V) and image-to-video (I2V) generation, trained jointly without a learnable task-type embedding.
  • Up to 720p, 121 frames. The final model generates 720p video at 121 frames under the standard rectified flow-matching sampler.
  • Architectural specialization over brute-force scale. Three-stage backbone with role-separated dual-stream / single-stream / DDT decoder layers.
  • Shared Cross-Attention. Stabilizes text alignment under long video-token sequences by grounding cross-attention K/V in the self-attention manifold.
  • Micro-budget recipe. TREAD token routing (β‰ˆ27% per-step FLOP reduction) + early-phase REPA with V-JEPA teacher + offline bucket-balanced sampler (β‰ˆ90% data utilization, up from β‰ˆ20% baseline).
  • Open and reproducible. Trained on ~64Γ—H200 GPUs with FSDP2, full curriculum and recipe documented in the technical report.

πŸ—οΈ Architecture

Motif-Video 2B is a flow-matching diffusion transformer organized around a single principle: each component is assigned a well-defined responsibility, and components with conflicting objectives are not asked to share capacity.

| Component | Choice |
|---|---|
| Text encoder | T5Gemma2 (encoder–decoder, UL2-adapted Gemma 3) |
| Video tokenizer | Wan2.1 VAE (8Γ—8 spatial, 4Γ— temporal compression), 2Γ—2Γ—1 patchify |
| Backbone | 12 dual-stream + 16 single-stream + 8 DDT decoder layers |
| Hidden dim / heads | 1536 / 12 heads Γ— 128 |
| Normalization | QK-normalization throughout |
| Position encoding | RoPE |
| Cross-attention | Shared Cross-Attention in the single-stream stage |
| Objective | Rectified flow matching (velocity prediction) |
| I2V conditioning | First-frame latent + SigLIP image embeddings, with timestep-aware blur |

A high-level walkthrough of the role separation:

  1. Dual-stream stage (12 layers). Text and video tokens are processed through separate self-attention pathways, exchanging information via cross-attention. This prevents premature feature entanglement before either modality has formed coherent representations.
  2. Single-stream stage (16 layers). Text and video tokens attend freely in a joint sequence. Shared Cross-Attention is attached here to repair the text-attention dilution that emerges as the video token sequence grows.
  3. DDT decoder (8 layers). A dedicated velocity decoder atop the 28-layer encoder, freeing the encoder from high-frequency detail reconstruction. Per-block attention analysis shows that the DDT decoder develops inter-frame attention structure that single-stream layers do not.
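Stage 3's output is a velocity prediction under rectified flow matching. The objective can be sketched generically (standard rectified flow, not our training loop):

```python
import torch

def rectified_flow_loss(model, x0, t):
    """Standard rectified flow matching: move along the straight path from
    data x0 to noise x1 and regress the constant velocity v = x1 - x0."""
    x1 = torch.randn_like(x0)                   # noise endpoint
    t_ = t.view(-1, *([1] * (x0.ndim - 1)))     # broadcast t over token dims
    xt = (1 - t_) * x0 + t_ * x1                # linear interpolation at time t
    v_pred = model(xt, t)
    return torch.mean((v_pred - (x1 - x0)) ** 2)

# Dummy model for illustration; shapes follow (batch, tokens, hidden).
x0 = torch.randn(4, 64, 1536)
t = torch.rand(4)
loss = rectified_flow_loss(lambda xt, t: torch.zeros_like(xt), x0, t)
```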

For the full derivation of why Shared Cross-Attention shares K/V but not Q, and why this is necessary in addition to standard zero-init of W_O, see Section 3.3 of the technical report.
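Pending the report, the K/V-sharing mechanism can be approximated in a few lines. This is an illustrative sketch only: the module names, the fused KV projection, and the way the tie to the self-attention weights is expressed are assumptions; the exact residual formulation is the one derived in Section 3.3.

```python
import torch
import torch.nn as nn

class SharedCrossAttention(nn.Module):
    """Illustrative sketch: cross-attention that reuses the self-attention
    K/V projection (so text keys/values live in the self-attention manifold),
    with its own query projection and a zero-initialized output projection."""

    def __init__(self, dim: int, num_heads: int, self_attn_kv: nn.Linear):
        super().__init__()
        self.num_heads = num_heads
        self.q = nn.Linear(dim, dim)      # cross-attention keeps its own Q
        self.kv = self_attn_kv            # K/V weights shared with self-attention
        self.out = nn.Linear(dim, dim)
        nn.init.zeros_(self.out.weight)   # zero-init W_O: a no-op at init
        nn.init.zeros_(self.out.bias)

    def forward(self, video_tokens, text_tokens):
        B, N, D = video_tokens.shape
        H = self.num_heads
        q = self.q(video_tokens).view(B, N, H, D // H).transpose(1, 2)
        kv = self.kv(text_tokens).view(B, text_tokens.shape[1], 2, H, D // H)
        k, v = kv.unbind(2)
        k, v = k.transpose(1, 2), v.transpose(1, 2)
        attn = torch.nn.functional.scaled_dot_product_attention(q, k, v)
        attn = attn.transpose(1, 2).reshape(B, N, D)
        return video_tokens + self.out(attn)  # residual branch

dim, heads = 1536, 12                     # matches the 1536-dim / 12-head config
shared_kv = nn.Linear(dim, 2 * dim)       # stand-in for the self-attention KV proj
sca = SharedCrossAttention(dim, heads, shared_kv)
video = torch.randn(1, 32, dim)
text = torch.randn(1, 8, dim)
out = sca(video, text)
```

Because W_O is zero-initialized, the module starts as an exact identity on the video tokens and only gradually learns to inject text information.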


πŸš€ Quickstart / Usage

Requirements

  • Python 3.10+
  • CUDA-capable GPU with 24GB+ VRAM (e.g., A100, H100, RTX 4090)
pip install "diffusers>=0.35.2" "transformers>=5.0.0" torch accelerate ftfy einops sentencepiece regex Pillow

Text-to-Video (T2V)

import torch
from diffusers import AdaptiveProjectedGuidance, DiffusionPipeline
from diffusers.utils import export_to_video

# Adaptive Projected Guidance (APG) counteracts the oversaturation that
# plain CFG exhibits at high guidance scales.
guider = AdaptiveProjectedGuidance(
    guidance_scale=8.0,
    adaptive_projected_guidance_rescale=12.0,
    adaptive_projected_guidance_momentum=0.1,
    use_original_formulation=True,
)

pipe = DiffusionPipeline.from_pretrained(
    "Motif-Technologies/Motif-Video-2B",
    custom_pipeline="pipeline_motif_video",
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
    guider=guider,
)
pipe = pipe.to("cuda")

output = pipe(
    prompt="A category-five hurricane, viewed from inside the eye, reveals a circular stadium of cloud walls rising to fifty thousand feet with an eerie disk of blue sky directly overhead. Shot from a NOAA reconnaissance aircraft mounted camera, the perspective looks outward toward the eyewall β€” a near-vertical curtain of rotating cloud and lightning that is simultaneously terrifying and transcendent. The inner surface of the eyewall catches the setting sun, painting it in improbable shades of peach and rose. The camera slowly pans 360 degrees to complete one full revolution, capturing the entire coliseum of the storm. Below, the ocean surface is a white blur of foam and spray. The documentary-style cinematography strips away all artifice to present the storm as an entity of pure elemental power.",
    height=736,
    width=1280,
    num_frames=121,
    num_inference_steps=50,
)

export_to_video(output.frames[0], "output.mp4", fps=24)

Image-to-Video (I2V)

import torch
from diffusers import AdaptiveProjectedGuidance, DiffusionPipeline
from diffusers.utils import export_to_video, load_image

guider = AdaptiveProjectedGuidance(
    guidance_scale=8.0,
    adaptive_projected_guidance_rescale=12.0,
    adaptive_projected_guidance_momentum=0.1,
    use_original_formulation=True,
)

pipe = DiffusionPipeline.from_pretrained(
    "Motif-Technologies/Motif-Video-2B",
    custom_pipeline="pipeline_motif_video",
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
    guider=guider,
)
pipe = pipe.to("cuda")

image = load_image("https://huggingface.co/Motif-Technologies/Motif-Video-2B/resolve/main/assets/i2v_sample.jpg")

output = pipe(
    prompt="Three friends stride through a sun-bleached meadow as a warm breeze ripples the tall dry grass around their legs. The woman on the left turns her head to share a quiet laugh, the woman in the center pushes a loose curl behind her ear, and the man on the right tilts his face toward the sky. The camera drifts gently alongside them at walking pace, handheld, with soft overcast light.",
    image=image,
    height=736,
    width=1280,
    num_frames=121,
    num_inference_steps=50,
)

export_to_video(output.frames[0], "output.mp4", fps=24)

CLI Inference

# Text-to-Video
python inference.py \
  --prompt "A time-lapse of a flower blooming in a dark room, dramatic lighting" \
  --output t2v_output.mp4

# Image-to-Video
python inference.py \
  --image assets/i2v_sample.jpg \
  --prompt "Three friends stride through a meadow as a warm breeze ripples the tall grass" \
  --output i2v_output.mp4

See inference.py for all available options (--help).

Recommended Settings

| Parameter | Default | Notes |
|---|---|---|
| Resolution | 1280Γ—736 | 720p, best quality |
| Frames | 121 | ~5 seconds at 24 fps |
| Guidance scale | 8.0 | |
| Scheduler shift | 15.0 | Pre-configured in scheduler config |
| Inference steps | 50 | |
| dtype | bfloat16 | Recommended for H100/A100 |
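The scheduler shift of 15.0 follows the timestep-shifting rule common to flow-matching schedulers, which concentrates the 50 inference steps at high noise levels for high-resolution generation. A sketch of the standard shifting formula (generic; the pipeline's scheduler config applies this for you):

```python
import numpy as np

def shift_sigmas(sigmas: np.ndarray, shift: float = 15.0) -> np.ndarray:
    """Standard flow-matching timestep shift: larger `shift` pushes the
    sigma schedule toward high noise levels, where large-scale structure
    is decided, at the cost of fewer fine-detail steps near sigma = 0."""
    return shift * sigmas / (1.0 + (shift - 1.0) * sigmas)

sigmas = np.linspace(1.0, 0.0, 6)         # toy 6-step schedule
shifted = shift_sigmas(sigmas, shift=15.0)
```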

πŸ“Š Performance

VBench

Motif-Video 2B achieves the highest Total Score among open-source models we evaluate.

| Model | Params | Total | Quality | Semantic |
|---|---|---|---|---|
| Wan2.2-T2V (prompt-opt.) | A14B | 84.23 | 85.42 | 79.50 |
| Motif-Video 2B (Ours) | 2B | 83.76 | 84.59 | 80.44 |
| SANA-Video | 2B | 83.71 | 84.35 | 81.35 |
| Wan2.1-T2V | 14B | 83.69 | 85.59 | 76.11 |
| OpenSora 2.0 (T2I2V) | 11B | 83.60 | 84.40 | 80.30 |
| Wan2.1-T2V | 1.3B | 83.31 | 85.23 | 75.65 |
| HunyuanVideo | 13B | 83.24 | 85.09 | 75.82 |
| CogVideoX1.5-5B (prompt-opt.) | 5B | 82.17 | 82.78 | 79.76 |
| Step-Video-T2V | 30B | 81.83 | 84.46 | 71.28 |
| LTX-Video | 2B | 80.00 | 82.30 | 70.79 |

Notable per-dimension highlights for Motif-Video 2B (open-source):

  • Spatial Relationship: 83.02% β€” best among open-source models
  • Semantic Score: 80.44% β€” highest among open-source models reporting per-dimension results
  • Object Class: 92.93%, Multiple Objects: 77.29%, Imaging Quality: 70.50% β€” second-best in their categories

The full 16-dimension breakdown is in Table 3 of the technical report.

A note on VBench vs. perceptual quality. Motif-Video 2B leads on VBench Total Score, but in our internal side-by-side comparisons against Wan2.1-T2V-14B we observe a perceptual gap in favor of the larger model on temporal stability and fine human anatomy. We discuss the sources of this gap (uniform dimension weighting, near-correct semantic credit) in Section 7 of the report. We report the gap explicitly rather than smoothing it over.

Human evaluation

In a blind pairwise study against six contemporaneous open-source baselines (SANA-Video, LTX-Video 2, Wan2.1-14B, Wan2.1-1.3B, Wan2.2-5B, CogVideoX-5B) on 40 LLM-generated prompts, Motif-Video 2B is preferred over both SANA-Video (similar parameter count) and Wan2.1-1.3B (similar parameter count, larger training corpus) on prompt-following and video-fidelity axes. Wan2.1-14B remains the preferred model overall, consistent with its 7Γ— larger parameter count and substantially larger training data.


🎬 Showcase

Text-to-Video

Motif-Video 2B T2V samples

Image-to-Video

Motif-Video 2B I2V samples


⚠️ Limitations

We report limitations as the boundary conditions under which the design decisions in this report should be interpreted, not as caveats.

  • Micro-scale semantic distortion. Motif-Video 2B occasionally produces sub-object-level artifacts that leave the category label intact but break perceptual plausibility β€” distorted hands on close-up human subjects, degraded body structure under high-displacement motion, and attribute leakage between visually similar co-present subjects. We attribute these primarily to data coverage rather than backbone design.
  • Temporal failures. Three distinct modes that frame-level metrics do not surface: (i) physically implausible liquid / cloth / collision dynamics, (ii) coherence loss under high scene complexity (multi-agent crowds), and (iii) unintended mid-clip scene transitions in long sequences.
  • Recipe components are evaluated jointly, not in isolation. We do not present per-component ablations for Shared Cross-Attention, the DDT decoder, REPA phasing, or TREAD routing at full scale. Readers should interpret our results as evidence that the composed recipe works at 2B, not as a marginal-contribution claim about any single component.

We view temporal stability and data coverage, not architectural depth, as the primary remaining ceilings on this model. Both are natural axes for a future iteration that the current architecture is built to absorb.


πŸ“š Citation

If you find Motif-Video 2B useful in your research, please cite:

@techreport{motifvideo2b2026,
  title  = {Motif-Video 2B: Technical Report},
  author = {Motif Technologies},
  year   = {2026},
  institution = {Motif Technologies},
  url    = {}
}

πŸ™ Acknowledgements

We build on a number of excellent open-source projects, including the Wan2.1 VAE [Wan Team, 2025], T5Gemma / Gemma 3 [Google], TREAD [Krause et al., 2025], REPA with the V-JEPA family of visual encoders [Bardes et al.], DDT [Wang et al.], and the broader diffusers and Accelerate ecosystems. Compute was provisioned on Microsoft Azure and orchestrated with SkyPilot on Kubernetes.


πŸ“„ License

This model is released under the Apache 2.0 License. See LICENSE for details.
