📂 Part of the Lance MLX collection on mlx-community.

Lance-3B-Video-bf16 (MLX, video specialist)

MLX port of ByteDance Intelligent Creation Lab's Lance — the video-specialist Lance_3B_Video checkpoint, converted to bf16 for Apple Silicon. ~6.44 B LLM parameters + 669 M Qwen2.5-VL ViT bundled, with the 126,976-entry latent_pos_embed table needed for video-scale latent grids.

Lance is ByteDance's 3B-active unified multimodal model (paper, code, HF original). This is not Lance/LanceDB, the columnar data format.

Status — 🟢 t2v in production after Phase 5j position-ID fix (2026-05-21)

The Phase 5j watercolor fix shipped on 2026-05-21. The root cause was a port-side bug in _build_position_ids: the latent block's mrope (t, h, w) grid was anchored at base = text_len_before_latents. With our verbose chat template, the latent positions therefore shifted with prompt length and fell outside Qwen2.5-VL's training distribution, where visual tokens are positioned in grid-origin coordinates rather than offset by the preceding text positions. The drift smeared high-frequency detail into a painterly, watercolor-like look. The fix anchors the latent grid at base = 0 regardless of prompt length, and TextToVideoPipeline.generate now defaults to latent_pos_base=0.
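To make the anchoring concrete, here is a minimal NumPy sketch of the two choices of base. It is an illustration only, not the port's _build_position_ids: the helper name latent_grid_position_ids, the example prompt length of 120 tokens, and the 5×16×16 grid (what 256² × 17f would give under the 16× spatial / 4× temporal compression listed under Architecture) are all assumptions.

```python
import numpy as np

def latent_grid_position_ids(t, h, w, base=0):
    """Hypothetical helper (not the port's code): (3, t*h*w) mrope ids.

    Row 0 is the temporal index, row 1 the height index, row 2 the width
    index; `base` is added to all three axes.
    """
    tt, hh, ww = np.meshgrid(
        np.arange(t), np.arange(h), np.arange(w), indexing="ij"
    )
    return np.stack([tt.ravel(), hh.ravel(), ww.ravel()]) + base

text_len = 120  # assumed prompt length after the chat template

legacy = latent_grid_position_ids(5, 16, 16, base=text_len)  # drifts with prompt
fixed = latent_grid_position_ids(5, 16, 16, base=0)          # grid-origin coords

print(legacy.min(), legacy.max())  # range shifts with text_len
print(fixed.min(), fixed.max())    # range independent of the prompt
```

In this sketch the legacy ids differ from the fixed ids by exactly text_len on every axis, so a longer prompt pushes the whole latent grid further from the origin.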

Phase 5j A/B at 256²×17f red-panda-surfing oracle (seed=42, 30 steps, CFG=4.0): legacy (base=text_len) → watercolor; fix (base=0) → photoreal. Scale-confirmed at 480×704×17f: CGI-quality red panda holding a yellow surfboard horizontally, water spray + atmospheric clouds, correct composition.

This closes a seven-phase investigation (4b/4c, 5d, 5e research engagement, 5f RockTalk-weights triangulation, 5g/5h refuted candidates, 5i bisect, 5j fix) tracked in GitHub issue #2 (now closed). Full root-cause analysis: notes/phase5j_THE_FIX.md.

Capability	Status	Notes
t2v at 256² × 17f	🟢 Photoreal	At lower resolutions, subject composition may simplify (surfboard orientation can vary)
t2v at 480×704 × 17f (n_lat = 6,600)	🟢 CGI-quality	Cap, surfboard horizontal, water spray, atmospheric clouds — production-ready
t2v at 512² × 17f	🟢 Photoreal	Similar profile
t2v at 768² × 13f (n_lat = 9,216)	🟢 Photoreal
t2v at 768² × ≥17f (n_lat ≥ 11,520)	🟡 Partial degradation	Tracked in issue #1 — separate bug class (n_lat ceiling), NOT the watercolor
t2v at 768² × 50f (n_lat = 29,952)	⚠️ Pure-noise output at this scale	Same issue #1 territory; the position-ID fix doesn't address it
x2t_video (video VQA / captioning)	✅ Validated	Against the Phase 0 oracle. Unaffected by the t2v bug: ViT + UND-tower path only
video_edit (instruction-based)	🟢 Same envelope as t2v after the fix

Production-ready for t2v up to n_lat ≈ 9,216 (256² and 512² at 17f, 480×704 at 17f, 768² at 13f). Use the demo script at scripts/10_t2v_demo.py for a one-command path.
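The n_lat values in the table can be reproduced from the VAE compression factors listed under Architecture (16× spatial, 4× temporal). The sketch below assumes one latent token per spatial latent cell and a causal temporal layout in which the first frame gets its own latent frame, so T frames map to (T - 1) // 4 + 1 latent frames. The function name is hypothetical, and the round-down behaviour for frame counts not of the form 4k + 1 is a guess.

```python
def n_latent_tokens(height, width, num_frames, spatial=16, temporal=4):
    """Hypothetical helper: latent-token count for an H x W x T request."""
    latent_frames = (num_frames - 1) // temporal + 1
    return (height // spatial) * (width // spatial) * latent_frames

for h, w, t in [(480, 704, 17), (768, 768, 13), (768, 768, 17)]:
    print(f"{h}x{w} x {t}f -> n_lat = {n_latent_tokens(h, w, t):,}")
```

Under these assumptions the function returns 6,600, 9,216 and 11,520 for the three configurations, matching the table, which makes it a quick way to check whether a new resolution and frame count stays under the validated ceiling.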

For production-quality image tasks (t2i, image_edit, x2t_image), use mlx-community/Lance-3B-bf16. (The 8-bit variant on HF is known-broken as of 2026-05-22 pending DWQ-calibrated re-quantization.)

Sample generations — full Phase 0 oracle pass (post-fix)

8 oracle prompts (red panda surfing, rabbit on skateboard, panda cub on ramp, woman at pottery wheel, panda vs robot boxing, tropical sunset, woman at piano, horse in cloud valley) generated at 768² × 13 frames, 30 steps, CFG=4.0, seed=43 through the Phase 5j-fixed pipeline. Side-by-side midframes vs the PyTorch reference oracle:

Every prompt produces production-tier CGI/photoreal output. Composition varies by prompt (different camera angles, prop interpretations) but no watercolor / smearing / degenerate failure modes.

Standout single-prompt comparisons

Red panda surfing (the prompt that originally exposed the watercolor bug): now photoreal, with clearly defined cap, fur, surfboard, and water spray.
Woman at grand piano: the strongest direct-parity match in the set, with matched composition, lighting, and marble-hall setting, and hands on the keys with anatomically correct fingers.
Panda vs robot boxing match: the MLX port resolves the panda and the robot as distinct subjects, where the PyTorch oracle collapsed them into two pandas (a classic distinct-subject failure mode for text-to-video).

480×704 production-scale midframe

Why a separate "Video" checkpoint?

ByteDance ships two variants of Lance that differ in fine-tuning:

Lance_3B — image specialist. Crystal-clear photorealistic t2i.
Lance_3B_Video — video specialist. Same architecture, further fine-tuned on video data. Bundles the Qwen2.5-VL ViT (669 M) and the larger 126,976-entry latent_pos_embed table that addresses video-resolution token grids.

Quickstart

Install from the lance-mlx source repo:

git clone https://github.com/xocialize/lance-mlx
cd lance-mlx && uv sync

Download this checkpoint:

from huggingface_hub import snapshot_download
weights = snapshot_download("mlx-community/Lance-3B-Video-bf16")

Text-to-video

from lance_mlx.pipeline.t2v import TextToVideoPipeline

pipe = TextToVideoPipeline.from_pretrained(
    lance_weights_dir=weights,
    vae_safetensors=f"{weights}/vae.safetensors",
)
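frames = pipe.generate(
    "Five balls on a wooden table: two blue, three green.",
    # 768² × 13f (n_lat = 9,216) is the largest validated config;
    # 768² at 17+ frames degrades (issue #1) and takes ~20 min.
    num_frames=13, height=768, width=768,
    num_steps=30, cfg_scale=4.0, seed=42,
)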
# frames is np.ndarray of shape (T_decoded, H, W, 3) uint8

Encode to MP4 with imageio (needs the ffmpeg plugin, e.g. pip install "imageio[ffmpeg]"):

import imageio
with imageio.get_writer("out.mp4", fps=12, codec="libx264") as writer:
    for f in frames:
        writer.append_data(f)

Video understanding

from lance_mlx.pipeline.understanding import UnderstandingPipeline

pipe = UnderstandingPipeline.from_pretrained(
    lance_weights_dir=weights,
    vit_safetensors=f"{weights}/vit.safetensors",
)
answer = pipe.generate_video(
    video="my_video.mp4",
    question="Describe what happens in this video.",
    num_sample_frames=16, target_h=224, target_w=224,
    max_new_tokens=256, prompt_style="lance",
)
print(answer)

Validated content-correct against the Phase 0 oracle's cooking VQA case (kitchen + pan + spatula + tomato + meat + stirring matched).

Video editing

from lance_mlx.pipeline.video_edit import VideoEditPipeline

pipe = VideoEditPipeline.from_pretrained(
    lance_weights_dir=weights,
    vae_safetensors=f"{weights}/vae.safetensors",
)
frames = pipe.generate(
    input_video="my_video.mp4",
    instruction="Change all the balls to a deep red color.",
    height=256, width=256, num_frames=17,
    num_steps=30, cfg_scale=4.0, seed=42,
)

Performance (M5 Max 128 GB)

Task	Configuration	Wall-clock
t2v	256² × 16f, 30 steps, CFG=4.0	~33 s
t2v	512² × 16f, 30 steps, CFG=4.0	~60 s
t2v	768² × 13f, 30 steps, CFG=4.0	~145 s
t2v	768² × 17f, 30 steps, CFG=4.0	~20 min
t2v	768² × 49f, 30 steps, CFG=4.0	~2¼ hours (impractical)

CFG doubles the forward cost because the conditional and unconditional passes run sequentially. Attention scales as O(N²) in the latent-token count, so combinations of high frame count and high resolution quickly become impractical. A KV cache for the text prefix is a Phase 5 follow-up.
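The doubled cost follows from how classifier-free guidance combines two predictions per step. A minimal sketch, assuming the standard CFG formulation (the exact form used by the pipeline is not spelled out here). velocity_fn stands in for one forward pass of the generation tower and is hypothetical, as is the toy model used to exercise it.

```python
import numpy as np

def cfg_velocity(velocity_fn, latents, t, cond, uncond, cfg_scale=4.0):
    """Standard classifier-free guidance: two forward passes per step."""
    v_cond = velocity_fn(latents, t, cond)      # pass 1: with the prompt
    v_uncond = velocity_fn(latents, t, uncond)  # pass 2: without the prompt
    return v_uncond + cfg_scale * (v_cond - v_uncond)

# Toy stand-in for the model so the sketch runs; it counts forward passes.
calls = {"n": 0}

def toy_velocity(latents, t, context):
    calls["n"] += 1
    return latents * 0.0 + context

x = np.zeros((4, 8), dtype=np.float32)
v = cfg_velocity(toy_velocity, x, t=0.5, cond=1.0, uncond=0.0)
print(calls["n"])  # number of forward passes used for one guided step
```

Because both passes see the same latents and differ only in conditioning, anything that depends solely on the conditioning is a natural candidate for caching.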

Files in this repo

File	Size	Purpose
`model.safetensors`	12.87 GB	LLM weights (1021 tensors, both UND + GEN towers, with 126,976-entry latent_pos_embed)
`vit.safetensors`	1.34 GB	Qwen2.5-VL ViT (semantic encoder for x2t_video)
`vae.safetensors`	1.41 GB	Lance's bundled Wan2.2 VAE (also available standalone as `mlx-community/Wan2.2-VAE-Lance-bf16`)
`config.json`	–	`Qwen2_5_VLForConditionalGeneration` config
`conversion_report.json`	–	Provenance
`tokenizer.json` / `vocab.json`	–	Qwen2.5-VL vocabulary
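The two largest file sizes are consistent with bf16 storage at 2 bytes per parameter. A quick arithmetic check, using the parameter counts quoted under Provenance and assuming decimal gigabytes and negligible header overhead (the helper name is hypothetical):

```python
BYTES_PER_BF16 = 2

def bf16_size_gb(num_params):
    """Approximate bf16 checkpoint size in decimal GB (ignores file headers)."""
    return num_params * BYTES_PER_BF16 / 1e9

print(f"LLM: {bf16_size_gb(6.437e9):.2f} GB")
print(f"ViT: {bf16_size_gb(0.669e9):.2f} GB")
```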

Provenance

Source: bytedance-research/Lance/Lance_3B_Video/model.safetensors (1411 tensors including bundled ViT; 6.437 B LLM + 0.669 B ViT params). Converted via scripts/02_convert.py. The bundled ViT is extracted to a sibling vit.safetensors with the vit_model. prefix stripped, matching the layout convention of the image-specialist repo.
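The prefix stripping described above amounts to splitting a flat tensor dictionary by key. A minimal sketch over plain Python dicts follows. It is not the logic of scripts/02_convert.py, and the function name and example keys are hypothetical (only the vit_model. prefix comes from the text).

```python
def split_by_prefix(tensors, prefix="vit_model."):
    """Hypothetical helper: separate prefixed entries and strip the prefix."""
    extracted, remaining = {}, {}
    for key, value in tensors.items():
        if key.startswith(prefix):
            extracted[key[len(prefix):]] = value
        else:
            remaining[key] = value
    return extracted, remaining

# Placeholder keys and values, for illustration only.
checkpoint = {
    "vit_model.blocks.0.attn.qkv.weight": "tensor-a",
    "layers.0.mlp.up_proj.weight": "tensor-b",
}
vit, llm = split_by_prefix(checkpoint)
print(sorted(vit), sorted(llm))
```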

Tips

Use concrete-subject prompts. "Five red apples in a bowl" works better than "the joy of friendship in motion." The model can render abstract scenes, but the painterly aesthetic on already-abstract subjects can read as overly abstract.
Smaller scales iterate faster. 256² × 16 frames is the fastest test config (~33 s); good for prompt iteration. Scale up once you find a prompt you like.
English + Chinese prompts work. Other languages are out of distribution (Qwen2.5-VL was trained primarily on en + zh).

Limitations

bf16 only. 4-bit + 8-bit quantization in progress (Phase 5b). Naive INT4 has been observed to degrade the GEN expert per Reza2kn/lance-quant's findings; quantization needs per-tower calibration.
No streaming or batched generation.
CFG doubles forward cost. A future KV-cache for the text + clean-ref prefix would save ~30% per step.

Architecture (shared with the image specialist)

Two expert towers (LLM_UND, LLM_GEN), each initialized from Qwen2.5-VL-3B-Instruct, with per-expert FFN, output projection, and QK-norm.
Modality-deterministic routing: text + Qwen2.5-VL ViT semantic tokens → LLM_UND (autoregressive); Wan2.2 VAE latent tokens → LLM_GEN (flow-matching velocity prediction). No learned gate.
MaPE — modality-aware RoPE with per-modality temporal anchor.
Wan2.2 3D causal VAE (16× spatial / 4× temporal compression, 48-channel latent).
Bidirectional attention within latent block.
Untied LM head.
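The routing rule in the list above can be sketched as a fixed lookup from token modality to tower, with no learned parameters. This is an illustration under assumed names (the modality labels and the route_tokens helper are hypothetical), not the port's implementation.

```python
import numpy as np

# Assumed modality labels, for illustration only.
TEXT, VIT, LATENT = 0, 1, 2
UND, GEN = 0, 1

# Fixed modality -> tower table: text and ViT tokens go to LLM_UND,
# VAE latent tokens go to LLM_GEN. Nothing here is learned.
ROUTE = np.array([UND, UND, GEN])

def route_tokens(modality_ids):
    """Hypothetical helper: per-token tower index from modality ids."""
    return ROUTE[np.asarray(modality_ids)]

sequence = [TEXT, TEXT, VIT, VIT, LATENT, LATENT, LATENT]
towers = route_tokens(sequence)
und_mask = towers == UND  # tokens handled by the understanding tower
gen_mask = towers == GEN  # tokens handled by the generation tower
print(towers.tolist())
```

Since the assignment depends only on the modality of each token, the masks can be computed once per sequence and reused at every layer.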

License

This MLX port: Apache 2.0.

Underlying weights:

Lance: Apache 2.0 (ByteDance Intelligent Creation Lab).
Wan2.2 VAE: Apache 2.0 (Alibaba).
Qwen2.5-VL: Apache 2.0 (Alibaba).

See NOTICE for attribution.

Citation

@article{fu2026lance,
  title={Lance: Unified Multimodal Modeling by Multi-Task Synergy},
  author={Fu, Fengyi and Huang, Mengqi and Wu, Shaojin and others},
  journal={arXiv preprint arXiv:2605.18678},
  year={2026}
}
