📂 Part of the Lance MLX collection on mlx-community.
Lance-3B-Video-bf16 (MLX, video specialist)
MLX port of ByteDance Intelligent Creation Lab's Lance — the video-specialist Lance_3B_Video checkpoint, converted to bf16 for Apple Silicon. ~6.44 B LLM parameters + 669 M Qwen2.5-VL ViT bundled, with the 126,976-entry latent_pos_embed table needed for video-scale latent grids.
Lance is ByteDance's 3B-active unified multimodal model (paper, code, HF original). This is not Lance/LanceDB, the columnar data format.
Status — 🟢 t2v in production after Phase 5j position-ID fix (2026-05-21)
Phase 5j watercolor fix shipped 2026-05-21. Root cause was a port-side bug in _build_position_ids: the latent block's mrope (t, h, w) grid was anchored to base = text_len_before_latents, so with our verbose chat template the latent positions drifted with prompt length out of Qwen2.5-VL's training distribution (visual tokens train against grid-ORIGIN coords, not concatenated with text positions). The drift smeared high-frequency detail into a painterly/watercolor aesthetic. Fix: anchor the latent grid at base = 0 regardless of prompt length. Default for TextToVideoPipeline.generate is now latent_pos_base=0.
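The two anchoring schemes can be contrasted with a small NumPy sketch. This is an illustrative reconstruction, not the port's actual `_build_position_ids`: the helper name `latent_position_ids`, its signature, the `(3, N)` layout, and the choice to shift all three axes by `base` are assumptions made for the example.

```python
import numpy as np

def latent_position_ids(t_lat, h_lat, w_lat, base=0):
    """Return a (3, t_lat*h_lat*w_lat) array of (t, h, w) position IDs.

    Hypothetical helper: `base` shifts every axis of the latent grid.
    base=0 anchors the grid at the origin (the fixed behavior);
    base=text_len mimics the legacy, prompt-length-dependent layout.
    """
    t, h, w = np.meshgrid(
        np.arange(t_lat), np.arange(h_lat), np.arange(w_lat), indexing="ij"
    )
    grid = np.stack([t.ravel(), h.ravel(), w.ravel()], axis=0)
    return grid + base

# 256x256x17f: assumed to give 5 latent frames of 16x16 tokens.
fixed = latent_position_ids(5, 16, 16, base=0)
legacy_short = latent_position_ids(5, 16, 16, base=40)    # short prompt (made-up length)
legacy_long = latent_position_ids(5, 16, 16, base=400)    # verbose template (made-up length)

print(fixed.max(axis=1))         # per-axis maxima, independent of the prompt
print(legacy_short.max(axis=1))  # shifted by 40
print(legacy_long.max(axis=1))   # shifted by 400
```

With `base=0` the latent coordinates are identical for every prompt, whereas the legacy layout moves the whole grid by however many text tokens precede it.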
Phase 5j A/B at 256²×17f red-panda-surfing oracle (seed=42, 30 steps, CFG=4.0): legacy (base=text_len) → watercolor; fix (base=0) → photoreal. Scale-confirmed at 480×704×17f: CGI-quality red panda holding a yellow surfboard horizontally, water spray + atmospheric clouds, correct composition.
This closes a seven-phase investigation (4b/4c, 5d, 5e research engagement, 5f RockTalk-weights triangulation, 5g/5h refuted candidates, 5i bisect, 5j fix) tracked in GitHub issue #2 (now closed). Full root-cause analysis: notes/phase5j_THE_FIX.md.
| Capability | Status | Notes |
|---|---|---|
| t2v at 256² × 17f | 🟢 Photoreal | At lower resolutions, subject composition may simplify (surfboard orientation can vary) |
| t2v at 480×704 × 17f (n_lat = 6,600) | 🟢 CGI-quality | Cap, surfboard horizontal, water spray, atmospheric clouds — production-ready |
| t2v at 512² × 17f | 🟢 Photoreal | Similar profile |
| t2v at 768² × 13f (n_lat = 9,216) | 🟢 Photoreal | |
| t2v at 768² × ≥17f (n_lat ≥ 11,520) | 🟡 Partial degradation | Tracked in issue #1 — separate bug class (n_lat ceiling), NOT the watercolor |
| t2v at 768² × 49f (n_lat = 29,952) | ⚠️ Pure-noise output at this scale | Same issue #1 territory; the position-ID fix doesn't address it |
| x2t_video (video VQA / captioning) | ✅ Validated | Validated against the Phase 0 oracle. Unaffected by the t2v bug (ViT + UND-tower path only) |
| video_edit (instruction-based) | 🟢 Same envelope as t2v after the fix | |
Production-ready for t2v up to n_lat ≈ 9,216 (256²–768²×13f, 480×704×17f). Use the demo script at scripts/10_t2v_demo.py for a one-command path.
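To check whether a planned configuration stays under that ceiling, the latent-token count can be estimated up front. The sketch below is not part of lance-mlx: the helper `n_latent_tokens` is hypothetical, and its formula (16× spatial compression, and one latent frame for the first frame plus one per four further frames) is inferred from the n_lat values quoted in the table rather than taken from the pipeline code.

```python
def n_latent_tokens(height, width, num_frames, spatial=16, temporal=4):
    """Estimate the latent-token count for a clip.

    Assumes a causal VAE that keeps the first frame and compresses every
    further `temporal` frames into one latent frame.
    """
    if height % spatial or width % spatial:
        raise ValueError("height and width must be multiples of the spatial factor")
    latent_frames = (num_frames - 1) // temporal + 1
    return (height // spatial) * (width // spatial) * latent_frames

CEILING = 9_216  # largest n_lat reported as production-ready above

for h, w, f in [(480, 704, 17), (768, 768, 13), (768, 768, 17), (768, 768, 49)]:
    n = n_latent_tokens(h, w, f)
    verdict = "ok" if n <= CEILING else "above ceiling"
    print(f"{h}x{w}x{f}f -> n_lat = {n:,} ({verdict})")
```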
For production-quality image tasks (t2i, image_edit, x2t_image), use mlx-community/Lance-3B-bf16. (The 8-bit variant on HF is known-broken as of 2026-05-22 pending DWQ-calibrated re-quantization.)
Sample generations — full Phase 0 oracle pass (post-fix)
8 oracle prompts (red panda surfing, rabbit on skateboard, panda cub on ramp, woman at pottery wheel, panda vs robot boxing, tropical sunset, woman at piano, horse in cloud valley) generated at 768² × 13 frames, 30 steps, CFG=4.0, seed=43 through the Phase 5j-fixed pipeline. Side-by-side midframes vs the PyTorch reference oracle:
Every prompt produces production-tier CGI/photoreal output. Composition varies by prompt (different camera angles, prop interpretations) but no watercolor / smearing / degenerate failure modes.
Standout single-prompt comparisons
Red panda surfing (the prompt that originally exposed the watercolor bug) — now photoreal with clearly defined cap, fur, surfboard, water spray:
Woman at grand piano — the strongest direct-parity match in the set; matched composition, lighting, marble-hall setting, hands on keys with anatomically correct fingers:
Panda vs robot boxing match — interesting case: the MLX port correctly resolves panda + robot as distinct subjects where the PyTorch oracle collapsed them into two pandas (a classic distinct-subject failure mode for text-to-video):
480×704 production-scale midframe
Why a separate "Video" checkpoint?
ByteDance ships two variants of Lance that differ in fine-tuning:
- Lance_3B: image specialist. Crystal-clear photorealistic t2i.
- Lance_3B_Video: video specialist. Same architecture, further fine-tuned on video data. Bundles the Qwen2.5-VL ViT (669 M) and the larger 126,976-entry latent_pos_embed table that addresses video-resolution token grids.
Quickstart
Install from the lance-mlx source repo:
git clone https://github.com/xocialize/lance-mlx
cd lance-mlx && uv sync
Download this checkpoint:
from huggingface_hub import snapshot_download
weights = snapshot_download("mlx-community/Lance-3B-Video-bf16")
Text-to-video
from lance_mlx.pipeline.t2v import TextToVideoPipeline
pipe = TextToVideoPipeline.from_pretrained(
    lance_weights_dir=weights,
    vae_safetensors=f"{weights}/vae.safetensors",
)
frames = pipe.generate(
    "Five balls on a wooden table: two blue, three green.",
    num_frames=17, height=768, width=768,
    num_steps=30, cfg_scale=4.0, seed=42,
)
# frames is np.ndarray of shape (T_decoded, H, W, 3) uint8
Encode to MP4 with imageio:
import imageio
with imageio.get_writer("out.mp4", fps=12, codec="libx264") as writer:
    for f in frames:
        writer.append_data(f)
Video understanding
from lance_mlx.pipeline.understanding import UnderstandingPipeline
pipe = UnderstandingPipeline.from_pretrained(
    lance_weights_dir=weights,
    vit_safetensors=f"{weights}/vit.safetensors",
)
answer = pipe.generate_video(
    video="my_video.mp4",
    question="Describe what happens in this video.",
    num_sample_frames=16, target_h=224, target_w=224,
    max_new_tokens=256, prompt_style="lance",
)
print(answer)
Validated content-correct against the Phase 0 oracle's cooking VQA case (kitchen + pan + spatula + tomato + meat + stirring matched).
Video editing
from lance_mlx.pipeline.video_edit import VideoEditPipeline
pipe = VideoEditPipeline.from_pretrained(
    lance_weights_dir=weights,
    vae_safetensors=f"{weights}/vae.safetensors",
)
frames = pipe.generate(
    input_video="my_video.mp4",
    instruction="Change all the balls to a deep red color.",
    height=256, width=256, num_frames=17,
    num_steps=30, cfg_scale=4.0, seed=42,
)
Performance (M5 Max 128 GB)
| Task | Configuration | Wall-clock |
|---|---|---|
| t2v | 256² × 16f, 30 steps, CFG=4.0 | ~33 s |
| t2v | 512² × 16f, 30 steps, CFG=4.0 | ~60 s |
| t2v | 768² × 13f, 30 steps, CFG=4.0 | ~145 s |
| t2v | 768² × 17f, 30 steps, CFG=4.0 | ~20 min |
| t2v | 768² × 49f, 30 steps, CFG=4.0 | ~2¼ hours (impractical) |
CFG doubles the forward cost since cond + uncond run sequentially. Attention scales as O(N²) in the latent-token count, so high-frame, high-resolution combinations quickly become impractical. A KV cache for the text prefix is a Phase 5 follow-up.
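The two cost drivers can be made concrete with a toy example. It assumes the standard classifier-free guidance combination, v = v_uncond + s · (v_cond − v_uncond); the source does not state which variant lance-mlx uses, and `cfg_velocity`, `toy_model`, and `attention_cost_ratio` are hypothetical names rather than library functions.

```python
import numpy as np

def cfg_velocity(model, latents, cond, uncond, cfg_scale=4.0):
    """One guided step: two forward passes combined in the standard CFG form."""
    v_cond = model(latents, cond)
    v_uncond = model(latents, uncond)
    return v_uncond + cfg_scale * (v_cond - v_uncond)

stats = {"calls": 0}

def toy_model(latents, context):
    """Stand-in for the GEN tower: counts calls and returns a dummy velocity."""
    stats["calls"] += 1
    return latents * context

x = np.ones((4, 8), dtype=np.float32)
v = cfg_velocity(toy_model, x, cond=2.0, uncond=1.0, cfg_scale=4.0)
print(stats["calls"])  # forward passes used by a single guided step

def attention_cost_ratio(n_a, n_b):
    """Relative attention cost of n_a vs n_b tokens under the O(N^2) assumption."""
    return (n_a / n_b) ** 2

print(attention_cost_ratio(29_952, 9_216))  # 768² × 49f relative to 768² × 13f
```

The measured wall-clock times in the table grow faster than this quadratic ratio, so treat it as a rough guide to attention cost only, not as a runtime prediction.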
Files in this repo
| File | Size | Purpose |
|---|---|---|
| model.safetensors | 12.87 GB | LLM weights (1021 tensors, both UND + GEN towers, with 126,976-entry latent_pos_embed) |
| vit.safetensors | 1.34 GB | Qwen2.5-VL ViT (semantic encoder for x2t_video) |
| vae.safetensors | 1.41 GB | Lance's bundled Wan2.2 VAE (also available standalone as mlx-community/Wan2.2-VAE-Lance-bf16) |
| config.json | – | Qwen2_5_VLForConditionalGeneration config |
| conversion_report.json | – | Provenance |
| tokenizer.json / vocab.json | – | Qwen2.5-VL vocabulary |
Provenance
Source: bytedance-research/Lance/Lance_3B_Video/model.safetensors (1411 tensors including bundled ViT; 6.437 B LLM + 0.669 B ViT params).
Converted via scripts/02_convert.py. The bundled ViT is extracted to a sibling vit.safetensors with the vit_model. prefix stripped, matching the layout convention of the image-specialist repo.
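The prefix-stripping step can be illustrated on a plain dictionary. This is a minimal sketch of the idea, not the contents of scripts/02_convert.py: the helper `split_vit` is hypothetical and the tensor names in the toy checkpoint are made up.

```python
def split_vit(tensors, prefix="vit_model."):
    """Split a name->tensor mapping into (llm, vit), stripping `prefix` from ViT keys."""
    llm, vit = {}, {}
    for name, value in tensors.items():
        if name.startswith(prefix):
            vit[name[len(prefix):]] = value
        else:
            llm[name] = value
    return llm, vit

# Toy checkpoint: strings stand in for tensors, and the names are illustrative only.
checkpoint = {
    "vit_model.blocks.0.attn.qkv.weight": "tensor-a",
    "vit_model.patch_embed.proj.weight": "tensor-b",
    "language_model.layers.0.mlp.up_proj.weight": "tensor-c",
}
llm, vit = split_vit(checkpoint)
print(sorted(vit))  # ViT keys without the prefix
print(sorted(llm))  # everything else, unchanged
```

With real weights, the mapping would presumably come from a safetensors loader and each half would be written back to its own file; that I/O is left out so the example runs without the checkpoint.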
Tips
- Use concrete-subject prompts. "Five red apples in a bowl" works better than "the joy of friendship in motion." The model can render abstract scenes, but the painterly aesthetic on already-abstract subjects can read as overly abstract.
- Smaller scales iterate faster. 256² × 16 frames is the fastest test config (~33 s); good for prompt iteration. Scale up once you find a prompt you like.
- English + Chinese prompts work. Other languages are out of distribution (Qwen2.5-VL was trained primarily on en + zh).
Limitations
- bf16 only. 4-bit + 8-bit quantization in progress (Phase 5b). Naive INT4 has been observed to degrade the GEN expert per Reza2kn/lance-quant's findings; quantization needs per-tower calibration.
- No streaming or batched generation.
- CFG doubles forward cost. A future KV-cache for the text + clean-ref prefix would save ~30% per step.
Architecture (shared with the image specialist)
- Two expert towers (LLM_UND, LLM_GEN), each initialized from Qwen2.5-VL-3B-Instruct, with per-expert FFN, output projection, and QK-norm.
- Modality-deterministic routing: text + Qwen2.5-VL ViT semantic tokens → LLM_UND (autoregressive); Wan2.2 VAE latent tokens → LLM_GEN (flow-matching velocity prediction). No learned gate.
- MaPE: modality-aware RoPE with per-modality temporal anchor.
- Wan2.2 3D causal VAE (16× spatial / 4× temporal compression, 48-channel latent).
- Bidirectional attention within latent block.
- Untied LM head.
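The routing and attention rules in the list above can be sketched in a few lines of NumPy. This is a toy illustration under stated assumptions, not the lance-mlx implementation: the modality tags, both function names, and the choice of a causal mask outside the latent block are hypothetical, and the sequence is assumed to contain a single contiguous latent block.

```python
import numpy as np

TEXT, VIT, LATENT = 0, 1, 2  # hypothetical modality tags

def route_to_tower(modality_ids):
    """Deterministic routing: latent tokens go to GEN, everything else to UND."""
    return np.where(modality_ids == LATENT, "LLM_GEN", "LLM_UND")

def attention_mask(modality_ids):
    """Boolean mask (True = may attend): causal overall, bidirectional among latents."""
    n = len(modality_ids)
    causal = np.tril(np.ones((n, n), dtype=bool))
    is_latent = modality_ids == LATENT
    latent_block = np.outer(is_latent, is_latent)
    return causal | latent_block

seq = np.array([TEXT, TEXT, VIT, LATENT, LATENT, LATENT])
print(route_to_tower(seq))
print(attention_mask(seq).astype(int))
```

Because the tower is chosen from the token's modality alone, no gating network is evaluated, which is what "no learned gate" amounts to in this sketch.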
License
This MLX port: Apache 2.0.
Underlying weights:
- Lance: Apache 2.0 (ByteDance Intelligent Creation Lab).
- Wan2.2 VAE: Apache 2.0 (Alibaba).
- Qwen2.5-VL: Apache 2.0 (Alibaba).
See NOTICE for attribution.
Citation
@article{fu2026lance,
title={Lance: Unified Multimodal Modeling by Multi-Task Synergy},
author={Fu, Fengyi and Huang, Mengqi and Wu, Shaojin and others},
journal={arXiv preprint arXiv:2605.18678},
year={2026}
}
Links
- MLX port code + phase notes: github.com/xocialize/lance-mlx
- Original PyTorch model: bytedance-research/Lance
- Image specialist (production): mlx-community/Lance-3B-bf16
- Wan2.2 VAE (standalone): mlx-community/Wan2.2-VAE-Lance-bf16