📂 Part of the Lance MLX collection on mlx-community.

Lance-3B-bf16 (MLX, image specialist)

MLX port of ByteDance Intelligent Creation Lab's Lance unified multimodal model — the image-specialist Lance_3B checkpoint, converted to bf16 for Apple Silicon. 6.19 B LLM parameters in MoT (Mixture-of-Transformer-Experts) layout, plus the Qwen2.5-VL ViT (669 M) and Lance's bundled Wan2.2 VAE (~705 M) for full image-task coverage.

Lance is ByteDance's 3B-active unified multimodal model (paper, code, HF original). This is not Lance/LanceDB, the columnar data format.

Status

🟢 Production-ready for image tasks on Apple Silicon as of 2026-05-21.

| Capability | Status |
| --- | --- |
| t2i (text → image) | ✅ Photorealistic, prompt-aligned. 768² output at ~6.7 s/step. |
| image_edit (instruction-based editing) | ✅ Identity + style + signature preservation verified. ~6.7 s/step. |
| x2t_image (image understanding / VQA) | ✅ Content-correct across all 6 oracle cases. |
| KV cache for autoregressive decode | ✅ 1.7×–2.8× speedup over no-cache baseline. |

For video tasks (t2v, video_edit, x2t_video), see mlx-community/Lance-3B-Video-bf16. All six Lance task families are now validated end-to-end on Apple Silicon as of 2026-05-21.

The 48-channel Wan2.2 VAE is bundled here for convenience but also published standalone at mlx-community/Wan2.2-VAE-Lance-bf16 — both image_edit and the video pipelines need it.

Quickstart

Install from the source repo (will be on PyPI in a follow-up release):

git clone https://github.com/xocialize/lance-mlx
cd lance-mlx && uv sync

Download this checkpoint:

from huggingface_hub import snapshot_download
weights = snapshot_download("mlx-community/Lance-3B-bf16")

Text-to-image

from lance_mlx.pipeline.t2i import TextToImagePipeline

pipe = TextToImagePipeline.from_pretrained(
    lance_weights_dir=weights,
    vae_safetensors=f"{weights}/vae.safetensors",
)
image = pipe.generate(
    "A photorealistic tabby cat holding up a colorful STOP sign on a sunlit street.",
    height=768, width=768, num_steps=30, cfg_scale=4.0, seed=42,
)
image.save("cat_with_stop.png")

Image editing

from lance_mlx.pipeline.image_edit import ImageEditPipeline

pipe = ImageEditPipeline.from_pretrained(
    lance_weights_dir=weights,
    vae_safetensors=f"{weights}/vae.safetensors",
)
edited = pipe.generate(
    input_image="portrait.jpg",
    instruction="Remove the hat from the painting.",
    height=768, width=768, num_steps=30, cfg_scale=4.0, seed=42,
)
edited.save("portrait_no_hat.png")

Image VQA / understanding

from lance_mlx.pipeline.understanding import UnderstandingPipeline
from PIL import Image

pipe = UnderstandingPipeline.from_pretrained(
    lance_weights_dir=weights,
    vit_safetensors=f"{weights}/vit.safetensors",
)
answer = pipe.generate(
    Image.open("license_plate.png"),
    "What is the license plate number visible in this image?",
    max_new_tokens=64, prompt_style="lance",
)
print(answer)

Performance (M5 Max 128 GB, macOS 26.2, MLX bf16)

| Task | Configuration | Wall-clock |
| --- | --- | --- |
| t2i | 768² × 30 steps × CFG=4.0 | ~201 s |
| image_edit | 768² × 30 steps × CFG=4.0 | ~201 s |
| x2t_image | 6 oracle cases (5–100 token answers), KV-cached | ~34 s combined |

KV cache scales with answer length: 1.7× speedup on a 5-token answer, 2.8× on a ~100-token answer.
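
The t2i and image_edit wall-clock figures are consistent with the ~6.7 s/step rate quoted in the Status section. A minimal sketch of that arithmetic follows; the helper name is hypothetical, and it assumes the denoising loop dominates the run, so model load and VAE decode time (which this card does not break out) are ignored.

```python
def estimate_denoise_seconds(num_steps: int, seconds_per_step: float) -> float:
    """Rough wall-clock estimate for the denoising loop only (hypothetical helper)."""
    return num_steps * seconds_per_step


# 30 steps at roughly 6.7 s/step, the figures quoted on this card.
estimate = estimate_denoise_seconds(num_steps=30, seconds_per_step=6.7)
print(f"~{estimate:.0f} s")
```

Halving num_steps should roughly halve the estimate, at whatever quality cost the sampler incurs.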

Architecture

  • Two expert towers (LLM_UND, LLM_GEN), each initialized from Qwen2.5-VL-3B-Instruct, with per-expert FFN, output projection, and QK-norm.
  • Modality-deterministic routing: text + Qwen2.5-VL ViT semantic tokens → LLM_UND (autoregressive next-token); Wan2.2 VAE latent tokens → LLM_GEN (flow-matching velocity prediction). No learned gate.
  • MaPE — modality-aware RoPE with per-modality temporal anchor (image-gen tokens re-anchored to t=1000).
  • Wan2.2 3D causal VAE (16× spatial / 4× temporal compression, 48-channel latent — Lance bundles its own VAE; do NOT use the public 16-ch wan2.2_vae.safetensors).
  • Bidirectional attention within the latent block: causal_mask OR full_and_noise_mask, per upstream data/data_utils.py::create_sparse_mask. Without this, the noisy-VAE position 0 of a 2304-token image grid can only see itself + text, producing blurry outputs across all prompts.
  • Untied LM head.
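
The 2304-token grid and the effect of the bidirectional latent block can be illustrated with a toy mask. This is a sketch under stated assumptions, not the upstream create_sparse_mask implementation: it assumes one token per latent position at 16× spatial compression (which reproduces the 2304 figure for a 768² image), a sequence laid out as text tokens followed by one latent block, and hypothetical function names.

```python
def latent_grid_tokens(height: int, width: int, spatial_factor: int = 16) -> int:
    """Latent tokens for one image, assuming one token per latent position."""
    return (height // spatial_factor) * (width // spatial_factor)


def build_mask(num_text: int, num_latent: int, bidirectional_latents: bool) -> list[list[bool]]:
    """mask[q][k] is True when query position q may attend to key position k.

    Assumed layout: text tokens first, then a single block of latent tokens.
    """
    total = num_text + num_latent
    mask = []
    for q in range(total):
        row = []
        for k in range(total):
            allowed = k <= q  # plain causal rule
            if bidirectional_latents and q >= num_text and k >= num_text:
                allowed = True  # latent tokens see the whole latent block
            row.append(allowed)
        mask.append(row)
    return mask


print(latent_grid_tokens(768, 768))  # (768 // 16) ** 2 = 2304

num_text, num_latent = 4, 9  # toy sizes so the mask stays small
causal = build_mask(num_text, num_latent, bidirectional_latents=False)
block = build_mask(num_text, num_latent, bidirectional_latents=True)

first_latent = num_text  # index of latent position 0
print(sum(causal[first_latent]))  # 5: the four text tokens plus itself
print(sum(block[first_latent]))   # 13: the four text tokens plus all nine latents
```

With the purely causal mask, the first latent token has no view of the rest of the image, which is the failure mode described in the list above.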

Files in this repo

| File | Size | Purpose |
| --- | --- | --- |
| model.safetensors | 12.37 GB | LLM weights (1021 tensors, both UND + GEN towers) |
| vit.safetensors | 1.34 GB | Qwen2.5-VL ViT (semantic encoder for x2t_image + image_edit) |
| vae.safetensors | 1.41 GB | Lance's bundled Wan2.2 VAE (encoder + decoder, 48-ch) |
| config.json | | Qwen2_5_VLForConditionalGeneration config with tie_word_embeddings=false |
| conversion_report.json | | Provenance of safetensors conversion (PyTorch → MLX bf16) |
| tokenizer.json / vocab.json | | Qwen2.5-VL vocabulary (151,936 tokens) |
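
The listed file sizes line up with the parameter counts quoted elsewhere on this card at 2 bytes per bf16 parameter. The sketch below is a back-of-the-envelope check only: it assumes decimal gigabytes (1e9 bytes) and ignores the small number of normalization scales kept in F32 and any safetensors header overhead.

```python
BYTES_PER_BF16 = 2


def bf16_size_gb(num_params: float) -> float:
    """Approximate on-disk size in decimal GB for bf16 weights."""
    return num_params * BYTES_PER_BF16 / 1e9


# Parameter counts as quoted on this card (the VAE figure is approximate).
components = {
    "model.safetensors": 6.185e9,
    "vit.safetensors": 669e6,
    "vae.safetensors": 705e6,
}

for name, params in components.items():
    print(f"{name}: {bf16_size_gb(params):.2f} GB")

total_gb = sum(bf16_size_gb(p) for p in components.values())
print(f"total: {total_gb:.2f} GB")
```

The total gives a lower bound on the memory needed to hold all three components at once, before activations and the KV cache.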

Provenance

Source: bytedance-research/Lance/Lance_3B/model.safetensors (1021 tensors, 6.185 B params). Conversion script: scripts/02_convert.py in the lance-mlx repo. The script:

  • Loads original PyTorch safetensors, keeps F32 for normalization scales (per Phase 1b notes).
  • Strips the language_model. prefix; the MLX LanceModel is the root, not nested.
  • Splits the bundled ViT (vit_model.* keys) into a sibling vit.safetensors for parity with the Lance_3B distribution shape.
  • Re-keys llm2vae.weight/bias and time_embedder.mlp.{0,2}.{weight,bias} to match scaffolded MLX modules.
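
The prefix-stripping and ViT-splitting steps above can be sketched as plain dictionary manipulation. This is not the code in scripts/02_convert.py: the function name and the example keys are hypothetical, tensors are stood in for by strings, and it assumes the ViT keys are kept unchanged in the sibling file. The llm2vae and time_embedder re-keying is omitted because the target names are not given here.

```python
def split_and_strip(state: dict[str, object]) -> tuple[dict[str, object], dict[str, object]]:
    """Split a flat state dict into (llm, vit) dicts (hypothetical helper).

    Keys starting with "vit_model." go to the ViT dict unchanged (assumption);
    a leading "language_model." is stripped so the LLM sits at the root.
    """
    llm: dict[str, object] = {}
    vit: dict[str, object] = {}
    for key, tensor in state.items():
        if key.startswith("vit_model."):
            vit[key] = tensor
        else:
            llm[key.removeprefix("language_model.")] = tensor
    return llm, vit


# Hypothetical keys for illustration only; the real checkpoint keys may differ.
example = {
    "language_model.model.layers.0.mlp.up_proj.weight": "tensor-a",
    "vit_model.blocks.0.attn.qkv.weight": "tensor-b",
    "llm2vae.weight": "tensor-c",
}
llm_state, vit_state = split_and_strip(example)
print(sorted(llm_state))
print(sorted(vit_state))
```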

Wan2.2 VAE source: bytedance-research/Lance/Wan2.2_VAE.pth, converted with scripts/06_convert_wan_vae.py. Roundtrip MAD on a real photo at 768² is ~7/255 in u8 domain.
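
For readers who want to reproduce a round-trip figure of this kind, the metric can be sketched as follows. It assumes MAD here means mean absolute difference between the original and the encode-then-decode reconstruction, both as 8-bit pixel values; the function name is hypothetical and the VAE call itself is left out.

```python
def mean_abs_diff_u8(original: bytes, reconstructed: bytes) -> float:
    """Mean absolute per-channel difference between two u8 pixel buffers."""
    if len(original) != len(reconstructed):
        raise ValueError("buffers must have the same length")
    if not original:
        raise ValueError("buffers must be non-empty")
    total = sum(abs(a - b) for a, b in zip(original, reconstructed))
    return total / len(original)


# Toy buffers standing in for flattened RGB pixels (not real image data).
src = bytes([0, 10, 255, 128])
rec = bytes([7, 3, 248, 135])
print(mean_abs_diff_u8(src, rec))  # every value is off by 7, so the mean is 7.0
```

With PIL, the buffers could come from Image.tobytes() on same-size RGB images.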

Limitations

  • bf16 only. 4-bit and 8-bit quantization are in progress. Naive INT4 has been observed to degrade the GEN expert (per Reza2kn/lance-quant's findings); quantization needs per-tower calibration.
  • English + Chinese prompts work; other languages are training-distribution-limited (Qwen2.5-VL was trained primarily on en + zh).
  • No streaming / batching API yet. Single-image, single-prompt generation only.
  • CFG runs the LLM twice per step. A future KV-cache for the text + clean-ref prefix would save ~30% on image_edit.
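
The reason CFG costs two LLM passes per step is that each step needs both a conditional and an unconditional prediction before they are combined. The sketch below uses the standard classifier-free guidance formula on plain Python lists; whether lance-mlx combines the two velocity predictions in exactly this form is an assumption, and the function names are hypothetical.

```python
def cfg_combine(v_cond: list[float], v_uncond: list[float], cfg_scale: float) -> list[float]:
    """Standard classifier-free guidance: push the prediction away from the
    unconditional branch, toward the conditional one."""
    return [u + cfg_scale * (c - u) for c, u in zip(v_cond, v_uncond)]


def llm_passes(num_steps: int, use_cfg: bool = True) -> int:
    """Forward passes through the LLM for one generation."""
    return num_steps * (2 if use_cfg else 1)


# Toy velocity predictions for a two-element latent.
guided = cfg_combine(v_cond=[1.0, 0.5], v_uncond=[0.0, 0.5], cfg_scale=4.0)
print(guided)          # [4.0, 0.5]
print(llm_passes(30))  # 60 passes for the 30-step settings used in the Quickstart
```

At cfg_scale=1.0 the formula reduces to the conditional prediction, so the unconditional pass only matters for scales above 1.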

Documented divergences from upstream PyTorch

  • Outputs differ in low-level pixel detail from a CUDA reference run on the same seed/prompt (~1–5% per-pixel deviation expected from bf16 vs fp32, MLX RoPE vs PyTorch RoPE rounding, and a small number of intermediate-norm precision steps). Semantic correctness preserved across the 6 x2t_image oracle cases and all visually-verified t2i + image_edit prompts.
  • x2t_image answers differ stylistically from Phase 0 oracle (PyTorch) — consistent across all 6 cases. Tracked as a Phase 5 parity follow-up; does not affect content correctness.

License

This MLX port: Apache 2.0.

Underlying weights:

  • Lance: Apache 2.0 (ByteDance Intelligent Creation Lab).
  • Wan2.2 VAE: Apache 2.0 (Alibaba).
  • Qwen2.5-VL: Apache 2.0 (Alibaba).

See NOTICE for full attribution.

Citation

@article{fu2026lance,
  title={Lance: Unified Multimodal Modeling by Multi-Task Synergy},
  author={Fu, Fengyi and Huang, Mengqi and Wu, Shaojin and others},
  journal={arXiv preprint arXiv:2605.18678},
  year={2026}
}
