📂 Part of the Lance MLX collection on mlx-community.

Lance-3B-bf16 (MLX, image specialist)

MLX port of ByteDance Intelligent Creation Lab's Lance unified multimodal model: the image-specialist Lance_3B checkpoint, converted to bf16 for Apple Silicon. It has ~6.19 B LLM parameters in an MoT (Mixture-of-Transformer-Experts) layout, plus the Qwen2.5-VL ViT (~669 M) and Lance's bundled Wan2.2 VAE (~705 M) for full image-task coverage.

Lance is ByteDance's 3B-active unified multimodal model (paper, code, HF original). This is not Lance/LanceDB, the columnar data format.

Status

🟢 Production-ready for image tasks on Apple Silicon as of 2026-05-21.

Capability	Status
t2i (text → image)	✅ Photorealistic, prompt-aligned. 768² output at ~6.7 s/step.
image_edit (instruction-based editing)	✅ Identity + style + signature preservation verified. ~6.7 s/step.
x2t_image (image understanding / VQA)	✅ Content-correct across all 6 oracle cases.
KV cache for autoregressive decode	✅ 1.7×–2.8× speedup over no-cache baseline.

For video tasks (t2v, video_edit, x2t_video), see mlx-community/Lance-3B-Video-bf16. All six Lance task families are now validated end-to-end on Apple Silicon as of 2026-05-21.

The 48-channel Wan2.2 VAE is bundled here for convenience but also published standalone at mlx-community/Wan2.2-VAE-Lance-bf16 — both image_edit and the video pipelines need it.

Quickstart

Install from the source repo (a PyPI package is planned for a follow-up release):

git clone https://github.com/xocialize/lance-mlx
cd lance-mlx && uv sync

Download this checkpoint:

from huggingface_hub import snapshot_download
weights = snapshot_download("mlx-community/Lance-3B-bf16")

Text-to-image

from lance_mlx.pipeline.t2i import TextToImagePipeline

pipe = TextToImagePipeline.from_pretrained(
    lance_weights_dir=weights,
    vae_safetensors=f"{weights}/vae.safetensors",
)
image = pipe.generate(
    "A photorealistic tabby cat holding up a colorful STOP sign on a sunlit street.",
    height=768, width=768, num_steps=30, cfg_scale=4.0, seed=42,
)
image.save("cat_with_stop.png")

Image editing

from lance_mlx.pipeline.image_edit import ImageEditPipeline

pipe = ImageEditPipeline.from_pretrained(
    lance_weights_dir=weights,
    vae_safetensors=f"{weights}/vae.safetensors",
)
edited = pipe.generate(
    input_image="portrait.jpg",
    instruction="Remove the hat from the painting.",
    height=768, width=768, num_steps=30, cfg_scale=4.0, seed=42,
)
edited.save("portrait_no_hat.png")

Image VQA / understanding

from lance_mlx.pipeline.understanding import UnderstandingPipeline
from PIL import Image

pipe = UnderstandingPipeline.from_pretrained(
    lance_weights_dir=weights,
    vit_safetensors=f"{weights}/vit.safetensors",
)
answer = pipe.generate(
    Image.open("license_plate.png"),
    "What is the license plate number visible in this image?",
    max_new_tokens=64, prompt_style="lance",
)
print(answer)

Performance (M5 Max 128 GB, macOS 26.2, MLX bf16)

Task	Configuration	Wall-clock
t2i	768² × 30 steps × CFG=4.0	~201 s
image_edit	768² × 30 steps × CFG=4.0	~201 s
x2t_image	6 oracle cases (5–100 token answers), KV-cached	~34 s combined

KV cache scales with answer length: 1.7× speedup on a 5-token answer, 2.8× on a ~100-token answer.
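
The trend follows from how much work each decode step repeats. The sketch below counts token positions pushed through the model rather than seconds; the function name and the 1,000-token prompt length are illustrative assumptions, not measurements from lance-mlx.

```python
def positions_processed(prompt_len: int, answer_len: int, use_cache: bool) -> int:
    """Count token positions run through the LLM to produce `answer_len` tokens."""
    if answer_len < 1:
        raise ValueError("answer_len must be at least 1")
    if use_cache:
        # Prefill the prompt once (this yields the first answer token),
        # then feed one new token per remaining step.
        return prompt_len + (answer_len - 1)
    # Without a cache, every step re-runs the full prefix seen so far.
    return sum(prompt_len + step for step in range(answer_len))


# Hypothetical 1,000-token prompt (image tokens + question).
for answer_len in (5, 100):
    no_cache = positions_processed(1000, answer_len, use_cache=False)
    cached = positions_processed(1000, answer_len, use_cache=True)
    print(answer_len, no_cache, cached, round(no_cache / cached, 1))
```

The position ratio grows with answer length, the same direction as the measured speedups. The measured wall-clock gains are much smaller than this raw count, which is expected because the count ignores fixed per-step costs and the attention still computed over cached keys.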

Architecture

Two expert towers (LLM_UND, LLM_GEN), each initialized from Qwen2.5-VL-3B-Instruct, with per-expert FFN, output projection, and QK-norm.
Modality-deterministic routing: text + Qwen2.5-VL ViT semantic tokens → LLM_UND (autoregressive next-token); Wan2.2 VAE latent tokens → LLM_GEN (flow-matching velocity prediction). No learned gate.
MaPE — modality-aware RoPE with per-modality temporal anchor (image-gen tokens re-anchored to t=1000).
Wan2.2 3D causal VAE with 16× spatial / 4× temporal compression and a 48-channel latent. Lance bundles its own VAE; do NOT use the public 16-ch wan2.2_vae.safetensors.
Bidirectional attention within the latent block: the mask is causal_mask OR full_and_noise_mask, following upstream data/data_utils.py::create_sparse_mask. Without it, noisy-VAE position 0 of a 2304-token image grid can see only itself and the text, which produces blurry outputs across all prompts.
Untied LM head.
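
The 2304-token figure and the mask rule can be sketched in plain Python. This is an illustration under stated assumptions, not the upstream create_sparse_mask: the helper names are hypothetical, the sequence is assumed to be laid out as text tokens followed by one latent block, and one token per latent cell is inferred from the 16× spatial compression (768 / 16 = 48 per side).

```python
def latent_grid_tokens(height: int, width: int, spatial_factor: int = 16) -> int:
    """Number of latent tokens for an image, assuming one token per latent cell."""
    return (height // spatial_factor) * (width // spatial_factor)


def build_mask(n_text: int, n_latent: int) -> list[list[bool]]:
    """mask[q][k] is True when query position q may attend to key position k.

    Assumed layout: [text tokens | latent tokens].
    Rule: causal everywhere, OR'd with full attention inside the latent block.
    """
    total = n_text + n_latent
    mask = []
    for q in range(total):
        row = []
        for k in range(total):
            causal = k <= q
            both_latent = q >= n_text and k >= n_text
            row.append(causal or both_latent)
        mask.append(row)
    return mask


print(latent_grid_tokens(768, 768))  # 48 * 48 = 2304

mask = build_mask(n_text=2, n_latent=3)
# The first latent token (index 2) sees the text and every latent token.
print(mask[2])  # [True, True, True, True, True]
# Text token 0 stays causal: it sees only itself.
print(mask[0])  # [True, False, False, False, False]
```

With a purely causal mask, row 2 would be [True, True, True, False, False]: the first latent position would see only itself and the text, which is the failure mode described above.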

Files in this repo

File	Size	Purpose
`model.safetensors`	12.37 GB	LLM weights (1021 tensors, both UND + GEN towers)
`vit.safetensors`	1.34 GB	Qwen2.5-VL ViT (semantic encoder for x2t_image + image_edit)
`vae.safetensors`	1.41 GB	Lance's bundled Wan2.2 VAE (encoder + decoder, 48-ch)
`config.json`	–	`Qwen2_5_VLForConditionalGeneration` config with `tie_word_embeddings=false`
`conversion_report.json`	–	Provenance of safetensors conversion (PyTorch → MLX bf16)
`tokenizer.json` / `vocab.json`	–	Qwen2.5-VL vocabulary (151,936 tokens)
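
The three weight-file sizes follow from the parameter counts at 2 bytes per bf16 value. A quick check, assuming decimal gigabytes (10^9 bytes) and ignoring the safetensors header and the few tensors kept in F32; the counts are the ones quoted in this card, and the ViT and VAE counts are approximate.

```python
BYTES_PER_BF16 = 2

param_counts = {
    "model.safetensors": 6.185e9,  # LLM, both towers
    "vit.safetensors": 669e6,      # Qwen2.5-VL ViT (approximate)
    "vae.safetensors": 705e6,      # Wan2.2 VAE (approximate)
}


def size_gb(num_params: float) -> float:
    """Approximate on-disk size in decimal GB for bf16 weights."""
    return num_params * BYTES_PER_BF16 / 1e9


for name, count in param_counts.items():
    print(f"{name}: {size_gb(count):.2f} GB")
```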

Provenance

Source: bytedance-research/Lance/Lance_3B/model.safetensors (1021 tensors, 6.185 B params). Conversion script: scripts/02_convert.py in the lance-mlx repo. The script:

Loads original PyTorch safetensors, keeps F32 for normalization scales (per Phase 1b notes).
Strips the language_model. prefix; the MLX LanceModel is the root, not nested.
Splits the bundled ViT (vit_model.* keys) into a sibling vit.safetensors for parity with the Lance_3B distribution shape.
Re-keys llm2vae.weight/bias and time_embedder.mlp.{0,2}.{weight,bias} to match scaffolded MLX modules.
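
The prefix stripping and ViT split can be sketched on a plain dict. This is not scripts/02_convert.py: the function name and the demo keys are hypothetical, string values stand in for tensors, and it is an assumption that ViT keys start with `vit_model.` and keep that prefix in the output. The llm2vae and time_embedder renames are left out because their target names are not given here.

```python
def split_and_rename(state: dict) -> tuple[dict, dict]:
    """Return (llm_weights, vit_weights) from a flat checkpoint dict."""
    llm, vit = {}, {}
    for key, value in state.items():
        if key.startswith("vit_model."):
            # ViT tensors go to the sibling vit.safetensors file.
            vit[key] = value
        else:
            # LanceModel is the root in MLX, so drop the nesting prefix
            # (keys without the prefix pass through unchanged).
            llm[key.removeprefix("language_model.")] = value
    return llm, vit


demo = {
    "language_model.layers.0.mlp.weight": "tensor-a",  # hypothetical key
    "vit_model.blocks.0.attn.weight": "tensor-b",      # hypothetical key
}
llm, vit = split_and_rename(demo)
print(sorted(llm))  # ['layers.0.mlp.weight']
print(sorted(vit))  # ['vit_model.blocks.0.attn.weight']
```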

Wan2.2 VAE source: bytedance-research/Lance/Wan2.2_VAE.pth → scripts/06_convert_wan_vae.py. Roundtrip MAD on a real photo at 768² is ~7/255 in u8 domain.
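
For reference, a minimal sketch of how a figure like ~7/255 could be computed, assuming MAD here means mean absolute difference over 8-bit pixel values. The function name and the sample values are made up for illustration and are not data from the conversion.

```python
def mean_abs_diff_u8(original, reconstructed):
    """Mean absolute difference between two equal-length sequences of 0-255 values."""
    if len(original) != len(reconstructed) or not original:
        raise ValueError("inputs must be non-empty and the same length")
    total = sum(abs(a - b) for a, b in zip(original, reconstructed))
    return total / len(original)


original = [10, 200, 128, 64]       # made-up pixel values, not real data
reconstructed = [17, 193, 135, 57]  # each value off by exactly 7
print(mean_abs_diff_u8(original, reconstructed))  # 7.0
```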

Limitations

bf16 only. 4-bit and 8-bit quantization are in progress. Naive INT4 has been observed to degrade the GEN expert (per Reza2kn/lance-quant's findings), so quantization needs per-tower calibration.
English + Chinese prompts work; other languages are training-distribution-limited (Qwen2.5-VL was trained primarily on en + zh).
No streaming / batching API yet. Single-image, single-prompt generation only.
CFG runs the LLM twice per step. A future KV-cache for the text + clean-ref prefix would save ~30% on image_edit.
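
Why CFG doubles the LLM cost: each step needs one conditional and one unconditional velocity prediction before the two are combined. A minimal sketch using the common classifier-free guidance formula; whether lance-mlx uses exactly this form is an assumption, and `predict_velocity` is a hypothetical stand-in for the LLM_GEN forward pass.

```python
from typing import Callable, Optional, Sequence


def cfg_velocity(
    predict_velocity: Callable[[Sequence[float], Optional[str]], list],
    latents: Sequence[float],
    prompt: str,
    cfg_scale: float,
) -> list:
    """Combine conditional and unconditional predictions (two model calls)."""
    v_cond = predict_velocity(latents, prompt)   # pass 1: with the prompt
    v_uncond = predict_velocity(latents, None)   # pass 2: prompt dropped
    return [u + cfg_scale * (c - u) for c, u in zip(v_cond, v_uncond)]


call_log = []  # records every model call made by the fake predictor


def fake_predict(latents, prompt):
    """Toy predictor: shifts latents by 1.0 when a prompt is given."""
    call_log.append(prompt)
    offset = 1.0 if prompt is not None else 0.0
    return [x + offset for x in latents]


v = cfg_velocity(fake_predict, [0.0, 2.0], "a cat", cfg_scale=4.0)
print(v, len(call_log))  # [4.0, 6.0] 2
```

At 30 steps this means 60 LLM forward passes per image. Any part of the input that is identical across steps is a candidate for caching, which is the motivation for the prefix KV-cache idea above.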

Documented divergences from upstream PyTorch

Outputs differ in low-level pixel detail from a CUDA reference run on the same seed/prompt. About 1–5% per-pixel deviation is expected from bf16 vs fp32, MLX RoPE vs PyTorch RoPE rounding, and a small number of intermediate-norm precision steps. Semantic correctness is preserved across the 6 x2t_image oracle cases and all visually verified t2i + image_edit prompts.
x2t_image answers differ stylistically from Phase 0 oracle (PyTorch) — consistent across all 6 cases. Tracked as a Phase 5 parity follow-up; does not affect content correctness.

License

This MLX port: Apache 2.0.

Underlying weights:

Lance: Apache 2.0 (ByteDance Intelligent Creation Lab).
Wan2.2 VAE: Apache 2.0 (Alibaba).
Qwen2.5-VL: Apache 2.0 (Alibaba).

See NOTICE for full attribution.

Citation

@article{fu2026lance,
  title={Lance: Unified Multimodal Modeling by Multi-Task Synergy},
  author={Fu, Fengyi and Huang, Mengqi and Wu, Shaojin and others},
  journal={arXiv preprint arXiv:2605.18678},
  year={2026}
}
