Lance-3B AWQ INT4 (image checkpoint)
4-bit AWQ-quantized variant of bytedance-research/Lance, the Lance_3B image-focused checkpoint (text-to-image, image edit, image understanding).
- File-size reduction: 24.7 GB → ~6 GB (4×)
- Inference VRAM (LLM only, bf16 activations): ~13 GB → ~6 GB
What's different from the video sibling
Lance ships two checkpoints:
- Lance_3B: image-focused (this one). 24.7 GB F32 source. No bundled ViT, smaller latent_pos_embed (image grid only).
- Lance_3B_Video: video-focused. 28.4 GB F32 source. Bundles the Qwen2.5-VL ViT in its safetensors, plus a larger video-grid latent_pos_embed. Quantized variant: Reza2kn/Lance-3B-Video-AWQ-INT4.
This image checkpoint relies on the standalone Qwen2.5-VL-ViT for vision encoding (also bundled in the official Lance HF repo; not redistributed here).
What was quantized
Same MoT-aware scheme as the video sibling: 504 Linear modules in language_model.* (252 understanding-path + 252 generation-expert _moe_gen variants), 360 with AWQ scale fusion into the preceding RMSNorm, 144 with plain per-group min-max (o_proj, down_proj). The ViT, projection layers, time embedder, latent positional embeds, and lm_head are kept in bf16.
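To illustrate what "scale fusion into the preceding RMSNorm" means, here is a minimal NumPy sketch. It is not the repo's code: the function names, shapes, and the random per-channel scale are assumptions for illustration only. The idea is that a per-input-channel scale multiplied into the weight can be cancelled by dividing the norm gain by the same scale, so the layer output is unchanged before quantization is applied to the rescaled weight.

```python
import numpy as np


def rmsnorm(x, gain, eps=1e-6):
    """Plain RMSNorm over the last axis."""
    rms = np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)
    return x / rms * gain


def fuse_awq_scale(weight, gain, scale):
    """Fold a positive per-input-channel scale into (weight, gain).

    weight: (out_features, in_features), gain and scale: (in_features,).
    Hypothetical helper: the weight columns are multiplied by `scale`
    and the norm gain is divided by it, so the product is preserved.
    """
    return weight * scale[None, :], gain / scale


rng = np.random.default_rng(0)
x = rng.normal(size=(4, 16))          # toy activations
W = rng.normal(size=(32, 16))         # toy Linear weight
g = rng.normal(size=16)               # toy RMSNorm gain
s = rng.uniform(0.5, 2.0, size=16)    # stand-in for a calibrated AWQ scale

W_f, g_f = fuse_awq_scale(W, g, s)
y_ref = rmsnorm(x, g) @ W.T
y_fused = rmsnorm(x, g_f) @ W_f.T     # same output, rescaled weight
```

o_proj and down_proj are not directly preceded by an RMSNorm, so there is no gain vector to absorb the inverse scale, which is consistent with them falling back to plain per-group min-max.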
See the video sibling README for the full per-component table; it is identical here.
Calibration
- x2t_image (Lance's 6-sample example set, full 30 timesteps) → 252 und-path linears, 85.3 M tokens of activation data
- t2i (Lance's 11-sample example set, 2 denoising timesteps) → all 504 linears (both und and gen paths)
- Merged: 252 und + 252 gen Linears, all with activation data
File layout
Lance_3B-AWQ-INT4/
├── awq_state_dict.safetensors   # ~6 GB: packed INT4 + bf16 pass-through
├── awq_meta.json                # per-weight scheme + group_size + shape
└── README.md
Storage layout per quantized linear is identical to the video sibling; see that repo for the qweight / scales / zeros byte layout.
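For readers unfamiliar with packed INT4 storage, the sketch below shows one common convention: two 4-bit codes per byte, low nibble first. This is an assumption for illustration only. The actual qweight byte order used by this checkpoint is the one documented in the video sibling repo and may differ, and `pack_int4` / `unpack_int4` are hypothetical helper names.

```python
import numpy as np


def pack_int4(codes):
    """Pack 4-bit codes (0..15, even count on the last axis) two per byte.

    ASSUMPTION: low nibble holds the even-indexed code.
    """
    codes = np.asarray(codes, dtype=np.uint8)
    return (codes[..., 0::2] | (codes[..., 1::2] << 4)).astype(np.uint8)


def unpack_int4(packed):
    """Inverse of pack_int4 under the same nibble-order assumption."""
    packed = np.asarray(packed, dtype=np.uint8)
    out = np.empty(packed.shape[:-1] + (packed.shape[-1] * 2,), dtype=np.uint8)
    out[..., 0::2] = packed & 0x0F
    out[..., 1::2] = packed >> 4
    return out


codes = np.arange(16, dtype=np.uint8).reshape(2, 8)
packed = pack_int4(codes)        # half as many bytes as codes
restored = unpack_int4(packed)   # round-trips back to the original codes
```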
How to use
Same as the video sibling. The Lance source ships a custom Lance PreTrainedModel (in github.com/bytedance/Lance). Use the runtime swap-in approach: build Lance normally, then replace nn.Linear modules in language_model.* with the WQLinearINT4 reference module and stream the AWQ buffers in.
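The swap-in step can be sketched as below, assuming PyTorch. `FakeQuantLinear` is a hypothetical stand-in for WQLinearINT4, whose real constructor and buffer names live in the lance-quant repo and are not reproduced here. The toy model and the `skip` argument are likewise illustrative. Only the module-replacement pattern is the point.

```python
import torch
import torch.nn as nn


class FakeQuantLinear(nn.Module):
    """Hypothetical placeholder for WQLinearINT4 (real signature not shown here)."""

    def __init__(self, in_features, out_features, bias=False):
        super().__init__()
        # A real module would register packed qweight / scales / zeros buffers.
        self.register_buffer("weight", torch.zeros(out_features, in_features))
        self.register_buffer("bias", torch.zeros(out_features) if bias else None)

    def forward(self, x):
        return nn.functional.linear(x, self.weight, self.bias)


def swap_linears(model, prefix, skip=("lm_head",)):
    """Replace every nn.Linear whose qualified name starts with `prefix`."""
    targets = [
        name for name, mod in model.named_modules()
        if isinstance(mod, nn.Linear)
        and name.startswith(prefix)
        and not any(s in name for s in skip)
    ]
    for name in targets:
        parent_name, _, child = name.rpartition(".")
        parent = model.get_submodule(parent_name) if parent_name else model
        old = getattr(parent, child)
        new = FakeQuantLinear(old.in_features, old.out_features, old.bias is not None)
        setattr(parent, child, new)
    return targets


# Toy model standing in for a built Lance model.
toy = nn.Module()
toy.language_model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 8))
toy.vit = nn.Linear(8, 8)  # outside language_model.*, so left untouched
swapped = swap_linears(toy, "language_model.")
```

After the swap, the quantized buffers would be copied into the new modules from awq_state_dict.safetensors. How the keys map onto module names is defined by the reference implementation, not by this sketch.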
A complete reproduction (calibration scripts + WQLinearINT4 + run_quant_eval.py) is at: https://github.com/Reza2kn/lance-quant
Quality
Side-by-side on Lance's bundled x2t_image example (6 cases): outputs match the bf16 baseline to within typical AWQ tolerance. Naïve min-max INT4 produces gibberish ("the loose subs ifa…"); proper AWQ calibration recovers it ("Yes, the largest segment is greater than the sum of all the other segments.").
License
Apache 2.0, inherited from the base model.
v2 update: group_size 128 → 64
Re-quantized with --group_size 64 (was 128 in v1). Same AWQ calibration data,
same scale-fusion recipe. Storage: ~6.15 GB (was 6.02 GB); the +2.5% size is
the cost of 2× more per-group scales.
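The size trade-off can be reasoned about with a small bits-per-weight calculation. The metadata width below (32 bits per group, for example a 16-bit scale plus a 16-bit zero point) is an assumption chosen for illustration. The real per-group metadata format is the one recorded in awq_meta.json and the reference repo.

```python
def bits_per_weight(group_size, meta_bits_per_group, code_bits=4):
    """Average storage per quantized weight, counting per-group metadata."""
    return code_bits + meta_bits_per_group / group_size


# ASSUMPTION: 32 bits of metadata per group; not taken from the checkpoint.
META_BITS = 32

v1_bpw = bits_per_weight(128, META_BITS)   # 4.25
v2_bpw = bits_per_weight(64, META_BITS)    # 4.5

# Relative growth of the quantized tensors only.
growth_quantized_only = v2_bpw / v1_bpw - 1.0
```

Halving the group size doubles only the metadata term. The growth of the whole file is smaller than the growth of the quantized tensors alone, because the bf16 pass-through tensors stay the same size.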
Quality jumped substantially on Lance's bundled x2t_image bench:
| variant | exact-match | char similarity | difflib ratio | word Jaccard |
|---|---|---|---|---|
| v1 (group_size=128) | 33.3 % | 60.4 % | 53.7 % | 55.3 % |
| v2 (group_size=64) | 50.0 % | 69.8 % | 62.1 % | 66.3 % |
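For reference, three of the table's columns can be computed along the following lines using only the Python standard library. This is a sketch of plausible definitions, not the exact code behind the numbers. The real evaluation is in run_quant_eval.py, and the "char similarity" column is omitted because its definition is not stated here.

```python
import difflib


def exact_match(ref, hyp):
    """True when the two outputs are identical after trimming whitespace."""
    return ref.strip() == hyp.strip()


def difflib_ratio(ref, hyp):
    """Character-level SequenceMatcher ratio in [0, 1]."""
    return difflib.SequenceMatcher(None, ref, hyp).ratio()


def word_jaccard(ref, hyp):
    """Jaccard overlap of lowercase whitespace-separated word sets."""
    a, b = set(ref.lower().split()), set(hyp.lower().split())
    return len(a & b) / len(a | b) if (a | b) else 1.0


baseline = "Yes, the largest segment is greater than the sum of all the other segments."
candidate = "Yes, the largest segment is greater than the sum of the others."
scores = {
    "exact": exact_match(baseline, candidate),
    "difflib": difflib_ratio(baseline, candidate),
    "jaccard": word_jaccard(baseline, candidate),
}
```

Per-case scores would then be averaged over the 6 x2t_image cases to give one row of the table.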
The biggest win is on case 4 ("$ spent on promotional events 1998"): v1 hallucinated entities ("Scott Levin and his family") around the correct number; v2 produces the exact baseline output:
"According to the data from the proprietary market research, the total amount spent on the promotional meetings and events during 1998 was approximately $1.3 billion."
The smaller group size reduces the per-group outlier impact in o_proj and
down_proj (the linears we can't fuse AWQ scales into), which were responsible
for the long-form generation drift.
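The effect of group size on outliers can be reproduced on synthetic data. The sketch below assumes NumPy and uses a simple asymmetric min-max scheme with a floating-point offset. The weight matrix, the outlier pattern, and the helper name are invented for illustration and do not come from the checkpoint.

```python
import numpy as np


def minmax_int4_roundtrip(w, group_size):
    """Quantize each row in groups of `group_size` to 4-bit codes, then dequantize."""
    out_f, in_f = w.shape
    g = w.reshape(out_f, in_f // group_size, group_size)
    lo = g.min(axis=-1, keepdims=True)
    hi = g.max(axis=-1, keepdims=True)
    scale = np.maximum(hi - lo, 1e-8) / 15.0       # 16 levels -> 15 steps
    q = np.clip(np.round((g - lo) / scale), 0, 15)  # 4-bit codes
    return (q * scale + lo).reshape(out_f, in_f)


rng = np.random.default_rng(0)
w = rng.normal(scale=0.02, size=(64, 512))
w[:, ::128] *= 20.0  # synthetic outlier: one inflated column per 128 inputs

err128 = np.abs(minmax_int4_roundtrip(w, 128) - w).mean()
err64 = np.abs(minmax_int4_roundtrip(w, 64) - w).mean()
```

With groups of 128, every group contains an outlier column and its range is stretched. With groups of 64, half of the groups are outlier-free and keep a tight range, so the mean reconstruction error drops.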
Recipe & eval at: https://github.com/Reza2kn/lance-quant#v2-group_size-64