MiniMax-M3 — NVFP4
NVFP4 quantization of MiniMaxAI/MiniMax-M3: 428B-parameter (23B active) multimodal MoE with MiniMax Sparse Attention (MSA), 1M-token context. 259 GB — runs on 4x 96 GB GPUs at TP4 with headroom for KV cache.
What is quantized
| Component | Precision |
|---|---|
Routed experts — gate_proj / up_proj / down_proj (128 experts x 57 layers) |
NVFP4 (block-16 E2M1, FP8-E4M3 block scales, FP32 global scale) |
Dense-MLP (layers 0-2) & shared-expert down_proj |
NVFP4 |
Dense-MLP (layers 0-2) & shared-expert gate_up_proj |
BF16 — see note below |
| Attention (q/k/v/o, all 60 layers) | BF16 |
Router (mlp.gate + e_score_correction_bias) |
BF16 / FP32 |
Embeddings, lm_head, all norms |
BF16 |
| Vision tower + multimodal projector (full VL stack) | BF16 — fully preserved |
The repo is a complete VL model: image/video processors, chat template, and tokenizer ship alongside the weights.
Why dense/shared gate_up_proj is BF16 (an honest note, not hand-waving)
This is a scar from a calibration-config bug, kept and explained rather than papered over.
The first export used an over-broad exclusion glob *gate* — meant only for the MoE router (mlp.gate) — which silently disabled the quantizers for every gate-named module: the routed experts' gate_proj and the dense/shared gate_up_proj. That produced a 456 GB "NVFP4" export still carrying ~275 GB of un-quantized BF16 gates.
Rather than spend ~7 h recalibrating, the amaxes were recovered post-hoc — and the recovery is only exact for some of them:
- Routed
gate_projhas an exact twin: its siblingup_projconsumes the identical input tensor, so the input amax is copyable bit-for-bit and the weight amax is recomputable from the source weights. -> fully NVFP4. - Dense/shared
gate_up_projis a single fused[gate; up]linear with no twin to copy an input amax from. Fabricating one would be guessing, so it was left BF16. Thedown_projin those same MLPs was never matched by the buggy glob, stayed calibrated, and is NVFP4 — hence the split precision within those two MLPs.
Net: the routed experts (the overwhelming majority of parameters) are fully NVFP4; only the handful of dense/shared gate_up projections are BF16 — conservative for quality, a few GB of size. For uniform precision, recalibrate from scratch with calibration/data/ (the exclusion glob is fixed in the recipe).
Serving (Docker)
A ready-to-run SM120 serving image is published — b12x stack, MSA sparse attention, swigluoai-correct MoE, with the known-good transformers baked in:
docker pull verdictai/minimax-m3-nvfp4-b12x:v1
Download this repo's weights, then mount them at /model:
docker run --gpus all --shm-size 32g -p 9211:9211 \
-v /path/to/MiniMax-M3-NVFP4:/model \
verdictai/minimax-m3-nvfp4-b12x:v1
Serves an OpenAI-compatible endpoint on :9211 as minimax-m3-nvfp4 (TP4, modelopt_fp4, 262K context, vision input up to 4 images/prompt). Tunable via env: TP, GPU_UTIL (default 0.93), MAXLEN (262144), MAXSEQS (16), PORT (9211).
Requirements
This is the VL architecture (model_type: minimax_m3_vl, MiniMaxM3SparseForConditionalGeneration) with MiniMax Sparse Attention. Known-good stack (baked into the Docker image above):
| Component | Version |
|---|---|
| transformers | 5.10.2 |
| vLLM | b12x SM120 build (0.11.2.dev…b12x, MSA + swigluoai MoE) |
| GPUs | 4x 96 GB SM120, TP4 |
If you serve with your own stack rather than the image, pin transformers==5.10.2 and use a vLLM build that implements MiniMax-M3 (VL). Stock vLLM does not have this architecture, and a mismatched transformers rejects the config — see Troubleshooting.
Troubleshooting
ValueError: The layer_types entries must be in (...) but got [... 'minimax_m3_sparse' ...] (transformers validate_layer_type)
The M3 sparse layers are typed minimax_m3_sparse, which a stock/newer transformers' validate_layer_type does not accept. This is a stack/version mismatch, not a corrupt file — the model loads and serves correctly on the known-good stack above (transformers 5.10.2 + the b12x vLLM build, i.e. the Docker image). Fix: use the Docker image, or pin transformers==5.10.2 and a MiniMax-M3-aware vLLM build. Do not rename minimax_m3_sparse to an "allowed" type just to silence the validator — that string selects the sparse-attention implementation, so a stack that doesn't implement it will mis-route or compute wrong.
Optional hardening for plain-transformers / trust_remote_code loaders. This repo ships configuration_minimax_m3_vl.py, a compatibility config that remaps the text backbone to a standard type so stock transformers can parse the config. To let trust_remote_code=True pick it up automatically, add an auto_map to config.json:
"auto_map": {
"AutoConfig": "configuration_minimax_m3_vl.MiniMaxM3VLConfig"
}
The provided Docker image already serves correctly (its bundled harness registers the config), so this auto_map is only needed for non-image, plain-transformers loading paths.
Reproduction
Full conversion recipe (pipeline design, split-executor amax merge, and all gotchas): brandonmmusic-max/minimax-m3-nvfp4-recipe. Ready-to-run serving image: verdictai/minimax-m3-nvfp4-b12x:v1 (see Serving).
Calibration
ModelOpt 0.44 max-calibration through a per-expert-fidelity pipeline (each routed expert calibrated as an independent linear, then re-fused for export): 445 batches / ~136M tokens at 300K tokens per batch across three corpora (long-form deep reasoning @8K, diverse text @4K, agentic coding @4K), executed on 8x B200. Master amax tensors are included under calibration/ for reproducibility and re-export.
msa_golden/ contains SM100 (B200) golden-oracle captures of the MSA attention kernels (dense+maxscore, top-k select, block-sparse attention, NVFP4-KV, fp8 decode, LSE conventions) used to validate an SM120 port of MiniMax sparse attention — kept here as provenance for downstream kernel work.
Activation note (swigluoai)
M3 experts use the GPT-OSS-style clamped SwiGLU: (clamp(up,±7)+1) · clamp(gate,max=7) · σ(1.702·gate). Serving stacks must implement this exactly — kernels that silently fall back to plain SiLU·up will degrade quality on all 57 MoE layers. Verify your MoE backend honors swiglu_limit/alpha (or probe it numerically at startup).
Quantized with TensorRT Model Optimizer via a streaming per-expert calibration pipeline. Base model by MiniMax; MIT license per the base model.
Acknowledgements
- Luke Alonso (HuggingFace · GitHub) — author of the b12x SM120 kernel stack used for serving and the per-expert quant-toolkit calibration pipeline this conversion is built on.
- local-inference-lab/quant-toolkit — the streaming per-expert calibration toolkit and the open-source calibration corpora (
deep_calib,diverse_calib,agentic_coding_calib*) included here undercalibration/data/. - MiniMax — the base model, MiniMax-M3.
Files in this repo
- weight shards + index, full VL sidecars (image/video processors, chat template, tokenizer),
hf_quant_config.json calibration/m3_merged_amax_gatefix.safetensors— amax statistics that regenerate this export in ~25 min with no recalibrationcalibration/data/*.jsonl— the quant-toolkit calibration corpora (~225 MB), to recalibrate from scratchmsa_golden/— SM100 kernel golden oracles (provenance for the SM120 MSA port)
Full reproduction recipe & code: https://github.com/brandonmmusic-max/minimax-m3-nvfp4-recipe
- Downloads last month
- 591
Model tree for brandonmusic/MiniMax-M3-NVFP4
Base model
MiniMaxAI/MiniMax-M3