MiniMax-M3 — NVFP4

NVFP4 quantization of MiniMaxAI/MiniMax-M3: 428B-parameter (23B active) multimodal MoE with MiniMax Sparse Attention (MSA), 1M-token context. 259 GB — runs on 4x 96 GB GPUs at TP4 with headroom for KV cache.

What is quantized

Component Precision
Routed experts — gate_proj / up_proj / down_proj (128 experts x 57 layers) NVFP4 (block-16 E2M1, FP8-E4M3 block scales, FP32 global scale)
Dense-MLP (layers 0-2) & shared-expert down_proj NVFP4
Dense-MLP (layers 0-2) & shared-expert gate_up_proj BF16 — see note below
Attention (q/k/v/o, all 60 layers) BF16
Router (mlp.gate + e_score_correction_bias) BF16 / FP32
Embeddings, lm_head, all norms BF16
Vision tower + multimodal projector (full VL stack) BF16 — fully preserved

The repo is a complete VL model: image/video processors, chat template, and tokenizer ship alongside the weights.

Why dense/shared gate_up_proj is BF16 (an honest note, not hand-waving)

This is a scar from a calibration-config bug, kept and explained rather than papered over.

The first export used an over-broad exclusion glob *gate* — meant only for the MoE router (mlp.gate) — which silently disabled the quantizers for every gate-named module: the routed experts' gate_proj and the dense/shared gate_up_proj. That produced a 456 GB "NVFP4" export still carrying ~275 GB of un-quantized BF16 gates.

Rather than spend ~7 h recalibrating, the amaxes were recovered post-hoc — and the recovery is only exact for some of them:

  • Routed gate_proj has an exact twin: its sibling up_proj consumes the identical input tensor, so the input amax is copyable bit-for-bit and the weight amax is recomputable from the source weights. -> fully NVFP4.
  • Dense/shared gate_up_proj is a single fused [gate; up] linear with no twin to copy an input amax from. Fabricating one would be guessing, so it was left BF16. The down_proj in those same MLPs was never matched by the buggy glob, stayed calibrated, and is NVFP4 — hence the split precision within those two MLPs.

Net: the routed experts (the overwhelming majority of parameters) are fully NVFP4; only the handful of dense/shared gate_up projections are BF16 — conservative for quality, a few GB of size. For uniform precision, recalibrate from scratch with calibration/data/ (the exclusion glob is fixed in the recipe).

Serving (Docker)

A ready-to-run SM120 serving image is published — b12x stack, MSA sparse attention, swigluoai-correct MoE, with the known-good transformers baked in:

docker pull verdictai/minimax-m3-nvfp4-b12x:v1

Download this repo's weights, then mount them at /model:

docker run --gpus all --shm-size 32g -p 9211:9211 \
  -v /path/to/MiniMax-M3-NVFP4:/model \
  verdictai/minimax-m3-nvfp4-b12x:v1

Serves an OpenAI-compatible endpoint on :9211 as minimax-m3-nvfp4 (TP4, modelopt_fp4, 262K context, vision input up to 4 images/prompt). Tunable via env: TP, GPU_UTIL (default 0.93), MAXLEN (262144), MAXSEQS (16), PORT (9211).

Requirements

This is the VL architecture (model_type: minimax_m3_vl, MiniMaxM3SparseForConditionalGeneration) with MiniMax Sparse Attention. Known-good stack (baked into the Docker image above):

Component Version
transformers 5.10.2
vLLM b12x SM120 build (0.11.2.dev…b12x, MSA + swigluoai MoE)
GPUs 4x 96 GB SM120, TP4

If you serve with your own stack rather than the image, pin transformers==5.10.2 and use a vLLM build that implements MiniMax-M3 (VL). Stock vLLM does not have this architecture, and a mismatched transformers rejects the config — see Troubleshooting.

Troubleshooting

ValueError: The layer_types entries must be in (...) but got [... 'minimax_m3_sparse' ...] (transformers validate_layer_type)

The M3 sparse layers are typed minimax_m3_sparse, which a stock/newer transformers' validate_layer_type does not accept. This is a stack/version mismatch, not a corrupt file — the model loads and serves correctly on the known-good stack above (transformers 5.10.2 + the b12x vLLM build, i.e. the Docker image). Fix: use the Docker image, or pin transformers==5.10.2 and a MiniMax-M3-aware vLLM build. Do not rename minimax_m3_sparse to an "allowed" type just to silence the validator — that string selects the sparse-attention implementation, so a stack that doesn't implement it will mis-route or compute wrong.

Optional hardening for plain-transformers / trust_remote_code loaders. This repo ships configuration_minimax_m3_vl.py, a compatibility config that remaps the text backbone to a standard type so stock transformers can parse the config. To let trust_remote_code=True pick it up automatically, add an auto_map to config.json:

"auto_map": {
  "AutoConfig": "configuration_minimax_m3_vl.MiniMaxM3VLConfig"
}

The provided Docker image already serves correctly (its bundled harness registers the config), so this auto_map is only needed for non-image, plain-transformers loading paths.

Reproduction

Full conversion recipe (pipeline design, split-executor amax merge, and all gotchas): brandonmmusic-max/minimax-m3-nvfp4-recipe. Ready-to-run serving image: verdictai/minimax-m3-nvfp4-b12x:v1 (see Serving).

Calibration

ModelOpt 0.44 max-calibration through a per-expert-fidelity pipeline (each routed expert calibrated as an independent linear, then re-fused for export): 445 batches / ~136M tokens at 300K tokens per batch across three corpora (long-form deep reasoning @8K, diverse text @4K, agentic coding @4K), executed on 8x B200. Master amax tensors are included under calibration/ for reproducibility and re-export.

msa_golden/ contains SM100 (B200) golden-oracle captures of the MSA attention kernels (dense+maxscore, top-k select, block-sparse attention, NVFP4-KV, fp8 decode, LSE conventions) used to validate an SM120 port of MiniMax sparse attention — kept here as provenance for downstream kernel work.

Activation note (swigluoai)

M3 experts use the GPT-OSS-style clamped SwiGLU: (clamp(up,±7)+1) · clamp(gate,max=7) · σ(1.702·gate). Serving stacks must implement this exactly — kernels that silently fall back to plain SiLU·up will degrade quality on all 57 MoE layers. Verify your MoE backend honors swiglu_limit/alpha (or probe it numerically at startup).

Quantized with TensorRT Model Optimizer via a streaming per-expert calibration pipeline. Base model by MiniMax; MIT license per the base model.

Acknowledgements

  • Luke Alonso (HuggingFace · GitHub) — author of the b12x SM120 kernel stack used for serving and the per-expert quant-toolkit calibration pipeline this conversion is built on.
  • local-inference-lab/quant-toolkit — the streaming per-expert calibration toolkit and the open-source calibration corpora (deep_calib, diverse_calib, agentic_coding_calib*) included here under calibration/data/.
  • MiniMax — the base model, MiniMax-M3.

Files in this repo

  • weight shards + index, full VL sidecars (image/video processors, chat template, tokenizer), hf_quant_config.json
  • calibration/m3_merged_amax_gatefix.safetensors — amax statistics that regenerate this export in ~25 min with no recalibration
  • calibration/data/*.jsonl — the quant-toolkit calibration corpora (~225 MB), to recalibrate from scratch
  • msa_golden/ — SM100 kernel golden oracles (provenance for the SM120 MSA port)

Full reproduction recipe & code: https://github.com/brandonmmusic-max/minimax-m3-nvfp4-recipe

Downloads last month
591
Safetensors
Model size
246B params
Tensor type
BF16
·
F8_E4M3
·
U8
·
Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for brandonmusic/MiniMax-M3-NVFP4

Quantized
(17)
this model