hipfire-MiniMax-M2.7
hipfire-native quantizations of MiniMax-M2.7 — a 229B-parameter (≈10B active) MoE: GQA attention with per-layer QK-norm, partial rotate_half RoPE, and a 256-expert top-8 sigmoid+bias router (DeepSeek-V3-style), SwiGLU experts, no shared expert. 62 layers, hidden 3072, vocab 200064.
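For readers unfamiliar with the router style named above, here is a minimal sketch of sigmoid+bias top-k routing for a single token. It follows the DeepSeek-V3 convention, in which the bias only affects which experts are selected and the mixing weights come from the unbiased sigmoid scores. That convention, the function name `route_token`, and the toy inputs are assumptions for illustration; this is not hipfire's kernel code.

```python
import math
import random

def route_token(logits, bias, top_k=8):
    """Return [(expert_index, weight), ...] for one token (illustrative only)."""
    # Independent sigmoid score per expert (not a softmax over experts).
    scores = [1.0 / (1.0 + math.exp(-z)) for z in logits]
    # Assumption: the per-expert bias is added only for the selection step.
    biased = [s + b for s, b in zip(scores, bias)]
    chosen = sorted(range(len(scores)), key=lambda i: biased[i], reverse=True)[:top_k]
    # Mixing weights use the unbiased scores, renormalized to sum to 1.
    total = sum(scores[i] for i in chosen)
    return [(i, scores[i] / total) for i in chosen]

# Toy example: 256 experts, random router logits, zero bias.
rng = random.Random(0)
toy_logits = [rng.gauss(0.0, 1.0) for _ in range(256)]
toy_bias = [0.0] * 256
for expert, weight in route_token(toy_logits, toy_bias):
    print(f"expert {expert:3d}  weight {weight:.4f}")
```

Because the weights are renormalized over the selected experts, the bias can steer load balancing without changing how the chosen experts' outputs are mixed.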
These .mq* files run with the hipfire inference engine (HIP/ROCm-direct, no
Python in the hot path). They are not GGUF/safetensors and won't load in
llama.cpp / transformers. Attention, router, embedding, and lm_head are kept at
Q8 throughout; only the routed-expert precision varies below, and gate/up
AWQ is baked in on every quantized tier.
Files
Five tiers spanning a quality/size spectrum. Each row lists routed-expert
precision as gate/up · down; the down projection is promoted independently
(see below).
| File | gate/up · down | Size | Notes |
|---|---|---|---|
| MiniMax-M2.7.mq4 | MQ4 · MQ4 | 123.6 GB | best quality |
| MiniMax-M2.7.mq3-lloyd | MQ3-Lloyd+AWQ · MQ4 | 109.6 GB | high quality |
| MiniMax-M2.7.mq3 | MQ3-Lloyd uniform | 102.6 GB | balanced 3-bit |
| MiniMax-M2.7.mq2-lloyd | MQ2-Lloyd+AWQ · MQ4 | 86.2 GB | fits 128 GB unified (Strix Halo / gfx1151) |
| MiniMax-M2.7.mq2 | MQ2-Lloyd+AWQ · MQ3-Lloyd | 79.2 GB | smallest coherent |
Why the down projection is promoted
MiniMax-M2 has no shared expert, so routed-expert quantization error has no
high-precision path to hide behind, and uniform 2-bit experts collapse into
repetition loops. The cause is specific: the down projection's input (the
SwiGLU intermediate) carries ~588× the activation energy of the gate/up
input, so a low-bit error in w2 (the down projection weight) is amplified ~24×
more at the block output. It behaves like a precision-critical conv1d: it
doesn't degrade gracefully.
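One way to connect the two figures, assuming "activation energy" means mean squared activation and that weight errors of comparable size, $\delta W$, perturb a projection's output by $\delta W$ times its input: the perturbation then scales with the RMS of the input, so an energy ratio of ~588 corresponds to an amplitude ratio of about $\sqrt{588}$. Here $h$ denotes the SwiGLU intermediate and $x$ the gate/up input; both symbols are introduced only for this sketch.

```latex
\frac{\lVert \delta W \, h \rVert}{\lVert \delta W \, x \rVert}
\;\sim\;
\frac{\operatorname{RMS}(h)}{\operatorname{RMS}(x)}
= \sqrt{\frac{\mathbb{E}[h^2]}{\mathbb{E}[x^2]}}
\approx \sqrt{588}
\approx 24.2
```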
The fix (in every sub-mq4 tier here): keep gate/up cheap (2- or 3-bit + AWQ) but promote only the down projection. Promoting down to as little as 3-bit restores coherence; 4-bit and 6-bit progressively add richness. Per-expert promotion does not help — MiniMax's router is load-balanced (no dominant experts to anchor), so the lever is the projection, not the expert.
Validation
Forward pass validated against a PyTorch (transformers MiniMaxM2) reference
on a dimension-faithful tiny random-weight oracle (per-layer hidden-state cosine
0.9996 mq4 / 0.9994 mq3-lloyd; attention isolated 0.99990; routing
exact). Real 229B generation is coherent on all five tiers (greedy): "why is the
sky blue" → correct Rayleigh-scattering answer; "factorial of n" → correct
Python. Greedy decode can show mild repetition on open-ended prompts — use light
sampling + a repeat penalty + the chat template for production.
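The per-layer hidden-state cosine quoted above can be computed along these lines. This is a minimal sketch on toy vectors, not the actual validation harness: the function names and the example data are hypothetical, and a real comparison would use hidden states dumped from the reference and from hipfire.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def per_layer_cosine(reference_layers, candidate_layers):
    """Compare hidden states layer by layer; each layer is a flat list of floats."""
    return [cosine_similarity(r, c) for r, c in zip(reference_layers, candidate_layers)]

# Toy stand-ins for per-layer hidden states (hypothetical values).
reference = [[0.5, -1.0, 2.0, 0.25], [1.5, 0.0, -0.75, 3.0]]
candidate = [[0.51, -0.98, 2.02, 0.24], [1.48, 0.01, -0.76, 3.01]]

for layer, cos in enumerate(per_layer_cosine(reference, candidate)):
    print(f"layer {layer}: cosine = {cos:.5f}")
```

Cosine similarity ignores overall scale, so a check like this is usually paired with a magnitude comparison when absolute values matter.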
Quantization
Source: the official FP8 (E4M3 + F32 block-[128,128] weight_scale_inv)
checkpoint, dequantized to F32 and then re-quantized. gate/up AWQ uses a shared
per-layer activation scale s from an unsloth imatrix (the rescaling is
mathematically exact: (W·s) @ (x/s) = W·x). Weights are stored at the listed bit-width and dequantized to f32 on
the fly inside the GEMV kernels — the math is f32; the bit-width is a
storage/bandwidth dial. The MQ3-Lloyd indexed-decode MoE GEMV kernel and the
per-projection mixed-dtype dispatch were added to hipfire for this model.
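The AWQ rescaling identity mentioned above can be checked numerically. The sketch below assumes the scale is applied per input channel (each column of W is multiplied by its scale and the matching input element is divided by it); the tiny matrix, input, and scale values are made up for illustration and are not taken from the imatrix.

```python
import math

def matvec(W, x):
    """Plain matrix-vector product; W is a list of rows."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def awq_rescale(W, x, s):
    """Scale column j of W by s[j] and divide x[j] by s[j]."""
    W_scaled = [[w * sj for w, sj in zip(row, s)] for row in W]
    x_scaled = [v / sj for v, sj in zip(x, s)]
    return W_scaled, x_scaled

# Hypothetical 2x3 weight, input, and per-channel scales.
W = [[0.2, -1.5, 0.7], [1.1, 0.4, -0.3]]
x = [10.0, 0.5, -2.0]
s = [4.0, 0.5, 2.0]

W_scaled, x_scaled = awq_rescale(W, x, s)
y_ref = matvec(W, x)
y_awq = matvec(W_scaled, x_scaled)

# The scales cancel term by term, so both products agree up to float rounding.
for a, b in zip(y_ref, y_awq):
    assert math.isclose(a, b, rel_tol=1e-12, abs_tol=1e-12)
print(y_ref, y_awq)
```

The identity holds before quantization; the point of the rescaling is that the quantizer then rounds W·s rather than W, which changes where the rounding error lands.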
Usage (hipfire)
# Chat serving — the chat template is embedded in every tier:
HIPFIRE_JINJA_CHAT=1 hipfire serve --model MiniMax-M2.7.mq2-lloyd
# or a single raw forward (no chat template):
# examples/infer_minimax --model MiniMax-M2.7.mq2-lloyd --prompt "..."
The files carry arch_id 10 in the hipfire HFQ header.
Base model: MiniMaxAI/MiniMax-M2.7