hipfire-MiniMax-M2.7

hipfire-native quantizations of MiniMax-M2.7, a 229B-parameter (≈10B active) MoE. Architecture: GQA attention with per-layer QK-norm, partial rotate_half RoPE, a 256-expert top-8 sigmoid+bias router (DeepSeek-V3-style), SwiGLU experts, and no shared expert. 62 layers, hidden 3072, vocab 200064.
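For readers unfamiliar with the router style named above, here is a minimal sketch of sigmoid+bias top-k routing for a single token. It follows the DeepSeek-V3 convention as an assumption: the bias influences only which experts are selected, and the gate weights are the unbiased sigmoid scores renormalized over the selected experts. The function name `route` and the toy inputs are hypothetical and are not taken from hipfire.

```python
import math
import random


def route(logits, bias, top_k=8):
    """Return (expert_ids, gate_weights) for one token."""
    # Sigmoid score per expert (no softmax across experts).
    scores = [1.0 / (1.0 + math.exp(-z)) for z in logits]
    # Assumption: the bias is used for selection only, not for the gate weights.
    biased = [s + b for s, b in zip(scores, bias)]
    chosen = sorted(range(len(scores)), key=lambda e: biased[e], reverse=True)[:top_k]
    # Gate weights: unbiased scores of the chosen experts, normalized to sum to 1.
    total = sum(scores[e] for e in chosen)
    weights = [scores[e] / total for e in chosen]
    return chosen, weights


# Toy usage: 256 experts, top-8, random router logits, zero bias.
rng = random.Random(0)
logits = [rng.gauss(0.0, 1.0) for _ in range(256)]
ids, gate = route(logits, [0.0] * 256)
```

Any extra scaling factor applied to the gate weights is omitted here.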

These .mq* files run with the hipfire inference engine (HIP/ROCm-direct, no Python in the hot path). They are not GGUF/safetensors and won't load in llama.cpp / transformers. Attention, router, embedding, and lm_head are kept at Q8 throughout; only the routed-expert precision varies below, and gate/up AWQ is baked in on every quantized tier.

Files

A 5-tier quality/size spectrum. Each routed expert's precision is given as gate/up · down; the down projection is promoted independently (see below).

| File | gate/up · down | Size | Notes |
|---|---|---|---|
| MiniMax-M2.7.mq4 | MQ4 · MQ4 | 123.6 GB | best quality |
| MiniMax-M2.7.mq3-lloyd | MQ3-Lloyd+AWQ · MQ4 | 109.6 GB | high quality |
| MiniMax-M2.7.mq3 | MQ3-Lloyd uniform | 102.6 GB | balanced 3-bit |
| MiniMax-M2.7.mq2-lloyd | MQ2-Lloyd+AWQ · MQ4 | 86.2 GB | fits 128 GB unified (Strix Halo / gfx1151) |
| MiniMax-M2.7.mq2 | MQ2-Lloyd+AWQ · MQ3-Lloyd | 79.2 GB | smallest coherent |
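As a rough sanity check on the sizes listed above, the sketch below converts each file size into an average number of stored bits per parameter. It assumes that 1 GB means 10^9 bytes and that the 229B figure counts every stored weight; both are assumptions, not statements from the hipfire documentation. The result is a whole-file average that includes the Q8 attention, router, embedding, and lm_head tensors plus any scales and headers, so it is higher than the nominal routed-expert bit-width.

```python
# File sizes in GB, copied from the tier listing above.
TIERS_GB = {
    "MiniMax-M2.7.mq4": 123.6,
    "MiniMax-M2.7.mq3-lloyd": 109.6,
    "MiniMax-M2.7.mq3": 102.6,
    "MiniMax-M2.7.mq2-lloyd": 86.2,
    "MiniMax-M2.7.mq2": 79.2,
}
TOTAL_PARAMS = 229e9  # assumption: the 229B figure covers all stored weights


def bits_per_param(size_gb, params=TOTAL_PARAMS):
    """Average stored bits per parameter, assuming 1 GB = 1e9 bytes."""
    return size_gb * 1e9 * 8 / params


effective_bits = {name: bits_per_param(gb) for name, gb in TIERS_GB.items()}
for name, bits in effective_bits.items():
    print(f"{name:24s} {bits:.2f} bits/param")
```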

Why the down projection is promoted

MiniMax-M2 has no shared expert, so routed-expert quantization error has no high-precision path to hide behind, and uniform 2-bit experts collapse into repetition loops. The cause is specific: the down projection's input (the SwiGLU intermediate) carries ~588× the activation energy of the gate/up input, so a low-bit error in w2 is amplified ~24× more at the block output. It behaves like a precision-critical conv1d — it doesn't degrade gracefully.
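The two figures above are consistent with each other under a simple reading, stated here as an assumption: "activation energy" means mean squared activation magnitude, and for a fixed weight perturbation ΔW the output error of a linear projection scales with the RMS of its input. With h the SwiGLU intermediate (input to down) and x the block input (input to gate/up):

```latex
\[
\frac{\lVert \Delta W_{\text{down}}\, h \rVert}{\lVert \Delta W_{\text{gate/up}}\, x \rVert}
\;\propto\;
\sqrt{\frac{\mathbb{E}\,\lVert h \rVert^{2}}{\mathbb{E}\,\lVert x \rVert^{2}}}
\;\approx\; \sqrt{588} \;\approx\; 24.2
\]
```

That is, a ~588× ratio in energy corresponds to a ~24× ratio in amplitude, which matches the amplification quoted above.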

The fix (in every sub-mq4 tier here): keep gate/up cheap (2- or 3-bit + AWQ) but promote only the down projection. Promoting down to as little as 3-bit restores coherence; 4-bit and 6-bit progressively add richness. Per-expert promotion does not help — MiniMax's router is load-balanced (no dominant experts to anchor), so the lever is the projection, not the expert.

Validation

Forward pass validated against a PyTorch (transformers MiniMaxM2) reference on a dimension-faithful tiny random-weight oracle (per-layer hidden-state cosine 0.9996 mq4 / 0.9994 mq3-lloyd; attention isolated 0.99990; routing exact). Real 229B generation is coherent on all five tiers (greedy): "why is the sky blue" → correct Rayleigh-scattering answer; "factorial of n" → correct Python. Greedy decode can show mild repetition on open-ended prompts — use light sampling + a repeat penalty + the chat template for production.
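The per-layer hidden-state cosine quoted above can be computed with a few lines of code. The following is a minimal sketch, assuming each layer's hidden state has been flattened to a plain list of floats for both the reference and the engine under test; the helper names `cosine` and `per_layer_cosine` are hypothetical and do not describe the actual validation harness.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def per_layer_cosine(ref_layers, test_layers):
    """One cosine per layer, comparing reference and test hidden states."""
    return [cosine(r, t) for r, t in zip(ref_layers, test_layers)]


# Toy usage: two "layers" of hidden states, the second slightly perturbed.
ref = [[1.0, 2.0, 3.0], [0.5, -1.0, 2.0]]
test = [[1.0, 2.0, 3.0], [0.5, -1.0, 2.1]]
sims = per_layer_cosine(ref, test)
```

Cosine similarity ignores overall scale, so a check like this is usually paired with a magnitude comparison when absolute error matters.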

Quantization

Source: the official FP8 (E4M3 + F32 block-[128,128] weight_scale_inv) checkpoint, dequantized to F32 then re-quantized. gate/up AWQ uses a shared per-layer activation scale from an unsloth imatrix (the identity (W·s) @ (x/s) = W·x is exact). Weights are stored at the listed bit-width and dequantized to f32 on the fly inside the GEMV kernels: the math is f32, and the bit-width is a storage/bandwidth dial. The MQ3-Lloyd indexed-decode MoE GEMV kernel and the per-projection mixed-dtype dispatch were added to hipfire for this model.
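The AWQ scaling identity mentioned above can be checked numerically. This is a minimal sketch on a tiny random matrix, assuming a per-input-channel scale vector s: each column of W is multiplied by its scale and the matching input element is divided by it. The shapes, the scale range, and the helper name `matvec` are illustrative assumptions. In floating point the two results agree up to rounding error.

```python
import random


def matvec(W, x):
    """Dense matrix-vector product with W stored as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


rng = random.Random(0)
rows, cols = 4, 6
W = [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]
x = [rng.gauss(0.0, 1.0) for _ in range(cols)]
s = [rng.uniform(0.5, 4.0) for _ in range(cols)]  # hypothetical per-channel scales

# Scale each input channel (column) of W up and the matching activation down.
W_scaled = [[w * sj for w, sj in zip(row, s)] for row in W]
x_scaled = [xi / si for xi, si in zip(x, s)]

y_ref = matvec(W, x)
y_awq = matvec(W_scaled, x_scaled)
max_diff = max(abs(a - b) for a, b in zip(y_ref, y_awq))
print(f"max |W x - (W*s)(x/s)| = {max_diff:.3e}")
```

The identity only says the scaling is free before quantization. Any benefit comes from quantizing the scaled weights, so that rounding error is smaller, relative to the weight, on the channels with the largest scales.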

Usage (hipfire)

# Chat serving — the chat template is embedded in every tier:
HIPFIRE_JINJA_CHAT=1 hipfire serve --model MiniMax-M2.7.mq2-lloyd
# or a single raw forward (no chat template):
# examples/infer_minimax --model MiniMax-M2.7.mq2-lloyd --prompt "..."

These files carry arch_id 10 in the hipfire HFQ header.
