Instructions for using JANGQ-AI/MiniMax-M2.7-JANG_6M with libraries and local apps.

Libraries

How to use JANGQ-AI/MiniMax-M2.7-JANG_6M with MLX:

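# Make sure mlx-lm is installed
# pip install --upgrade mlx-lm

# Generate text with mlx-lm
from mlx_lm import load, generate
from mlx_lm.sample_utils import make_sampler

model, tokenizer = load("JANGQ-AI/MiniMax-M2.7-JANG_6M")

# Wrap the user prompt in the model's chat template
prompt = "Write a story about Einstein"
messages = [{"role": "user", "content": prompt}]
prompt = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True
)

# Sample at temperature 1.0: mlx-lm's default sampler is greedy, which can loop on this model
sampler = make_sampler(temp=1.0, top_p=0.95)
text = generate(model, tokenizer, prompt=prompt, sampler=sampler, verbose=True)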

Local Apps

Pi

How to use JANGQ-AI/MiniMax-M2.7-JANG_6M with Pi:

Start the MLX server

# Install MLX LM:
uv tool install mlx-lm
# Start a local OpenAI-compatible server:
mlx_lm.server --model "JANGQ-AI/MiniMax-M2.7-JANG_6M"

Configure the model in Pi

# Install Pi:
npm install -g @mariozechner/pi-coding-agent
# Add to ~/.pi/agent/models.json:
{
  "providers": {
    "mlx-lm": {
      "baseUrl": "http://localhost:8080/v1",
      "api": "openai-completions",
      "apiKey": "none",
      "models": [
        {
          "id": "JANGQ-AI/MiniMax-M2.7-JANG_6M"
        }
      ]
    }
  }
}

Run Pi

# Start Pi in your project directory:
pi

Hermes Agent

How to use JANGQ-AI/MiniMax-M2.7-JANG_6M with Hermes Agent:

Start the MLX server

# Install MLX LM:
uv tool install mlx-lm
# Start a local OpenAI-compatible server:
mlx_lm.server --model "JANGQ-AI/MiniMax-M2.7-JANG_6M"

Configure Hermes

# Install Hermes:
curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash
hermes setup
# Point Hermes at the local server:
hermes config set model.provider custom
hermes config set model.base_url http://127.0.0.1:8080/v1
hermes config set model.default JANGQ-AI/MiniMax-M2.7-JANG_6M

Run Hermes

hermes

MLX LM

How to use JANGQ-AI/MiniMax-M2.7-JANG_6M with MLX LM:

Generate or start a chat session

# Install MLX LM
uv tool install mlx-lm
# Interactive chat REPL
mlx_lm.chat --model "JANGQ-AI/MiniMax-M2.7-JANG_6M"

Run an OpenAI-compatible server

# Install MLX LM
uv tool install mlx-lm
# Start the server
mlx_lm.server --model "JANGQ-AI/MiniMax-M2.7-JANG_6M"
# Call the OpenAI-compatible server with curl (mlx_lm.server listens on port 8080 by default)
curl -X POST "http://localhost:8080/v1/chat/completions" \
   -H "Content-Type: application/json" \
   --data '{
     "model": "JANGQ-AI/MiniMax-M2.7-JANG_6M",
     "messages": [
       {"role": "user", "content": "Hello"}
     ]
   }'
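
The same request can be sent from Python with only the standard library. This is a minimal sketch, assuming the server is listening on port 8080 as in the Pi and Hermes configurations above; `build_chat_request` and `send` are illustrative helper names, not part of mlx-lm.

```python
import json
import urllib.request

# Assumption: mlx_lm.server is running locally on port 8080.
BASE_URL = "http://localhost:8080/v1"
MODEL_ID = "JANGQ-AI/MiniMax-M2.7-JANG_6M"


def build_chat_request(user_message, temperature=1.0, top_p=0.95):
    """Build (but do not send) a POST request for /chat/completions."""
    payload = {
        "model": MODEL_ID,
        "messages": [{"role": "user", "content": user_message}],
        "temperature": temperature,
        "top_p": top_p,
    }
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request):
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]
```

With the server running, `print(send(build_chat_request("Hello")))` should print the reply. Building the request separately from sending it makes the payload easy to inspect before any network call.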

MiniMax-M2.7 JANG_6M

MiniMax M2.7 228B MoE — 6.03-bit mixed precision, 167 GB

Near-lossless quantization for maximum quality on Apple Silicon.

Recommended: run in MLX Studio for the best experience, including thinking-mode support and optimized MoE inference.

Important Settings

MiniMax M2.7 is an always-reasoning model. It thinks before answering on every prompt.

| Setting | Value | Notes |
| --- | --- | --- |
| Temperature | 1.0 | REQUIRED: greedy decoding (temp=0) causes infinite thinking loops |
| Top P | 0.95 | |
| Top K | 40 | |
| Repetition Penalty | 1.1 | Optional, helps prevent loops |
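
To make the three sampling settings concrete, here is a pure-Python sketch of how temperature, top-k and top-p reshape a token distribution. It is an illustration only, not mlx-lm's implementation: the function name, the order in which the filters are applied, and the guard that rejects temperature 0 are assumptions chosen for clarity.

```python
import math


def filter_distribution(logits, temperature=1.0, top_k=40, top_p=0.95):
    """Return {token_index: probability} after temperature, top-k and top-p."""
    if temperature <= 0:
        # Illustrative guard reflecting the table's warning about greedy decoding.
        raise ValueError("temperature must be > 0 for this model")

    # Temperature-scaled softmax (subtract the max for numerical stability).
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Top-k: keep only the k most probable tokens.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    ranked = ranked[:top_k]

    # Top-p: keep the smallest prefix whose cumulative mass reaches top_p.
    kept, mass = [], 0.0
    for i in ranked:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break

    # Renormalise over the surviving tokens so the result sums to 1.
    return {i: probs[i] / mass for i in kept}
```

A sampler would then draw the next token from the returned distribution. At temperature 1.0 the logits are used unchanged, so the model's own uncertainty is preserved rather than collapsed to the single most likely token.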

Model Details

| Metric | Value |
| --- | --- |
| Source | `MiniMaxAI/MiniMax-M2.7` (FP8 E4M3) |
| Architecture | MoE (256 experts, top-8 active), GQA (48 heads / 8 KV), partial RoPE |
| Total Parameters | 228.7B |
| Active Parameters | ~1.4B per token |
| Profile | JANG_6M (CRITICAL=8-bit, IMPORTANT=6-bit, COMPRESS=6-bit) |
| Actual avg bits | 6.03 |
| Model size | 167 GB |
| Format | JANG v2 (MLX-native safetensors, instant load) |
| group_size | 128 (speed-optimized for 256 experts) |
| Routing | Sigmoid + bias correction (not softmax) |
| QK-norm | Full vector RMSNorm |
| Context | 192K tokens |
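
The "Sigmoid + bias correction" routing entry can be illustrated with a small pure-Python sketch. It follows the pattern used by several recent MoE models, where a per-expert bias affects only which experts are selected and the mixing weights come from the unbiased sigmoid scores. Whether MiniMax M2.7 applies the bias in exactly this way is an assumption, and `route` is an illustrative name.

```python
import math


def route(router_logits, correction_bias, top_k=8):
    """Pick top_k experts for one token; return [(expert_index, weight), ...]."""
    # Sigmoid scores each expert independently (softmax would couple them).
    scores = [1.0 / (1.0 + math.exp(-x)) for x in router_logits]

    # Assumption: the bias only influences WHICH experts are chosen ...
    biased = [s + b for s, b in zip(scores, correction_bias)]
    chosen = sorted(range(len(scores)), key=lambda i: biased[i], reverse=True)[:top_k]

    # ... while mixing weights come from the unbiased scores, renormalised.
    total = sum(scores[i] for i in chosen)
    return [(i, scores[i] / total) for i in chosen]
```

For the real model the inputs would have 256 entries and `top_k=8`. Because small changes in router scores can flip which experts are selected, routing is sensitive to numerical error in the router weights.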

JANG_6M Bit Allocation

| Tier | Components | Bits |
| --- | --- | --- |
| CRITICAL | Attention (Q/K/V/O), lm_head | 8 |
| IMPORTANT | Embeddings | 6 |
| COMPRESS | Expert MLP (w1/w2/w3), 98%+ of params | 6 |
| Passthrough | MoE router/gate (float16), norms, QK-norms | 16 |

JANG keeps attention at 8-bit and leaves the MoE router unquantized at float16, while compressing the 256 expert MLPs to 6-bit, where MoE models are most tolerant of quantization. Keeping the router at float16 preserves routing precision.
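
As a rough illustration of the allocation, the sketch below maps tensor names to the bit widths in the table. The tensor names and matching rules are hypothetical, chosen to resemble common checkpoint layouts; JANG Tools may classify tensors differently.

```python
# Bit widths from the JANG_6M bit-allocation table.
TIER_BITS = {"CRITICAL": 8, "IMPORTANT": 6, "COMPRESS": 6, "PASSTHROUGH": 16}

# Hypothetical attention projection name fragments.
ATTENTION_PARTS = ("q_proj", "k_proj", "v_proj", "o_proj")


def tier_for(tensor_name):
    """Map a (hypothetical) tensor name to its JANG_6M tier."""
    name = tensor_name.lower()
    # Check passthrough first so that QK-norm tensors are not treated as attention.
    if "norm" in name or name.endswith("gate.weight"):
        return "PASSTHROUGH"
    if "experts" in name:
        return "COMPRESS"
    if "embed" in name:
        return "IMPORTANT"
    if "lm_head" in name or any(part in name for part in ATTENTION_PARTS):
        return "CRITICAL"
    raise ValueError(f"no tier rule for {tensor_name!r}")


def bits_for(tensor_name):
    """Return the bit width assigned to a tensor under JANG_6M."""
    return TIER_BITS[tier_for(tensor_name)]
```

For example, `bits_for("model.layers.0.block_sparse_moe.experts.3.w1.weight")` returns 6, while the router tensor `model.layers.0.block_sparse_moe.gate.weight` stays at 16.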

MMLU Benchmarks (200q, 10 subjects, reasoning ON)

Coming soon — benchmarks in progress.

Why JANG for MiniMax

Standard MLX quantization of MiniMax produces completely broken output at all bit levels (~25% MMLU, equivalent to random guessing on four-choice questions). JANG's mixed-precision approach yields the only working quantized MiniMax on Apple Silicon.

On M2.5, JANG_2L achieved 74% MMLU versus 25% (random) for standard MLX quantization. M2.7 results are pending.

All Quantizations

| Model | Profile | Size | Avg Bits |
| --- | --- | --- | --- |
| JANG_2L | (8, 6, 2) | 63 GB | 2.10 |
| JANG_3L | (8, 4, 3) | 89 GB | 3.08 |
| JANG_4M | (8, 4, 4) | 115 GB | 4.06 |
| JANG_6M | (8, 6, 6) | 167 GB | 6.03 |
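
The listed sizes can be sanity-checked with a back-of-the-envelope estimate: parameters times average bits per weight, divided by 8 bits per byte. The sketch below uses only figures from this card. It is a rough check rather than an exact accounting, since it ignores file metadata and any difference in how gigabytes are counted, and the helper names are illustrative.

```python
TOTAL_PARAMS = 228.7e9  # "Total Parameters" from Model Details

# (average bits per weight, listed size in GB) from the table above.
QUANTS = {
    "JANG_2L": (2.10, 63),
    "JANG_3L": (3.08, 89),
    "JANG_4M": (4.06, 115),
    "JANG_6M": (6.03, 167),
}


def estimated_gb(avg_bits, params=TOTAL_PARAMS):
    """Rough size in decimal GB: parameters x bits per weight / 8 bits per byte."""
    return params * avg_bits / 8 / 1e9


def fits(listed_gb, memory_gb):
    """True if the weights alone fit; leaves no allowance for KV cache or the OS."""
    return listed_gb < memory_gb


for name, (bits, listed) in QUANTS.items():
    print(f"{name}: estimated {estimated_gb(bits):.0f} GB, listed {listed} GB")
```

Each estimate lands within about 5% of the listed size. `fits(167, 192)` is true, which matches the 192 GB requirement for JANG_6M, with the remaining memory left for the KV cache and the system.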

Requirements

Apple Silicon Mac with 192 GB unified memory
MLX framework
MLX Studio recommended

Tool Use / Agent Mode

MiniMax M2.7 interleaves thinking and tool calls: it reasons inside <think> blocks, then emits tool calls in the <minimax:tool_call> format. Some clients (Opencode, etc.) may strip the <think> block and miss the tool call.

For tool-use clients, pass enable_thinking=False when applying the chat template:

text = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True,
    enable_thinking=False  # skips <think> injection for tool-use
)

MiniMax tool call format:

<minimax:tool_call>
<invoke name="tool_name">
<parameter name="param1">value1</parameter>
</invoke>
</minimax:tool_call>
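
A client that receives raw model output can extract calls in this format with a few regular expressions. The sketch below assumes the output follows the format shown above exactly and that parameter values are plain text; `parse_tool_calls` is an illustrative helper, not part of any MiniMax or MLX library.

```python
import re

# Patterns follow the tool-call format shown above.
_CALL_BLOCK = re.compile(r"<minimax:tool_call>(.*?)</minimax:tool_call>", re.DOTALL)
_INVOKE = re.compile(r'<invoke name="([^"]+)">(.*?)</invoke>', re.DOTALL)
_PARAM = re.compile(r'<parameter name="([^"]+)">(.*?)</parameter>', re.DOTALL)


def parse_tool_calls(text):
    """Return [{"name": ..., "arguments": {...}}, ...] found in model output."""
    calls = []
    for block in _CALL_BLOCK.findall(text):
        for name, body in _INVOKE.findall(block):
            # Values are kept as raw strings; type conversion is left to the caller.
            arguments = {key: value.strip() for key, value in _PARAM.findall(body)}
            calls.append({"name": name, "arguments": arguments})
    return calls
```

Because the parser searches the whole output, it finds a call whether or not a <think> block precedes it, as long as the client has not already discarded that part of the text.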

Usage

from jang_tools.loader import load_jang_model
from mlx_lm import generate
from mlx_lm.sample_utils import make_sampler

model, tokenizer = load_jang_model("JANGQ-AI/MiniMax-M2.7-JANG_6M")
sampler = make_sampler(temp=1.0, top_p=0.95, top_k=40)  # recommended settings from the table above

prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": "What is photosynthesis?"}],
    tokenize=False, add_generation_prompt=True
)
output = generate(model, tokenizer, prompt=prompt, max_tokens=2048, sampler=sampler)
print(output)
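
Since the model reasons before answering, the generated text usually contains the reasoning followed by the final answer. A minimal sketch for separating the two, assuming the reasoning ends at the first closing </think> tag and that the opening tag may be missing when the chat template has already emitted it (`split_reasoning` is an illustrative helper):

```python
def split_reasoning(output):
    """Split generated text into (reasoning, answer) at the closing </think> tag."""
    reasoning, sep, answer = output.partition("</think>")
    if not sep:
        # No closing tag found: treat the whole text as the answer.
        return "", output.strip()
    # The opening tag may be absent if the chat template already emitted it.
    reasoning = reasoning.replace("<think>", "", 1)
    return reasoning.strip(), answer.strip()
```

Calling `reasoning, answer = split_reasoning(output)` lets an application show only the answer while keeping the reasoning available for logging. Output with several interleaved <think> blocks would need a more careful parser.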

Support

MLX Studio | JANGQ | X @dealignai

Quantized by Jinho Jang (eric@jangq.ai) using JANG Tools v2.4.1.

This model is provided for research and personal use. Users are responsible for ensuring their use complies with applicable laws and the MiniMax license.

Safetensors: model size 47B params, tensor types U32 and F16
