Instructions for using DevQuasar/MiniMaxAI.MiniMax-M2-GGUF with libraries, inference providers, notebooks, and local apps. Follow the links below to get started.
- Libraries
- llama-cpp-python
How to use DevQuasar/MiniMaxAI.MiniMax-M2-GGUF with llama-cpp-python:
# !pip install llama-cpp-python
from llama_cpp import Llama

llm = Llama.from_pretrained(
    repo_id="DevQuasar/MiniMaxAI.MiniMax-M2-GGUF",
    filename="MiniMaxAI.MiniMax-M2.IQ1_M-00001-of-00004.gguf",
)

llm.create_chat_completion(
    messages = [
        {
            "role": "user",
            "content": "What is the capital of France?"
        }
    ]
)
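The call returns an OpenAI-style completion dict. A minimal sketch of extracting the reply text, assuming the standard llama-cpp-python return shape:

response = llm.create_chat_completion(
    messages=[{"role": "user", "content": "What is the capital of France?"}]
)
# The reply text lives in the first choice's message content
print(response["choices"][0]["message"]["content"])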
- Notebooks
- Google Colab
- Kaggle
- Local Apps
- llama.cpp
How to use DevQuasar/MiniMaxAI.MiniMax-M2-GGUF with llama.cpp:
Install from brew
brew install llama.cpp

# Start a local OpenAI-compatible server with a web UI:
llama-server -hf DevQuasar/MiniMaxAI.MiniMax-M2-GGUF:Q4_K_M

# Run inference directly in the terminal:
llama-cli -hf DevQuasar/MiniMaxAI.MiniMax-M2-GGUF:Q4_K_M
Install from WinGet (Windows)
winget install llama.cpp

# Start a local OpenAI-compatible server with a web UI:
llama-server -hf DevQuasar/MiniMaxAI.MiniMax-M2-GGUF:Q4_K_M

# Run inference directly in the terminal:
llama-cli -hf DevQuasar/MiniMaxAI.MiniMax-M2-GGUF:Q4_K_M
Use pre-built binary
# Download a pre-built binary from:
# https://github.com/ggerganov/llama.cpp/releases

# Start a local OpenAI-compatible server with a web UI:
./llama-server -hf DevQuasar/MiniMaxAI.MiniMax-M2-GGUF:Q4_K_M

# Run inference directly in the terminal:
./llama-cli -hf DevQuasar/MiniMaxAI.MiniMax-M2-GGUF:Q4_K_M
Build from source code
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake -B build
cmake --build build -j --target llama-server llama-cli

# Start a local OpenAI-compatible server with a web UI:
./build/bin/llama-server -hf DevQuasar/MiniMaxAI.MiniMax-M2-GGUF:Q4_K_M

# Run inference directly in the terminal:
./build/bin/llama-cli -hf DevQuasar/MiniMaxAI.MiniMax-M2-GGUF:Q4_K_M
Use Docker
docker model run hf.co/DevQuasar/MiniMaxAI.MiniMax-M2-GGUF:Q4_K_M
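Whichever route you use, llama-server exposes an OpenAI-compatible API (port 8080 by default), so you can sanity-check it from Python. A minimal sketch, assuming `pip install openai`; the api_key is a placeholder (llama-server requires no key by default):

from openai import OpenAI

# llama-server listens on http://localhost:8080 by default
client = OpenAI(base_url="http://localhost:8080/v1", api_key="none")

response = client.chat.completions.create(
    model="DevQuasar/MiniMaxAI.MiniMax-M2-GGUF:Q4_K_M",
    messages=[{"role": "user", "content": "What is the capital of France?"}],
)
print(response.choices[0].message.content)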
- LM Studio
- Jan
- vLLM
How to use DevQuasar/MiniMaxAI.MiniMax-M2-GGUF with vLLM:
Install from pip and serve model
# Install vLLM from pip:
pip install vllm

# Start the vLLM server:
vllm serve "DevQuasar/MiniMaxAI.MiniMax-M2-GGUF"

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
    -H "Content-Type: application/json" \
    --data '{
        "model": "DevQuasar/MiniMaxAI.MiniMax-M2-GGUF",
        "messages": [
            {
                "role": "user",
                "content": "What is the capital of France?"
            }
        ]
    }'
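The same endpoint also supports streaming. A minimal sketch using the openai client instead of curl, assuming `pip install openai` and the server started above:

from openai import OpenAI

# vLLM serves an OpenAI-compatible API on port 8000 by default
client = OpenAI(base_url="http://localhost:8000/v1", api_key="none")

stream = client.chat.completions.create(
    model="DevQuasar/MiniMaxAI.MiniMax-M2-GGUF",
    messages=[{"role": "user", "content": "What is the capital of France?"}],
    stream=True,
)
for chunk in stream:
    # Each chunk carries an incremental piece of the reply
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)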
Use Docker
docker model run hf.co/DevQuasar/MiniMaxAI.MiniMax-M2-GGUF:Q4_K_M
- Ollama
How to use DevQuasar/MiniMaxAI.MiniMax-M2-GGUF with Ollama:
ollama run hf.co/DevQuasar/MiniMaxAI.MiniMax-M2-GGUF:Q4_K_M
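To script the same model from Python, the official ollama client package wraps the local server. A minimal sketch, assuming `pip install ollama` and that the model tag above has already been pulled:

import ollama

# Chat against the locally pulled GGUF model
response = ollama.chat(
    model="hf.co/DevQuasar/MiniMaxAI.MiniMax-M2-GGUF:Q4_K_M",
    messages=[{"role": "user", "content": "What is the capital of France?"}],
)
print(response["message"]["content"])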
- Unsloth Studio
How to use DevQuasar/MiniMaxAI.MiniMax-M2-GGUF with Unsloth Studio:
Install Unsloth Studio (macOS, Linux, WSL)
curl -fsSL https://unsloth.ai/install.sh | sh

# Run Unsloth Studio:
unsloth studio -H 0.0.0.0 -p 8888

# Then open http://localhost:8888 in your browser
# Search for DevQuasar/MiniMaxAI.MiniMax-M2-GGUF to start chatting
Install Unsloth Studio (Windows)
irm https://unsloth.ai/install.ps1 | iex

# Run Unsloth Studio:
unsloth studio -H 0.0.0.0 -p 8888

# Then open http://localhost:8888 in your browser
# Search for DevQuasar/MiniMaxAI.MiniMax-M2-GGUF to start chatting
Using HuggingFace Spaces for Unsloth
# No setup required
# Open https://huggingface.co/spaces/unsloth/studio in your browser
# Search for DevQuasar/MiniMaxAI.MiniMax-M2-GGUF to start chatting
- Pi
How to use DevQuasar/MiniMaxAI.MiniMax-M2-GGUF with Pi:
Start the llama.cpp server
# Install llama.cpp:
brew install llama.cpp

# Start a local OpenAI-compatible server:
llama-server -hf DevQuasar/MiniMaxAI.MiniMax-M2-GGUF:Q4_K_M
Configure the model in Pi
# Install Pi:
npm install -g @mariozechner/pi-coding-agent

# Add to ~/.pi/agent/models.json:
{
  "providers": {
    "llama-cpp": {
      "baseUrl": "http://localhost:8080/v1",
      "api": "openai-completions",
      "apiKey": "none",
      "models": [
        { "id": "DevQuasar/MiniMaxAI.MiniMax-M2-GGUF:Q4_K_M" }
      ]
    }
  }
}
Run Pi
# Start Pi in your project directory:
pi
- Hermes Agent
How to use DevQuasar/MiniMaxAI.MiniMax-M2-GGUF with Hermes Agent:
Start the llama.cpp server
# Install llama.cpp:
brew install llama.cpp

# Start a local OpenAI-compatible server:
llama-server -hf DevQuasar/MiniMaxAI.MiniMax-M2-GGUF:Q4_K_M
Configure Hermes
# Install Hermes:
curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash
hermes setup

# Point Hermes at the local server:
hermes config set model.provider custom
hermes config set model.base_url http://127.0.0.1:8080/v1
hermes config set model.default DevQuasar/MiniMaxAI.MiniMax-M2-GGUF:Q4_K_M
Run Hermes
hermes
- Docker Model Runner
How to use DevQuasar/MiniMaxAI.MiniMax-M2-GGUF with Docker Model Runner:
docker model run hf.co/DevQuasar/MiniMaxAI.MiniMax-M2-GGUF:Q4_K_M
- Lemonade
How to use DevQuasar/MiniMaxAI.MiniMax-M2-GGUF with Lemonade:
Pull the model
# Download Lemonade from https://lemonade-server.ai/
lemonade pull DevQuasar/MiniMaxAI.MiniMax-M2-GGUF:Q4_K_M
Run and chat with the model
lemonade run user.MiniMaxAI.MiniMax-M2-GGUF-Q4_K_M
List all available models
lemonade list
Seems to be working with the PR and `--jinja`.
Without `--jinja` it seemed to go off the rails and spit out chat-template tokens when testing the chat completions endpoint.
This makes sense given the PR says:
not doing the chat template yet because not sure how to handle the interleaving thinking blocks.
Though with `--jinja` it seems to work okay in my limited testing and appears to be in "thinking" mode. I'm getting over 20 tok/sec generation at short context on CPU only (one big socket: an AMD EPYC 9975 with 768GB of DDR5-6400MT/s in NPS1):
model=/mnt/raid/models/DevQuasar/MiniMaxAI.MiniMax-M2-GGUF/MiniMaxAI.MiniMax-M2.Q8_0-00001-of-00019.gguf
SOCKET=0
numactl -N "$SOCKET" -m "$SOCKET" \
./build/bin/llama-server \
--model "$model"\
--alias DevQuasar/MiniMax-M2-GGUF \
--ctx-size 32768 \
-fa 1 \
-ub 4096 -b 4096 \
--parallel 1 \
--threads 96 \
--threads-batch 128 \
--numa numactl \
--host 127.0.0.1 \
--port 8080 \
--no-mmap \
--jinja \
--temp 1.0 \
--top-k 40 \
--top-p 0.95
print_info: file type = Q8_0
print_info: file size = 226.43 GiB (8.51 BPW)
Also note this model is a bit funky and wants "interleaved thinking", which is likely why the chat template has issues:
IMPORTANT: MiniMax-M2 is an interleaved thinking model. Therefore, when using it, it is important to retain the thinking content from the assistant's turns within the historical messages. In the model's output content, we use the <think>...</think> format to wrap the assistant's thinking content. When using the model, you must ensure that the historical content is passed back in its original format. Do not remove the <think>...</think> part, otherwise, the model's performance will be negatively affected.
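In practice that means appending the assistant's reply to the message history verbatim, <think>...</think> blocks included. A minimal sketch against an OpenAI-compatible endpoint such as the llama-server invocation above (the URL and model alias match that command; adjust to your setup):

from openai import OpenAI

client = OpenAI(base_url="http://127.0.0.1:8080/v1", api_key="none")
messages = [{"role": "user", "content": "What is the capital of France?"}]

reply = client.chat.completions.create(
    model="DevQuasar/MiniMax-M2-GGUF", messages=messages
)
content = reply.choices[0].message.content

# Keep the <think>...</think> section intact when passing the turn back;
# stripping it is said to degrade MiniMax-M2's interleaved thinking.
messages.append({"role": "assistant", "content": content})
messages.append({"role": "user", "content": "And of Germany?"})

followup = client.chat.completions.create(
    model="DevQuasar/MiniMax-M2-GGUF", messages=messages
)
print(followup.choices[0].message.content)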
Thanks for making these available for testing!
I ran perplexity on the Q8_0 with my usual setup: 512 context size over the full ~1.3MB wiki.test.raw file:
model=/mnt/raid/models/DevQuasar/MiniMaxAI.MiniMax-M2-GGUF/MiniMaxAI.MiniMax-M2.Q8_0-00001-of-00019.gguf
SOCKET=0
numactl -N "$SOCKET" -m "$SOCKET" \
./build/bin/llama-perplexity \
-m "$model" \
-f wiki.test.raw \
--seed 1337 \
-fa 1 \
--ctx-size 512 \
-ub 4096 -b 4096 \
--numa numactl \
--threads 96 \
--threads-batch 128 \
--no-mmap
...
Final estimate: PPL = 6.9930 +/- 0.04889
What I've seen is that it often skips the <think> tag, but only in the first response. Later it seems to use <think> correctly.
I downloaded the latest source code for llama.cpp and compiled it, but it still doesn't work. Could it be that the code hasn't been merged into the main project yet? The error message is: "llama_model_load: error loading model: error loading model architecture: unknown model architecture: 'minimax-m2'". Additionally, "llama_model_load_from_file_impl: failed to load model" is also displayed.
Build llama.cpp from the PR branch; the PR has not been merged yet:
https://github.com/ggml-org/llama.cpp/pull/16831