SHARE YOUR PERPLEXITY RESULTS

by ox-ox - opened Feb 13

Discussion

ox-ox

Owner Feb 13

Just ran PPL on my Q3_K_L (110.22 GiB). Got a Final PPL of 8.2213 (+/- 0.09) on WikiText-2. It seems that going the FP8 -> F16 Master -> Q3_K_L route really paid off compared to standard quants. It beats the IQ4_XS efficiency curve while fitting perfectly in 128GB RAM at 28.7 t/s.

ubergarm

Feb 13

Exact command and details on hardware backend are important. See here for more discussions: https://huggingface.co/ubergarm/MiniMax-M2.5-GGUF/discussions/3

I'll test your quant and let you know how it fares against mine.

ox-ox

Owner Feb 13

Thanks for jumping in!

Yes, I just replied on your thread as well—I realized I was running -c 4096 while you were on -c 512, which explains why my initial PPL (8.22) looked "too good to be true" compared to the baseline.

I am currently re-running the test with your exact parameters (-c 512, -b 2048) on my M3 Max to get an apples-to-apples comparison. ETA is ~15 mins.

Having you test the file independently on your backend would be amazing. If it holds up on your rig, that validates the Q3_K_L as a solid daily driver for 128GB users!
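For readers following along, here is a minimal sketch of the standard perplexity definition (the exponential of the mean negative log-likelihood per token). It is not llama.cpp's exact implementation, and the log-probabilities below are made-up placeholders, not measurements. It only illustrates why a run where tokens see more context (and so get higher probabilities on average) reports a lower PPL than a short-context run on the same text.

```python
import math

def perplexity(token_logprobs):
    """Perplexity = exp(mean negative log-likelihood per token)."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(mean_nll)

# Hypothetical values: with a short context the model assigns each token
# probability 0.10; with a longer context it assigns 0.125.
short_ctx = [math.log(0.10)] * 100
long_ctx = [math.log(0.125)] * 100

print(perplexity(short_ctx))  # approximately 10
print(perplexity(long_ctx))   # approximately 8
```

This is why PPL numbers are only comparable when the context size and the test file match.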

mayhem4markets

Feb 13

Happy to share results, just let me know if there is a specific command that I should utilize.

ox-ox

Owner Feb 13

Yes! Download this file and run the command below from inside your bin directory: https://drive.google.com/file/d/1wXEK2rhNEeaIL94JiYtC8LdjPC9LNk8j/view?usp=sharing
./llama-perplexity \
  -m /Users/xx/llama.cpp/models/t1.gguf \
  -f wiki.test.raw \
  -c 512 \
  -b 2048 \
  --seed 1337 \
  -ngl 99
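
If you want to collect results from several runs, a small helper like the one below may be handy. It is only a sketch: it assumes the tool's summary line has the form quoted later in this thread (`Final estimate: PPL = 8.7948 +/- 0.07100`), and the function name `parse_final_ppl` is made up for illustration.

```python
import re

# Pattern for the summary line as it is quoted in this thread.
FINAL_RE = re.compile(r"Final estimate: PPL = ([0-9.]+) \+/- ([0-9.]+)")

def parse_final_ppl(output):
    """Return (ppl, error) from captured tool output, or None if absent."""
    match = FINAL_RE.search(output)
    if match is None:
        return None
    return float(match.group(1)), float(match.group(2))

sample = "Final estimate: PPL = 8.7948 +/- 0.07100"
print(parse_final_ppl(sample))  # (8.7948, 0.071)
```

Paste the captured stdout of your run into `parse_final_ppl` and post both numbers along with your hardware and backend.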

ubergarm

Feb 13

Yeah, the default mainline llama.cpp recipes can't compete with custom recipes and especially ik_llama.cpp SOTA quantization types.

Great job getting some quants out quickly though, minmaxing recipes is an exciting and unique hobby lmao... Catch you on Beaver AI holler at me!

ox-ox

Owner Feb 13 • edited Feb 13

This graph is gold. Thanks for running the benchmarks and plotting the data points!

It clearly shows the trade-off: your IQ4_XS custom mix is indeed the quality king (better PPL), while my Q3_K_L sits as the "Mainline/Standard" option for those who need to save that extra ~5GB of VRAM or want to stick to the default llama.cpp builds without custom forks.

I'm more than happy to lose the PPL battle to learn from the best. I'll definitely take you up on the invite to Beaver AI, I have a lot to learn about ik_llama recipes. See you there!

ubergarm

Feb 13

> your IQ4_XS custom mix is indeed the quality king (better PPL), while my Q3_K_L sits as the "Mainline/Standard" option for those who need to save that extra ~5GB of VRAM or want to stick to the default llama.cpp builds without custom forks.

Thanks, and to be clear on a few points here again:

The IQ4_XS is a standard llama.cpp-compatible quant and requires no special custom forks; it just works. Custom quants are built into mainline llama.cpp.
Many people are going to use this model for hybrid CPU + GPU inference, so DRAM + VRAM. They might save ~5GB of DRAM, not VRAM. Your quant does have smaller attn.* tensors relative to mine, so they might see 10% faster TG with your quant, given that TG is memory-bandwidth limited in most cases. They would only save maybe ~1GB of VRAM using the standard --cpu-moe flag.

I know I'm kind of pedantic sorry not sorry lol.

Cheers, and thanks for your attention. Keep on learning and you'll be cranking out custom quants that optimize the trade-off of quality and speed across various target hardware platforms!

ox-ox

Owner Feb 13

Pedantic is exactly what I need right now! No apology needed; this is how I learn.

The Data (Final Run):
I just finished the run with your exact parameters (-c 512, --seed 1337) on M3 Max:
Final estimate: PPL = 8.7948 +/- 0.07100

So the final scoreboard is:

Your IQ4_XS Mix: ~8.57 PPL (Clear winner on Quality/Reasoning)

My Q3_K_L: ~8.79 PPL (+0.22 delta)

The Speed/Bandwidth Theory:
You nailed it on the bandwidth limitation. My attn.* tensors being smaller (Q3 vs your Q8) likely contributes to the ~28.7 t/s I'm seeing. It feels very snappy for a 230B model on a single machine.

Thanks for clarifying the DRAM vs VRAM distinction and the status of IQ quants in mainline. I'll update my mental model (and my repo description) accordingly.

I'm off to watch your talk and dive into the tensor-level quantization docs. Thanks for the crash course tonight!
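
As a quick check of the "+0.22 delta" in the scoreboard above, here is a small sketch of the arithmetic. It takes 8.5702 for the IQ4_XS mix (the value in ubergarm's table further down the thread) and 8.7948 for the Q3_K_L run on the M3 Max; the helper name `compare` is made up for illustration.

```python
def compare(ppl_a, ppl_b):
    """Return the absolute and relative (percent) PPL increase of b over a."""
    delta = ppl_b - ppl_a
    return delta, 100 * delta / ppl_a

delta, pct = compare(8.5702, 8.7948)
print(f"delta = {delta:.4f} PPL ({pct:.2f}% higher)")
# delta = 0.2246 PPL (2.62% higher)
```

Note that the two numbers come from different machines, so part of the gap may be a backend effect rather than the quant itself.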

ubergarm

Feb 13

Your value is now closer to what I measured for your Q3_K_L: 8.8377 +/- 0.07155

The backend can have an effect, which is another reason I like to measure them all on the exact same hardware. But it's definitely more in line with what I might expect. Here are my full values; click into the fold:

👈 Details

[
  {
    "name": "BF16",
    "ppl": "8.3386 +/- 0.06651",
    "size": 426.060,
    "bpw": 16.003,
    "legend": "full quality",
    "comment": "",
    "skip": true
  },
  {
    "name": "Q8_0",
    "ppl": "8.3590 +/- 0.06673",
    "size": 226.431,
    "bpw": 8.505,
    "legend": "full quality",
    "comment": "should be full quality as original is fp8",
    "skip": true
  },
  {
    "name": "IQ5_K",
    "ppl": "8.4860 +/- 0.06815",
    "size": 157.771,
    "bpw": 5.926,
    "legend": "ubergarm",
    "comment": ""
  },
  {
    "name": "IQ4_XS\n(mainline compatible)",
    "ppl": "8.5702 +/- 0.06901",
    "size": 114.842,
    "bpw": 4.314,
    "legend": "ubergarm",
    "comment": "smol, with imatrix, mainline compat quant"
  },
  {
    "name": "Q3_K_L",
    "ppl": "8.8377 +/- 0.07155",
    "size": 110.215,
    "bpw": 4.140,
    "legend": "ox-ox",
    "comment": "https://huggingface.co/ox-ox/MiniMax-M2.5-GGUF"
  },
  {
    "name": "smol-IQ3_KS",
    "ppl": "8.7539 +/- 0.07075",
    "size": 87.237,
    "bpw": 3.277,
    "legend": "ubergarm",
    "comment": "full q8_0 attn.*"
  },
  {
    "name": "derp-smol-IQ3_KS",
    "ppl": "8.8293 +/- 0.07164",
    "size": 86.641,
    "bpw": 3.254,
    "legend": "unreleased",
    "comment": "iq6_k attn.*"
  },
  {
    "name": "IQ2_KS",
    "ppl": "9.6827 +/- 0.07972",
    "size": 69.800,
    "bpw": 2.622,
    "legend": "ubergarm",
    "comment": "full q8_0 attn.*"
  }
]
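
One way to read the table above is sketched below, using three rows copied from it. The sketch assumes `size` is in GiB (it matches the 110.22 GiB quoted at the top of the thread) and `bpw` is bits per weight. It prints each quant's PPL increase over BF16 and the weight count implied by size and bpw, which should come out roughly the same for every row if those assumptions hold.

```python
import json

GIB = 2**30  # bytes per GiB

# Three rows copied from the table above; the other rows have the same shape.
rows = json.loads("""
[
  {"name": "BF16",   "ppl": "8.3386 +/- 0.06651", "size": 426.060, "bpw": 16.003},
  {"name": "IQ4_XS", "ppl": "8.5702 +/- 0.06901", "size": 114.842, "bpw": 4.314},
  {"name": "Q3_K_L", "ppl": "8.8377 +/- 0.07155", "size": 110.215, "bpw": 4.140}
]
""")

def ppl_value(row):
    """Extract the central PPL estimate from a 'value +/- error' string."""
    return float(row["ppl"].split("+/-")[0])

def implied_params_billion(row):
    """Weights implied by file size and bits per weight (assumes GiB sizes)."""
    return row["size"] * GIB * 8 / row["bpw"] / 1e9

baseline = ppl_value(rows[0])  # BF16 is the reference row
for row in rows:
    increase = (ppl_value(row) / baseline - 1) * 100
    print(f'{row["name"]:8s} +{increase:.2f}% PPL vs BF16, '
          f'~{implied_params_billion(row):.1f}B weights')
```

The implied weight count lands near the "230B" figure mentioned earlier in the thread, which is a useful sanity check that the size and bpw columns are consistent.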

ox-ox

Owner Feb 13

Awesome to see the Q3_K_L officially in the mix! Thanks for the bench and the inclusion in the table. The delta is exactly what I expected. Back to the lab to check out your talk now. Cheers!
