SHARE YOUR PERPLEXITY RESULTS
Just ran PPL on my Q3_K_L (110.22 GiB). Got a Final PPL of 8.2213 (+/- 0.09) on WikiText-2. It seems that going the FP8 -> F16 Master -> Q3_K_L route really paid off compared to standard quants. It beats the IQ4_XS efficiency curve while fitting perfectly in 128GB RAM at 28.7 t/s.
Exact command and details on hardware backend are important. See here for more discussions: https://huggingface.co/ubergarm/MiniMax-M2.5-GGUF/discussions/3
I'll test your quant and let you know how it fares against mine.
Thanks for jumping in!
Yes, I just replied on your thread as well. I realized I was running -c 4096 while you were on -c 512, which explains why my initial PPL (8.22) looked "too good to be true" compared to the baseline.
I am currently re-running the test with your exact parameters (-c 512, -b 2048) on my M3 Max to get an apples-to-apples comparison. ETA is ~15 mins.
Having you test the file independently on your backend would be amazing. If it holds up on your rig, that validates the Q3_K_L as a solid daily driver for 128GB users!
Happy to share results, just let me know if there is a specific command that I should utilize.
Yes, download this file and run the following command from inside your bin directory: https://drive.google.com/file/d/1wXEK2rhNEeaIL94JiYtC8LdjPC9LNk8j/view?usp=sharing
./llama-perplexity \
  -m /Users/xx/llama.cpp/models/t1.gguf \
  -f wiki.test.raw \
  -c 512 \
  -b 2048 \
  --seed 1337 \
  -ngl 99
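For anyone following along who is new to the metric: perplexity is the exponential of the mean per-token negative log-likelihood over the test text. The sketch below is only an illustration of that definition, not what llama-perplexity does internally. The function name and the toy probabilities are hypothetical. It also hints at why a longer context (-c 4096 vs -c 512) tends to give a lower PPL: tokens scored with more preceding context usually receive a higher probability.

```python
import math

def perplexity(token_logprobs):
    """Return exp(mean negative log-likelihood) for natural-log token probabilities."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(mean_nll)

# Toy values (hypothetical, not measured): the same tokens scored with
# a short context window and with a longer one.
short_ctx = [math.log(p) for p in (0.10, 0.12, 0.11, 0.13)]
long_ctx = [math.log(p) for p in (0.12, 0.14, 0.12, 0.15)]

print(f"short context PPL: {perplexity(short_ctx):.2f}")
print(f"long context PPL:  {perplexity(long_ctx):.2f}")
```

This is why PPL numbers are only comparable when the context size, batch size, and test file all match.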
This graph is gold. Thanks for running the benchmarks and plotting the data points!
It clearly shows the trade-off: your IQ4_XS custom mix is indeed the quality king (better PPL), while my Q3_K_L sits as the "Mainline/Standard" option for those who need to save that extra ~5GB of VRAM or want to stick to the default llama.cpp builds without custom forks.
I'm more than happy to lose the PPL battle to learn from the best. I'll definitely take you up on the invite to Beaver AI, I have a lot to learn about ik_llama recipes. See you there!
your IQ4_XS custom mix is indeed the quality king (better PPL), while my Q3_K_L sits as the "Mainline/Standard" option for those who need to save that extra ~5GB of VRAM or want to stick to the default llama.cpp builds without custom forks.
Thanks, and to be clear on a few points here again:
- The IQ4_XS is a standard llama.cpp compatible quant and requires no special custom forks, it just works. Custom quants are built into mainline llama.cpp.
- Many people are going to use this model for hybrid CPU + GPU so DRAM + VRAM. They might save ~5GB of DRAM not VRAM. Your quant does have smaller attn.* tensors relative to mine so they might see 10% faster TG with your quant given that is memory bandwidth limited in most cases. They would only save maybe ~1GB VRAM using the standard --cpu-moe flag.
I know I'm kind of pedantic sorry not sorry lol.
Cheers and thanks for your attention and keep on learning you'll be cranking out custom quants optimizing that trade-off of quality and speed across various target hardware platforms!
Pedantic is exactly what I need right now! No apology needed, this is how I learn.
- The Data (Final Run):
I just finished the run with your exact parameters (-c 512, --seed 1337) on M3 Max:
Final estimate: PPL = 8.7948 +/- 0.07100
So the final scoreboard is:
Your IQ4_XS Mix: ~8.57 PPL (Clear winner on Quality/Reasoning)
My Q3_K_L: ~8.79 PPL (+0.22 delta)
- The Speed/Bandwidth Theory:
You nailed it on the bandwidth limitation. My attn.* tensors being smaller (Q3 vs your Q8) likely contributes to the ~28.7 t/s I'm seeing. It feels very snappy for a 230B model on a single machine.
Thanks for clarifying the DRAM vs VRAM distinction and the status of IQ quants in mainline. I'll update my mental model (and my repo description) accordingly.
I'm off to watch your talk and dive into the tensor-level quantization docs. Thanks for the crash course tonight!
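For reference, the scoreboard delta above can be reproduced with a few lines of Python. This is a rough sketch using only the numbers quoted in this post. Expressing the gap in units of the reported +/- is a simplification, since both runs use the same test text and their errors are not independent.

```python
# Values quoted in the scoreboard above.
iq4_xs_ppl = 8.57      # approximate value quoted for the IQ4_XS mix
q3_k_l_ppl = 8.7948    # final estimate for Q3_K_L with -c 512
q3_k_l_err = 0.07100   # reported +/- on that estimate

delta = q3_k_l_ppl - iq4_xs_ppl          # absolute PPL gap
relative = delta / iq4_xs_ppl * 100      # gap as a percentage of the IQ4_XS value
in_err_units = delta / q3_k_l_err        # gap measured in reported-error units

print(f"delta: {delta:+.2f} PPL ({relative:.1f}% higher)")
print(f"delta in units of the reported error: {in_err_units:.1f}")
```

The gap comes out at roughly 2.6% of the IQ4_XS value, a bit over three times the reported error on the Q3_K_L run.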
Your value is now closer to what I measured for your Q3_K_L: 8.8377 +/- 0.07155
The backend can have an effect, which is another reason I like to measure them all on the exact same hardware. But this is definitely more in line with what I might expect. Here are my full values (click into the fold):
Details
[
{
"name": "BF16",
"ppl": "8.3386 +/- 0.06651",
"size": 426.060,
"bpw": 16.003,
"legend": "full quality",
"comment": "",
"skip": true
},
{
"name": "Q8_0",
"ppl": "8.3590 +/- 0.06673",
"size": 226.431,
"bpw": 8.505,
"legend": "full quality",
"comment": "should be full quality as original is fp8",
"skip": true
},
{
"name": "IQ5_K",
"ppl": "8.4860 +/- 0.06815",
"size": 157.771,
"bpw": 5.926,
"legend": "ubergarm",
"comment": ""
},
{
"name": "IQ4_XS\n(mainline compatible)",
"ppl": "8.5702 +/- 0.06901",
"size": 114.842,
"bpw": 4.314,
"legend": "ubergarm",
"comment": "smol, with imatrix, mainline compat quant"
},
{
"name": "Q3_K_L",
"ppl": "8.8377 +/- 0.07155",
"size": 110.215,
"bpw": 4.140,
"legend": "ox-ox",
"comment": "https://huggingface.co/ox-ox/MiniMax-M2.5-GGUF"
},
{
"name": "smol-IQ3_KS",
"ppl": "8.7539 +/- 0.07075",
"size": 87.237,
"bpw": 3.277,
"legend": "ubergarm",
"comment": "full q8_0 attn.*"
},
{
"name": "derp-smol-IQ3_KS",
"ppl": "8.8293 +/- 0.07164",
"size": 86.641,
"bpw": 3.254,
"legend": "unreleased",
"comment": "iq6_k attn.*"
},
{
"name": "IQ2_KS",
"ppl": "9.6827 +/- 0.07972",
"size": 69.800,
"bpw": 2.622,
"legend": "ubergarm",
"comment": "full q8_0 attn.*"
}
]
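A quick way to sanity-check the table is to recompute bits per weight from file size. The sketch below assumes that size is in GiB (as quoted for the Q3_K_L in the first post) and that bpw is simply total bits divided by parameter count. The parameter count is inferred from the BF16 row, so it is an estimate and not an official figure.

```python
GIB = 2**30  # bytes per GiB

# (name, size in GiB, reported bpw, reported PPL), copied from the table above.
rows = [
    ("BF16", 426.060, 16.003, 8.3386),
    ("Q8_0", 226.431, 8.505, 8.3590),
    ("IQ5_K", 157.771, 5.926, 8.4860),
    ("IQ4_XS", 114.842, 4.314, 8.5702),
    ("Q3_K_L", 110.215, 4.140, 8.8377),
    ("smol-IQ3_KS", 87.237, 3.277, 8.7539),
    ("derp-smol-IQ3_KS", 86.641, 3.254, 8.8293),
    ("IQ2_KS", 69.800, 2.622, 9.6827),
]

def estimated_bpw(size_gib, n_params):
    """Bits per weight, assuming bpw = total file bits / parameter count."""
    return size_gib * GIB * 8 / n_params

# Parameter count implied by the BF16 row (an inference, not an official number).
_, bf16_size, bf16_bpw, bf16_ppl = rows[0]
n_params = bf16_size * GIB * 8 / bf16_bpw

print(f"implied parameters: {n_params / 1e9:.1f}B")
for name, size, bpw, ppl in rows:
    est = estimated_bpw(size, n_params)
    ppl_increase = (ppl - bf16_ppl) / bf16_ppl * 100
    print(f"{name:18s} bpw {bpw:6.3f} (est. {est:6.3f})  PPL +{ppl_increase:.2f}% vs BF16")
```

The implied parameter count lands near the "230B" figure mentioned earlier in the thread, and the recomputed bpw values should agree with the reported ones to about three decimal places.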
Awesome to see the Q3_K_L officially in the mix! Thanks for the bench and the inclusion in the table. The delta is exactly what I expected. Back to the lab to check out your talk now. Cheers!
