MTP?
are the MTP layers stripped? if not, i would love to use this together with https://github.com/ggml-org/llama.cpp/pull/22673 ! as currently, there are no good quants of this model which fit within 24gb and MTP makes a big difference (20tps --> 50tps)
Yeah, I've been watching that PR (very exciting; I use MTP on my vLLM setup and it's amazing). I'm waiting for it to merge into master, since there's still a bit of final work before it's fully stable. Once it's merged and done, I'd be more than happy to rebuild and re-post the quants properly. The llama.cpp I used for this model was a pull of main I did today.
Oh, and since you're on 24GB: I hope you enjoy the MQ-Q6_K_3 and MQ-Q5_K_S_1, because this is the first model I've worked with that fired anomaly detection in the predictive engine. In other words, it detected behavior that broke the standard quantization rules and exploited the discovered pattern heavily. That's why there's no Q8: it found Q6 patterns that were not just smaller but also had better KLD than Q8, which was pretty cool.
You only see that when anomaly detection fires, which happens when the architecture has weirdness that isn't normal but can be replicated. Both that Q6 and the Q5 hit far above their weight. The Q5_K_S_1 beat the standard llama.cpp Q6_K, which was super cool too.
i've tested the PR on my system with Vulkan and it works fine! people reported -50% PP which is why it's not merged, but i personally still get 600tps pp (down from maybe 900) and the TG going from 20 to 40-60 is 100% worth it. pretty please? 🥺 it's hard to go back once you try it, and vLLM isn't a great experience on 24gb but there aren't good quants for it, they all feel dumb </3
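A rough way to sanity-check that tradeoff is to add prompt-processing time and generation time for one request. This is only a back-of-the-envelope sketch: the speeds are the ones quoted above (taking the low end of 40-60 for TG), while the request shape (8000 prompt tokens, 1000 generated tokens) is a made-up example, not a measurement.

```python
def request_seconds(prompt_tokens, gen_tokens, pp_tps, tg_tps):
    """Wall time for one request: prompt processing plus token generation."""
    return prompt_tokens / pp_tps + gen_tokens / tg_tps


# Hypothetical request shape; swap in your own workload.
prompt_tokens, gen_tokens = 8000, 1000

# Speeds quoted in the thread: PP 900 -> 600 tps, TG 20 -> 40 tps (low end).
without_mtp = request_seconds(prompt_tokens, gen_tokens, pp_tps=900, tg_tps=20)
with_mtp = request_seconds(prompt_tokens, gen_tokens, pp_tps=600, tg_tps=40)

print(f"without MTP: {without_mtp:.1f}s, with MTP: {with_mtp:.1f}s")
```

With numbers like these, generation dominates the total, so a slower PP only wins out for very long prompts with very short replies.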
can barely run Q5_K_S_1 so i'll check that one out
crazy how you're able to pull this off. i really appreciate your work! been following you for months :D
Thank you! And you're tempting me now! Is that PR just not stripping those MTP tensors or something? Meaning as long as MTP isn't enabled is it stable? Would you know?
Also, the wiki isn't fully updated yet, but I'm trying to document how this works. It comes down to how v2.0 works.
Basically, it's utilizing what it learned from other models' tensor configurations. Usually a better version of a model is simply making trades. It's not necessarily using fewer bits, just swapping where we prioritize bits. This is the "normal".
What I call an "anomaly" within my system is a strict violation of the rule. In isolated sampling, the ffn_down group at Q8 had a lower KLD than at Q6. So Q8 is better than Q6, which is obvious and standard. But the system smoke tested, found, and validated localized emergent behavior in real hybrid scenarios that allowed a violation of that rule, so those patterns could be used to make Q6_K beat Q8_0 in specific groups or patterns.
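For anyone unfamiliar with the metric, KLD here is the KL divergence between the reference model's next-token distribution and the quantized model's, where lower is better. Below is a minimal sketch of that comparison on toy logits. It is not MagicQuant's pipeline, and the logit values and the `candidates` names are made up purely for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over one row of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(ref_logits, quant_logits):
    """KL(P_ref || P_quant) for one token position, in nats."""
    p = softmax(ref_logits)
    q = softmax(quant_logits)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def mean_kld(ref_rows, quant_rows):
    """Average KLD over token positions."""
    pairs = list(zip(ref_rows, quant_rows))
    return sum(kl_divergence(r, q) for r, q in pairs) / len(pairs)


# Toy data: two token positions, three-token vocabulary (made-up values).
reference = [[2.0, 1.0, 0.1], [0.5, 2.5, 0.3]]
candidates = {
    "hybrid_a": [[1.9, 1.1, 0.1], [0.6, 2.4, 0.3]],
    "hybrid_b": [[2.0, 1.0, 0.2], [0.5, 2.5, 0.2]],
}

# Rank candidates by mean KLD against the reference, best first.
for name in sorted(candidates, key=lambda n: mean_kld(reference, candidates[n])):
    print(name, mean_kld(reference, candidates[name]))
```

The "rule" described above is that more bits should give a lower mean KLD; an anomaly is a configuration where a lower-bit hybrid ranks first in a comparison like this one.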
It's similar to how it utilized IQ4_NL in the embedding group. This wasn't a violation of the rule; it just hit way above what it deserved thanks to emergent behavior, and that was utilized as well. Each MagicQuant repo is actually automated, with everything generated unless I manually change the README a bit, like I tend to do. The magicquant-manifest folder in every repo contains fully transparent logs of what the hybrids derive from, including when and where it's utilizing Unsloth Dynamic learned configurations, plus tensor-by-tensor configuration maps for full reproducibility :)
If I remember correctly, the MQ-Q5 actually took learned configurations from Unsloth Dynamic UD-Q6, with a mix of Q8, to pull it off.
Is that PR just not stripping those MTP tensors or something? Meaning as long as MTP isn't enabled is it stable? Would you know?
from what i've read, the way models are quantised strips the MTP layer since it's unsupported, but it's needed for MTP to work so from my understanding, as long as it's not stripped, it should work with the PR. not certain, but pretty sure.
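One way to check whether a given GGUF still carries those tensors is to list its tensor names and look for MTP-style ones. The sketch below is hedged on two assumptions: the marker substrings (`"mtp"`, `"nextn"`) are a guess at how such tensors might be named and should be adjusted to whatever the PR actually uses, and the example tensor names are invented. Reading names from a real file relies on the `gguf` Python package's `GGUFReader`.

```python
def find_mtp_tensors(tensor_names, markers=("mtp", "nextn")):
    """Return tensor names containing any marker substring.

    The default markers are assumptions, not confirmed naming.
    """
    return [
        name for name in tensor_names
        if any(marker in name.lower() for marker in markers)
    ]


def tensor_names_from_gguf(path):
    """List tensor names in a GGUF file (requires `pip install gguf`)."""
    from gguf import GGUFReader  # imported lazily so the filter works without it
    return [tensor.name for tensor in GGUFReader(path).tensors]


# Invented names, only to show the filter's behaviour.
example_names = [
    "token_embd.weight",
    "blk.0.attn_q.weight",
    "blk.63.nextn.eh_proj.weight",
]
print(find_mtp_tensors(example_names))
```

For a real file, `find_mtp_tensors(tensor_names_from_gguf("model.gguf"))` returning an empty list would suggest the layers were stripped, assuming the markers are right.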
sadly i just got it to my disk, just to realise it's 22gb and i can't fit that with 128k KV cache even with q8_0, gonna have to settle for less but your models are goated regardless
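The fit can be estimated before downloading: KV cache size is roughly 2 (K and V) × attention layers × KV heads × head dimension × context length × bytes per element. A rough sketch follows; the layer count, KV head count, and head dimension are hypothetical placeholders rather than this model's real values (read those from the GGUF metadata), and it ignores compute buffers and any non-attention state.

```python
F16_BYTES = 2.0
Q8_0_BYTES = 34 / 32  # a q8_0 block stores 32 int8 values plus one fp16 scale


def kv_cache_gib(n_attn_layers, n_kv_heads, head_dim, n_ctx, bytes_per_elem):
    """Approximate KV cache size in GiB for a given context length."""
    # K and V each hold n_kv_heads * head_dim values per token per layer.
    total_bytes = 2 * n_attn_layers * n_kv_heads * head_dim * n_ctx * bytes_per_elem
    return total_bytes / 2**30


# Hypothetical architecture values, for illustration only.
shape = dict(n_attn_layers=16, n_kv_heads=4, head_dim=256)
n_ctx = 128 * 1024

for label, width in (("f16", F16_BYTES), ("q8_0", Q8_0_BYTES)):
    print(label, round(kv_cache_gib(n_ctx=n_ctx, bytes_per_elem=width, **shape), 2), "GiB")
```

Adding that figure to the model file size gives a first-order idea of whether a quant plus the desired context fits in 24 GB.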
no pressure but like... 🥺 🙏 seriously though if it is not much effort you'd win my heart
Haha, I got you. I was about to get off for the night, and I'm going to be at my brother's wedding and doing wedding stuff for about 2 days, so I may not be able to come back to this and add MTP until this weekend. I just want to test that it doesn't break anything else, but if I have time after dinner I'll try to fit it in before I head out :)
awesome, thank you very much! enjoy the wedding and wishing you all the best, can't wait for MTP :D
I'm heading out for wedding stuff, but I'll continue looking into this when I'm back. There are a lot of reports specifically about it causing issues with images right now, so I need to test it properly to make sure the model is backwards compatible and that MTP can be disabled. That way the posted version is still stable :)
Though I'm very excited to use MTP as well! It's a world of difference.
yes, the only issue is the PP regression and images crashing. if anything, it could probably just be in a separate qwen3.6-27b-mtp-magiquant repo, but makes sense you don't wanna bother until everything is stable. hope you have a nice time!
MTP is now merged into main. I'll be re-uploading the 27B soon with MTP added. I was also holding off on uploading the Uncensored model until MTP was ready.
awesome. i just recently got a new laptop and i've been experimenting with GGUFs yesterday and was actually quite upset i had to use other GGUFs (no time to graft MTP layers, busy with work!)
appreciate your work like always brother 🫶
Still working on re-uploading with MTP. I've been busy and unable to look into it, but something is going on with the llama.cpp build. It keeps saying that the importance matrix data required for specific tensors is missing, even though I rebuilt the imatrix. I thought I could just graft the MTP layers on a bit more easily, but it seems more sensitive than I realized. I need to fix whatever is wrong on my end, because obviously others have gotten it working. It may not even be llama.cpp; it could be because I built my code to remember models and be more reusable. I'm wondering whether I messed something up so that a model whose tensors were not saved previously, but are saved now, causes issues, or whether it's something else.
Hopefully will get it figured out this weekend.
all the good luck to you