MTP?

by floory - opened 15 days ago

Discussion

floory

15 days ago

are the MTP layers stripped? if not, i would love to use this together with https://github.com/ggml-org/llama.cpp/pull/22673 ! as currently, there are no good quants of this model which fit within 24gb and MTP makes a big difference (20tps --> 50tps)

magiccodingman

Owner 15 days ago

Yeah, I've been watching that PR (very exciting. I use MTP on my vLLM setup and it's amazing). I'm waiting for it to merge into master since there's still a bit of final work for it to be fully stable. Once it's merged in and done, I'd be more than happy to rebuild and re-post the quants properly. But the llama.cpp I used for this model was a main git pull I did today.

magiccodingman

Owner 15 days ago

•

edited 15 days ago

Oh, and since you're on 24GB, I hope you enjoy the MQ-Q6_K_3 and MQ-Q5_K_S_1, because this model was the first of any I've worked with so far to fire anomaly detection within the predictive engine. In other words, it detected things that broke standard quantization rules and exploited the discovered pattern heavily. That's why there's no Q8: it found Q6 patterns that were not just smaller but had better KLD than Q8, which was pretty cool.

You only see that when anomaly detection fires because the architecture had weirdness that isn't normal and could be replicated. But both that Q6 and Q5 punch far above their weight. The Q5_K_S_1 beat the standard llama.cpp Q6_K, which was super cool too.

floory

15 days ago

•

edited 15 days ago

i've tested the PR on my system with Vulkan and it works fine on my system! people reported -50% PP which is why it's not merged but i personally still get 600tps pp (from maybe 900) but the TG from 20 to 40-60 is 100% worth it. pretty please? 🥺 it's hard to go back once you try it and vLLM isn't a great experience on 24gb but there aren't good quants for it, they all feel dumb </3

can barely run Q5_K_S_1 so i'll check that one out

crazy how you're able to pull this off. i really appreciate your work! been following you for months :D
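
The trade-off described above (slower prompt processing, faster token generation) can be put into rough numbers. This is only a back-of-the-envelope sketch: the throughput figures are the ones quoted in the post, while the prompt and reply lengths are assumptions picked for illustration.

```python
def request_seconds(prompt_tokens, gen_tokens, pp_tps, tg_tps):
    """Rough wall-clock estimate: prompt processing time plus generation time."""
    return prompt_tokens / pp_tps + gen_tokens / tg_tps

# Assumed workload (not from the thread): 4000-token prompt, 500-token reply.
PROMPT, REPLY = 4000, 500

# Throughput figures as quoted in the post: PP 900 -> 600 tps, TG 20 -> 40-60 tps.
baseline = request_seconds(PROMPT, REPLY, pp_tps=900, tg_tps=20)
with_mtp = request_seconds(PROMPT, REPLY, pp_tps=600, tg_tps=40)  # low end of 40-60

print(f"without MTP: {baseline:.1f} s")
print(f"with MTP:    {with_mtp:.1f} s")
```

Under these assumptions generation time dominates, so the TG gain outweighs the PP regression. A workload with very long prompts and short replies would shift the balance the other way.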

magiccodingman

Owner 15 days ago

•

edited 15 days ago

Thank you! And you're tempting me now! Is that PR just not stripping those MTP tensors or something? Meaning as long as MTP isn't enabled is it stable? Would you know?

Also, the wiki isn't done being fully updated, but I'm trying to document how it works. But this is due to how v2.0 works.

But basically it's utilizing what it learned from other model tensor configurations. Usually, better versions of the model are simply making trades. It's not necessarily fewer bits, just swapping where we prioritize bits. This is the "normal".

But what I call an "anomaly" within my system is a strict violation of the rule. In isolated sampling, the ffn_down group at Q8 had a lower KLD than at Q6, so Q8 is better than Q6, which is obvious and standard. But the system smoke tested real hybrid scenarios and was able to find and validate localized emergent behavior that violated the rule, so those patterns could be used to make Q6_K beat Q8_0 in specific groups or patterns.

Similar to how it utilized IQ4_NL in the embedding group. This wasn't a violation of the rule; it just hit way above what it deserved to through emergent behavior, and this was utilized as well. Each MagicQuant repo is actually automated, with everything generated unless I manually change the ReadMe a bit like I tend to do. But the magicquant-manifest folder in every repo has full transparent logs of what the hybrids derive from, including when and where it's utilizing Unsloth Dynamic learned configurations. Plus tensor-by-tensor configuration maps for full reproducibility :)

I think the MQ-Q5 actually took learned configurations from Unsloth Dynamic UD-Q6 with a mix of Q8 if I remember correctly to pull it off.
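
To make the "rule" and its violation concrete, here is a minimal sketch of comparing quantized outputs against a reference by KL divergence and flagging cases where fewer bits score better. It is not MagicQuant's actual engine: the function names, the nominal bit-width keys, and the toy logits are all hypothetical and only illustrate the idea.

```python
import math

def softmax(logits):
    """Numerically stable softmax over one row of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(ref_logits, test_logits):
    """KL(P_ref || P_test) for one token position, in nats."""
    p = softmax(ref_logits)
    q = softmax(test_logits)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def mean_kld(ref_rows, test_rows):
    """Average KLD over all sampled token positions."""
    return sum(kl_divergence(r, t) for r, t in zip(ref_rows, test_rows)) / len(ref_rows)

def find_anomalies(kld_by_bits):
    """Return (lower_bits, higher_bits) pairs where fewer bits gave a lower KLD."""
    ordered = sorted(kld_by_bits.items())
    return [
        (low, high)
        for i, (low, low_kld) in enumerate(ordered)
        for high, high_kld in ordered[i + 1:]
        if low_kld < high_kld
    ]

# Toy logits, made up for illustration (not measurements from any model).
reference = [[2.0, 1.0, 0.1], [0.5, 2.5, 0.3]]
hybrid_q8 = [[2.1, 0.9, 0.1], [0.5, 2.4, 0.4]]
hybrid_q6 = [[2.0, 1.0, 0.15], [0.5, 2.5, 0.3]]

scores = {8: mean_kld(reference, hybrid_q8), 6: mean_kld(reference, hybrid_q6)}
print(find_anomalies(scores))
```

In the normal case the list comes back empty, because more bits means lower KLD. A non-empty result is the kind of rule violation described above, which would then need validating on real hybrid builds before being trusted.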

floory

15 days ago

> Is that PR just not stripping those MTP tensors or something? Meaning as long as MTP isn't enabled is it stable? Would you know?

from what i've read, the way models are quantised strips the MTP layer since it's unsupported, but it's needed for MTP to work so from my understanding, as long as it's not stripped, it should work with the PR. not certain, but pretty sure.

sadly i just got it to my disk, just to realise it's 22gb and i can't fit that with 128k KV cache even with q8_0, gonna have to settle for less but your models are goated regardless

no pressure but like... 🥺 🙏 seriously though if it is not much effort you'd win my heart
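
For anyone checking the same fit, the KV cache arithmetic can be sketched as below. The layer count, KV head count, and head dimension are placeholders and not Qwen3.6-27B's real configuration; the sketch also assumes every layer keeps a full KV cache, which may overestimate for hybrid architectures. Only the 22 GB file size, the 24 GB budget, the 128k context, and the q8_0 cache type come from the thread.

```python
# q8_0 stores 32 values as 32 int8 plus one fp16 scale: 34 bytes per 32 elements.
Q8_0_BYTES_PER_ELEM = 34 / 32
F16_BYTES_PER_ELEM = 2.0

def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_ctx, bytes_per_elem):
    """Size of the K and V caches, assuming each layer caches every position."""
    return 2 * n_layers * n_kv_heads * head_dim * n_ctx * bytes_per_elem

# PLACEHOLDER architecture values, hypothetical, for illustration only.
N_LAYERS, N_KV_HEADS, HEAD_DIM = 64, 8, 128
N_CTX = 128 * 1024  # "128k" read as 131072 tokens

cache_gb = kv_cache_bytes(N_LAYERS, N_KV_HEADS, HEAD_DIM, N_CTX, Q8_0_BYTES_PER_ELEM) / 1e9
model_gb = 22.0   # file size mentioned in the post
budget_gb = 24.0  # VRAM budget mentioned in the thread

print(f"KV cache: {cache_gb:.1f} GB, model + cache: {model_gb + cache_gb:.1f} GB")
print("fits" if model_gb + cache_gb <= budget_gb else "does not fit")
```

Substituting the model's real attention configuration (for example from the GGUF metadata) gives a usable estimate. Compute buffers add further overhead that this sketch ignores.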

magiccodingman

Owner 15 days ago

Haha, I got you. I was about to get off for the night, and I am going to be at my brother's wedding and doing wedding stuff for like 2 days. So I may not be able to come back to this and add MTP till this weekend. I just want to test that it doesn't break anything else, but if I have time after dinner I'll try to fit it in before I head out :)

floory

15 days ago

awesome, thank you very much! enjoy the wedding and wishing you all the best, can't wait for MTP :D

magiccodingman

Owner 15 days ago

I'm heading out for wedding stuff, but I'll continue looking into this when I'm back. There are a lot of reports specifically about it causing issues with images right now. So I just need to test it properly to make sure the model is backwards compatible and MTP can be disabled. That way the posted version is still stable :)

Though I'm very excited to use MTP as well! It's a world of difference.

floory

14 days ago

yes, the only issue is the PP regression and images crashing. if anything, it could probably just be in a separate qwen3.6-27b-mtp-magiquant repo, but makes sense you don't wanna bother until everything is stable. hope you have a nice time!

magiccodingman

Owner 3 days ago

The MTP is now merged into main. I'll be re-uploading the 27B soon with the MTP added. I was waiting to upload the Uncensored model as well until MTP was ready.

floory

3 days ago

awesome. i just recently got a new laptop and i've been experimenting with GGUFs yesterday and was actually quite upset i had to use other GGUFs (no time to graft MTP layers, busy with work!)

appreciate your work like always brother 🫶

magiccodingman

Owner about 16 hours ago

Still working on re-uploading with MTP. I've been busy and unable to look into it, but something is going on with the llama.cpp build. It keeps saying that the importance matrix (imatrix) information required for specific tensors is missing, even though I rebuilt the imatrix. I thought I could just graft it on a bit more easily, but it seems more sensitive than I realized. I need to fix whatever is wrong on my end, because obviously others have gotten it working. It may not even be llama.cpp; it may be because I built the code to remember models and be more reusable. I'm wondering if I messed something up, so that a model whose tensors were not saved previously but are saved now could cause issues, or whether it's something else.

Hopefully will get it figured out this weekend.

floory

about 10 hours ago

all the good luck to you
