Instructions to use mrtoots/unsloth-GLM-4.5-MLX-2Bit with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use mrtoots/unsloth-GLM-4.5-MLX-2Bit with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="mrtoots/unsloth-GLM-4.5-MLX-2Bit")
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)

# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("mrtoots/unsloth-GLM-4.5-MLX-2Bit")
model = AutoModelForCausalLM.from_pretrained("mrtoots/unsloth-GLM-4.5-MLX-2Bit")
messages = [
    {"role": "user", "content": "Who are you?"},
]
inputs = tokenizer.apply_chat_template(
	messages,
	add_generation_prompt=True,
	tokenize=True,
	return_dict=True,
	return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))

MLX

How to use mrtoots/unsloth-GLM-4.5-MLX-2Bit with MLX:

# Make sure mlx-lm is installed
# pip install --upgrade mlx-lm

# Generate text with mlx-lm
from mlx_lm import load, generate

model, tokenizer = load("mrtoots/unsloth-GLM-4.5-MLX-2Bit")

prompt = "Write a story about Einstein"
messages = [{"role": "user", "content": prompt}]
prompt = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True
)

text = generate(model, tokenizer, prompt=prompt, verbose=True)

Local Apps

vLLM

How to use mrtoots/unsloth-GLM-4.5-MLX-2Bit with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "mrtoots/unsloth-GLM-4.5-MLX-2Bit"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "mrtoots/unsloth-GLM-4.5-MLX-2Bit",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker

docker model run hf.co/mrtoots/unsloth-GLM-4.5-MLX-2Bit

SGLang

How to use mrtoots/unsloth-GLM-4.5-MLX-2Bit with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "mrtoots/unsloth-GLM-4.5-MLX-2Bit" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "mrtoots/unsloth-GLM-4.5-MLX-2Bit",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "mrtoots/unsloth-GLM-4.5-MLX-2Bit" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "mrtoots/unsloth-GLM-4.5-MLX-2Bit",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Pi

How to use mrtoots/unsloth-GLM-4.5-MLX-2Bit with Pi:

Start the MLX server

# Install MLX LM:
uv tool install mlx-lm
# Start a local OpenAI-compatible server:
mlx_lm.server --model "mrtoots/unsloth-GLM-4.5-MLX-2Bit"

Configure the model in Pi

# Install Pi:
npm install -g @mariozechner/pi-coding-agent
# Add to ~/.pi/agent/models.json:
{
  "providers": {
    "mlx-lm": {
      "baseUrl": "http://localhost:8080/v1",
      "api": "openai-completions",
      "apiKey": "none",
      "models": [
        {
          "id": "mrtoots/unsloth-GLM-4.5-MLX-2Bit"
        }
      ]
    }
  }
}

Run Pi

# Start Pi in your project directory:
pi

Hermes Agent

How to use mrtoots/unsloth-GLM-4.5-MLX-2Bit with Hermes Agent:

Start the MLX server

# Install MLX LM:
uv tool install mlx-lm
# Start a local OpenAI-compatible server:
mlx_lm.server --model "mrtoots/unsloth-GLM-4.5-MLX-2Bit"

Configure Hermes

# Install Hermes:
curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash
hermes setup
# Point Hermes at the local server:
hermes config set model.provider custom
hermes config set model.base_url http://127.0.0.1:8080/v1
hermes config set model.default mrtoots/unsloth-GLM-4.5-MLX-2Bit

Run Hermes

hermes

MLX LM

How to use mrtoots/unsloth-GLM-4.5-MLX-2Bit with MLX LM:

Generate or start a chat session

# Install MLX LM
uv tool install mlx-lm
# Interactive chat REPL
mlx_lm.chat --model "mrtoots/unsloth-GLM-4.5-MLX-2Bit"

Run an OpenAI-compatible server

# Install MLX LM
uv tool install mlx-lm
# Start the server
mlx_lm.server --model "mrtoots/unsloth-GLM-4.5-MLX-2Bit"
# Call the OpenAI-compatible server with curl (mlx_lm.server listens on port 8080 by default)
curl -X POST "http://localhost:8080/v1/chat/completions" \
   -H "Content-Type: application/json" \
   --data '{
     "model": "mrtoots/unsloth-GLM-4.5-MLX-2Bit",
     "messages": [
       {"role": "user", "content": "Hello"}
     ]
   }'
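The same request can be built from Python with only the standard library. This is a minimal sketch, assuming the server from the step above is running locally on mlx_lm.server's default port 8080; the helper names `build_chat_request` and `send_chat_request` are made up for this example.

```python
import json
import urllib.request


def build_chat_request(prompt, model="mrtoots/unsloth-GLM-4.5-MLX-2Bit",
                       base_url="http://localhost:8080/v1"):
    """Build (but do not send) a POST request for /chat/completions."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url=f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_chat_request(request, timeout=600):
    """Send the request and return the text of the first choice."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]


# Usage (needs the local server to be up):
# print(send_chat_request(build_chat_request("Hello")))
```

Splitting the build step from the send step makes it easy to inspect the URL and JSON body before anything goes over the wire.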

Docker Model Runner
How to use mrtoots/unsloth-GLM-4.5-MLX-2Bit with Docker Model Runner:
```
docker model run hf.co/mrtoots/unsloth-GLM-4.5-MLX-2Bit
```

Awful loss of coherence between MLX 2bit and IQ2_XXS 🙀

by overhead520 - opened Aug 27, 2025

Discussion

overhead520

Aug 27, 2025 • edited Aug 27, 2025

Hi! Thank you for taking the time to make and share your MLX 2bit quant. 🥰👍
I'm using Unsloth's IQ2_XXS GGUF and it works amazingly well, if a bit slowly.
By contrast, your MLX 2bit version (and another one I quantized myself) is awfully incoherent with the same settings.

Example prompt, where the model confuses a tongue 👅 with a language 🗣️ and rambles like a drunk man 🥴
👤 How would you describe a cat's tongue?
🤖 (MLX 2bit) The first thing that comes to mind for those who are from countries where onomatopoeia is used frequently and cats are common, would likely be "meow" or "purr". The word meow is a familiar sound and it seems to be the only thing that comes to mind when thinking of cats. However, there are other words that can be used to describe a cat's tongue.
The word meow is often used as an interjection and it can be difficult to tell if the speaker means something else. The word meow has been used in many different contexts and it can be difficult to know which meaning is being referred to. The word meow has been used in many different contexts and it can be difficult to know which meaning is being referred to.
(LM Studio: Temp 0.06, MinP 0.05, TopP 0.92)
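For readers unfamiliar with these three settings, here is a rough sketch of what they do to a next-token distribution. The logits are hypothetical toy numbers, and the filter order (temperature, then min-p, then top-p) is an assumption; real samplers in LM Studio, llama.cpp, or mlx-lm may order or implement these steps differently.

```python
import math


def filter_distribution(logits, temp=0.06, min_p=0.05, top_p=0.92):
    """Return post-filter probabilities for a dict of token -> logit."""
    # Temperature: divide logits by temp, then softmax (shifted for stability).
    peak = max(logits.values())
    weights = {tok: math.exp((val - peak) / temp) for tok, val in logits.items()}
    total = sum(weights.values())
    probs = {tok: w / total for tok, w in weights.items()}

    # Min-p: drop tokens whose probability is below min_p * (top probability).
    threshold = min_p * max(probs.values())
    probs = {tok: p for tok, p in probs.items() if p >= threshold}

    # Top-p: keep the most likely tokens until their cumulative mass reaches top_p.
    kept, cumulative = {}, 0.0
    for tok, p in sorted(probs.items(), key=lambda item: item[1], reverse=True):
        kept[tok] = p
        cumulative += p
        if cumulative >= top_p:
            break

    # Renormalize the survivors so they sum to 1.
    total = sum(kept.values())
    return {tok: p / total for tok, p in kept.items()}


# Hypothetical logits, chosen only for illustration.
toy_logits = {"rough": 2.0, "sandpaper": 1.5, "meow": 1.0, "language": 0.5}
print(filter_distribution(toy_logits))            # settings from the post
print(filter_distribution(toy_logits, temp=1.0))  # same filters at temp 1.0
```

With these toy numbers, only the top token survives at temp 0.06, while all four remain candidates at temp 1.0. At such a low temperature sampling is close to greedy, which suggests the rambling above is more likely down to what the quantized model ranks first than to sampling randomness.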

deleted

Aug 27, 2025

Same experience.
I chatted with Nightmedia about it, and he says the difference comes down to the options GGUF and MLX offer for fine-grained mixed quants. GGUF allows more freedom and, I think, a smaller page size.
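To give a feel for why the bit width matters so much, here is a toy sketch of plain min/max (affine) group quantization. It is a generic illustration, not MLX's or llama.cpp's actual kernels, and the weights are a made-up deterministic sequence. At 2 bits each group can only represent 4 distinct values.

```python
def quantize_group(values, bits):
    """Affine (min/max) quantization of one group; returns dequantized values."""
    levels = (1 << bits) - 1
    low, high = min(values), max(values)
    scale = (high - low) / levels if high > low else 1.0
    return [low + round((v - low) / scale) * scale for v in values]


def max_abs_error(weights, bits, group_size):
    """Largest reconstruction error when quantizing in fixed-size groups."""
    worst = 0.0
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        restored = quantize_group(group, bits)
        worst = max(worst, max(abs(a - b) for a, b in zip(group, restored)))
    return worst


# Hypothetical weights: a deterministic pseudo-random sequence in [-1, 1).
weights = [((i * 37) % 101) / 50.5 - 1.0 for i in range(128)]

for bits in (2, 4, 8):
    print(bits, "bits -> max abs error", max_abs_error(weights, bits, group_size=64))
```

Formats that let some tensors keep more bits than others can spend the budget where the error hurts most, which is one plausible reading of the "more freedom" point.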

Unsloth's XXS is working well for me: about 13t/s with 91/93 layers offloaded (full offload ≤1t/s), flash attention on, 24k context.
Can push it to 32k with a fp16 V quant no K quant - holds together fairly well.
sudo sysctl iogpu.wired_limit_mb=120826
On an M3 128GB.
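The sysctl value above works out to roughly 92% of the machine's memory, leaving about 10GB for everything else. A small helper for picking a value on other machines, assuming "128GB" means 128 × 1024 MB and that the setting is expressed in MB; `wired_limit_mb` is a name made up for this sketch.

```python
def wired_limit_mb(total_ram_gb, reserve_gb):
    """Wired-memory limit in MB that leaves reserve_gb for the rest of the system."""
    if not 0 <= reserve_gb < total_ram_gb:
        raise ValueError("reserve_gb must be non-negative and below total_ram_gb")
    return int((total_ram_gb - reserve_gb) * 1024)


limit = wired_limit_mb(total_ram_gb=128, reserve_gb=10)
print(f"sudo sysctl iogpu.wired_limit_mb={limit}")
```

How much headroom to reserve is a judgment call; the two values in this thread (120826 and 120000) both leave around 10 to 11GB.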

overhead520

Aug 27, 2025

Thanks @L285348120816 for sharing this feedback (and your speed benchmark).
I ran Unsloth's IQ2_XXS at 17 t/s on an M4 Max 128GB via Kobold.cpp (full offload, context shift, flash attention, use MMAP, use mlock, use FastForwarding, 8bit KV cache), with 16384 context and sysctl iogpu.wired_limit_mb=120000.

mrtoots

Owner Aug 27, 2025 • edited Aug 28, 2025

I know the user is deleted, but here's an update: Unsloth updated their GLM 4.5 on HF, so it's now easier to run it through MLX quants. (Thanks @shimmyshimmer and @NeuralNovel )

I'll convert the Unsloth 4.5 2bit and 3bit and maybe those will run better.

(edit) Both are now converted, using Unsloth as the base model for GLM 4.5 2bit and 3bit.

overhead520

Aug 29, 2025

I'll convert the Unsloth 4.5 2bit and 3bit and maybe those will run better.
(edit) both are now converted using unsloth as the base model for GLM 4.5 2bit and 3bit.

Sadly, no visible improvement for me. 😓

mrtoots

Owner Aug 29, 2025

Bummer. I was able to get it to behave a little better by modifying sampler settings, but it's not as good as DeepSeek for me. You're better off with the 3-bit version than the 2-bit one.

Tompai

Aug 29, 2025

Thank you for your efforts. I tested the new 2-bit version in LM Studio, but unfortunately I have to say that it does not deliver usable results.

mrtoots

Owner Aug 29, 2025

MLX quants might just not be flexible enough at around 110GB. I'll look into other quant methods more and see if any work well.

Until then, at around 110GB you can run 4.5 Air at 8Bit (113GB) or GLM 4.5v 8bit (115GB). My personal favorite lately is any fine-tune or quant of Wayfarer Large, which is much smaller at 8bit (75GB).
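The sizes quoted in this thread can be roughly reproduced with back-of-the-envelope arithmetic. This sketch rests on several assumptions that are not stated in the thread: about 355B total parameters for GLM 4.5 and 106B for GLM 4.5 Air, a quantization group size of 64, and a 16-bit scale plus a 16-bit bias stored per group (32 overhead bits), applied uniformly to every weight.

```python
def estimated_size_gb(params_billions, bits, group_size=64, overhead_bits=32):
    """Approximate weight size in GB (10^9 bytes) for group-quantized weights."""
    # Each group of weights shares one scale and one bias, so the overhead
    # per weight is overhead_bits / group_size.
    bits_per_weight = bits + overhead_bits / group_size
    return params_billions * bits_per_weight / 8


print(f"GLM 4.5 at 2 bits:     {estimated_size_gb(355, 2):.1f} GB")
print(f"GLM 4.5 Air at 8 bits: {estimated_size_gb(106, 8):.1f} GB")
```

Under those assumptions a "2-bit" quant costs about 2.5 bits per weight, which lands near 111GB for the full model and near 113GB for Air at 8 bits, in line with the figures above.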
