Awful loss of coherence between MLX 2bit and IQ2_XXS
Hi! Thank you for taking the time to make and share your MLX 2bit quant. 🥰
I'm using Unsloth's IQ2_XXS GGUF and it works amazingly well, if a bit slow.
On the other hand, your MLX 2bit version (and another one I quantized myself) is awfully incoherent by comparison (using the same settings).
Example prompt, where the model confuses a tongue 👅 with a language 🗣️ and rambles like a drunk man 🥴:
👤 How would you describe a cat's tongue?
🤖 (MLX 2bit) The first thing that comes to mind for those who are from countries where onomatopoeia is used frequently and cats are common, would likely be "meow" or "purr". The word meow is a familiar sound and it seems to be the only thing that comes to mind when thinking of cats. However, there are other words that can be used to describe a cat's tongue.
The word meow is often used as an interjection and it can be difficult to tell if the speaker means something else. The word meow has been used in many different contexts and it can be difficult to know which meaning is being referred to. The word meow has been used in many different contexts and it can be difficult to know which meaning is being referred to.
(LM Studio: Temp 0.06, MinP 0.05, TopP 0.92)
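For readers wondering how much these sampler settings could contribute to the rambling, here is a minimal sketch of temperature, min-p and top-p applied to a toy distribution. The logits are hypothetical, and the order in which the filters run (temperature, then min-p, then top-p) is an assumption; LM Studio's actual pipeline may differ.

```python
import math

def filter_distribution(logits, temp=0.06, min_p=0.05, top_p=0.92):
    """Return the renormalized probabilities left after temperature, min-p and top-p."""
    # Temperature scaling followed by a numerically stable softmax.
    scaled = [x / temp for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Min-p: drop tokens whose probability is below min_p * (top probability).
    cutoff = min_p * max(probs)
    kept = {i: p for i, p in enumerate(probs) if p >= cutoff}

    # Top-p: keep the smallest high-probability prefix whose mass reaches top_p.
    ranked = sorted(kept.items(), key=lambda kv: kv[1], reverse=True)
    kept_mass = sum(kept.values())
    nucleus, running = {}, 0.0
    for i, p in ranked:
        nucleus[i] = p
        running += p / kept_mass
        if running >= top_p:
            break

    norm = sum(nucleus.values())
    return {i: p / norm for i, p in nucleus.items()}

toy_logits = [2.0, 1.8, 0.5, -1.5]  # hypothetical scores for four candidate tokens
near_greedy = filter_distribution(toy_logits)        # settings quoted in the post
looser = filter_distribution(toy_logits, temp=1.0)   # same filters at temperature 1.0
print("temp 0.06:", near_greedy)
print("temp 1.0: ", looser)
```

With a temperature this low, only the top-scoring token survives in the toy case, so sampling is close to greedy. If that carries over to the real model, the incoherence is more likely to come from the quantized weights than from sampling randomness.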
Same experience.
I chatted with Nightmedia about it, and he says it comes down to the difference in the options GGUF and MLX allow for fine-grained mixed quants. GGUF allows more freedom and, I think, a smaller page size.
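To illustrate why the granularity of a quant can matter at 2 bits, here is a rough sketch, assuming "page size" refers to the number of weights that share one scale and offset (the group or block size). It uses synthetic Gaussian weights and plain affine round-to-nearest quantization, which is a simplification and not the actual GGUF or MLX scheme. The 32 bits of per-group overhead (one 16-bit scale plus one 16-bit offset) are also an assumption.

```python
import random

def quantize_group(values, bits=2):
    """Affine round-to-nearest quantization of one group; returns dequantized values."""
    lo, hi = min(values), max(values)
    levels = (1 << bits) - 1              # 3 steps between 4 levels at 2 bits
    scale = (hi - lo) / levels or 1.0     # fall back to 1.0 for a flat group
    return [lo + round((v - lo) / scale) * scale for v in values]

def mean_abs_error(weights, group_size, bits=2):
    """Quantize consecutive groups and report the mean absolute reconstruction error."""
    restored = []
    for start in range(0, len(weights), group_size):
        restored.extend(quantize_group(weights[start:start + group_size], bits))
    return sum(abs(a - b) for a, b in zip(weights, restored)) / len(weights)

def bits_per_weight(bits, group_size, overhead_bits=32):
    """Storage cost, assuming one 16-bit scale and one 16-bit offset per group."""
    return bits + overhead_bits / group_size

random.seed(0)
weights = [random.gauss(0.0, 1.0) for _ in range(4096)]  # synthetic stand-in weights
for group_size in (128, 64, 32):
    print(
        f"group={group_size:3d}  "
        f"bits/weight={bits_per_weight(2, group_size):.2f}  "
        f"mean abs error={mean_abs_error(weights, group_size):.4f}"
    )
```

Smaller groups track the local range of the weights more closely, at the cost of more bits per weight. Formats that can vary this per tensor have more room to trade size against quality.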
Unsloth's XXS is working well for me: about 13 t/s with 91/93 layers offloaded (full offload ≤1 t/s), flash attention on, 24k context.
I can push it to 32k with an fp16 V quant and no K quant; it holds together fairly well.
sudo sysctl iogpu.wired_limit_mb=120826
On an M3 128GB.
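As a quick sanity check on that wired limit, the arithmetic can be sketched as follows. This assumes the sysctl value is counted in MiB (1024-based) and that a 128GB machine exposes 128 × 1024 MiB; `wired_limit_mb` is a hypothetical helper, not part of any tool.

```python
def wired_limit_mb(total_ram_gb, reserve_mb):
    """MiB of unified memory the GPU may wire, leaving reserve_mb for the OS."""
    total_mb = total_ram_gb * 1024
    if not 0 <= reserve_mb < total_mb:
        raise ValueError("reserve_mb must be between 0 and total RAM")
    return total_mb - reserve_mb

total_mb = 128 * 1024        # assumed total for a 128GB machine
posted_limit = 120826        # value from the sysctl command quoted in the post

print(f"share of RAM wired: {posted_limit / total_mb:.1%}")
print(f"left for the OS:    {total_mb - posted_limit} MiB")
print(f"sudo sysctl iogpu.wired_limit_mb={wired_limit_mb(128, total_mb - posted_limit)}")
```

Under those assumptions the posted value leaves roughly 10GB for everything other than the model.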
Thanks @L285348120816 for sharing this feedback (and your speed benchmark).
I ran Unsloth's IQ2_XXS at 17 t/s on an M4 Max 128GB via Kobold.cpp (full offload, context shift, flash attention, use MMAP, use mlock, use FastForwarding, 8bit KV cache), with 16384 context and sysctl iogpu.wired_limit_mb=120000.
I know the user is deleted, but here's an update: Unsloth updated their GLM 4.5 on HF, so it can now be run through MLX quants more easily. (Thanks @shimmyshimmer and @NeuralNovel)
I'll convert the Unsloth 4.5 2bit and 3bit and maybe those will run better.
(edit) Both are now converted using Unsloth as the base model for GLM 4.5 2bit and 3bit.
Sadly, no visible improvement for me.
Bummer. I was able to get it to behave a little better by modifying the sampler settings, but it's not as good as DeepSeek for me. You're better off with the 3-bit version than the 2-bit.
Thank you for your efforts. I tested the new 2-bit version in LM Studio, but unfortunately I have to say that it does not deliver usable results.
MLX quants might just not be flexible enough at around 110GB. I'll look into other quant methods more and see if any work well.
Until then, at around 110GB you can run 4.5 Air at 8bit (113GB) or GLM 4.5V at 8bit (115GB). My personal favorite lately is any fine-tune or quant of Wayfarer Large, which is much smaller at 8bit (75GB).