Instructions to use QuantTrio/GLM-5-AWQ with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use QuantTrio/GLM-5-AWQ with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="QuantTrio/GLM-5-AWQ")
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)

# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("QuantTrio/GLM-5-AWQ")
model = AutoModelForCausalLM.from_pretrained("QuantTrio/GLM-5-AWQ")
messages = [
    {"role": "user", "content": "Who are you?"},
]
inputs = tokenizer.apply_chat_template(
	messages,
	add_generation_prompt=True,
	tokenize=True,
	return_dict=True,
	return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))

Notebooks
Google Colab
Kaggle
Local Apps

vLLM

How to use QuantTrio/GLM-5-AWQ with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "QuantTrio/GLM-5-AWQ"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "QuantTrio/GLM-5-AWQ",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker

docker model run hf.co/QuantTrio/GLM-5-AWQ

SGLang

How to use QuantTrio/GLM-5-AWQ with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "QuantTrio/GLM-5-AWQ" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "QuantTrio/GLM-5-AWQ",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "QuantTrio/GLM-5-AWQ" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "QuantTrio/GLM-5-AWQ",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Docker Model Runner
How to use QuantTrio/GLM-5-AWQ with Docker Model Runner:
```
docker model run hf.co/QuantTrio/GLM-5-AWQ
```

GLM-5-AWQ / README.md

JunHowie

Update README.md

a838c55 verified 3 months ago

preview code

raw

history blame contribute delete

11.6 kB

	---
	library_name: transformers
	license: mit
	pipeline_tag: text-generation
	tags:
	- vLLM
	- AWQ
	base_model:
	- zai-org/GLM-5
	base_model_relation: quantized

	---
	# GLM-5-AWQ
	Base model: [zai-org/GLM-5](https://huggingface.co/zai-org/GLM-5)

	This repo quantizes the model using data-free quantization (no calibration dataset required).

	### 【Dependencies / Installation】

	```python
	# NOTE:
	# vllm==0.16.0rc2 absolutely would NOT work!
	# Must upgrade to >=0.16.1rc1
	vllm>=0.16.1rc1.dev7
	transformers>=5.3.0.dev0
	```

	As of 2026-02-26, make sure your system has cuda12.8 installed.

	Then, create a fresh Python environment (e.g. python3.12 venv) and run:
	```bash
	pip install -U vllm --pre --index-url https://pypi.org/simple --extra-index-url https://wheels.vllm.ai/nightly
	pip install git+https://github.com/huggingface/transformers.git
	pip install git+https://github.com/deepseek-ai/DeepGEMM.git@v2.1.1.post3 --no-build-isolation
	```
	[vLLM Official Guide](https://docs.vllm.ai/projects/recipes/en/latest/Qwen/Qwen3.5.html)


	### 【vLLM Startup Command】
	<i>Note: When launching with TP=8, include `--enable-expert-parallel`;
	otherwise the expert tensors wouldn’t be evenly sharded across GPU devices.</i>

	```
	export VLLM_USE_DEEP_GEMM=0
	export VLLM_USE_FLASHINFER_MOE_FP16=1
	export VLLM_USE_FLASHINFER_SAMPLER=0
	export OMP_NUM_THREADS=4

	vllm serve \
	__YOUR_PATH__/QuantTrio/GLM-5-AWQ \
	--served-model-name MY_MODEL \
	--swap-space 16 \
	--max-num-seqs 32 \
	--max-model-len 32768 \
	--gpu-memory-utilization 0.9 \
	--tensor-parallel-size 8 \
	--enable-expert-parallel \
	--enable-auto-tool-choice \
	--tool-call-parser glm47 \
	--reasoning-parser glm45 \
	--speculative-config '{"method":"mtp","num_speculative_tokens":1}' \
	--trust-remote-code \
	--host 0.0.0.0 \
	--port 8000
	```

	### 【Logs】
	```
	2026-02-26
	1. Initial commit
	```

	### 【Model Files】
	\| File Size \| Last Updated \|
	\|-----------\|--------------\|
	\| `392 GiB` \| `2026-02-26` \|

	### 【Model Download】
	```python
	from huggingface_hub import snapshot_download
	snapshot_download('QuantTrio/GLM-5-AWQ', cache_dir="your_local_path")
	```

	### 【Overview】

	# GLM-5

	<div align="center">
	<img src=https://raw.githubusercontent.com/zai-org/GLM-5/refs/heads/main/resources/logo.svg width="15%"/>
	</div>
	<p align="center">
	👋 Join our <a href="https://raw.githubusercontent.com/zai-org/GLM-5/refs/heads/main/resources/wechat.png" target="_blank">WeChat</a> or <a href="https://discord.gg/QR7SARHRxK" target="_blank">Discord</a> community.
	<br>
	📖 Check out the GLM-5 <a href="https://z.ai/blog/glm-5" target="_blank">technical blog</a>.
	<br>
	📍 Use GLM-5 API services on <a href="https://docs.z.ai/guides/llm/glm-5">Z.ai API Platform. </a>
	<br>
	👉 One click to <a href="https://chat.z.ai">GLM-5</a>.
	</p>

	## Introduction

	We are launching GLM-5, targeting complex systems engineering and long-horizon agentic tasks. Scaling is still one of the most important ways to improve the intelligence efficiency of Artificial General Intelligence (AGI). Compared to GLM-4.5, GLM-5 scales from 355B parameters (32B active) to 744B parameters (40B active), and increases pre-training data from 23T to 28.5T tokens. GLM-5 also integrates DeepSeek Sparse Attention (DSA), largely reducing deployment cost while preserving long-context capacity.

	Reinforcement learning aims to bridge the gap between competence and excellence in pre-trained models. However, deploying it at scale for LLMs is a challenge due to the RL training inefficiency. To this end, we developed [slime](https://github.com/THUDM/slime), a novel asynchronous RL infrastructure that substantially improves training throughput and efficiency, enabling more fine-grained post-training iterations. With advances in both pre-training and post-training, GLM-5 delivers significant improvement compared to GLM-4.7 across a wide range of academic benchmarks and achieves best-in-class performance among all open-source models in the world on reasoning, coding, and agentic tasks, closing the gap with frontier models.

	## Benchmark

	\| \| GLM-5 \| GLM-4.7 \| DeepSeek-V3.2 \| Kimi K2.5 \| Claude Opus 4.5 \| Gemini 3 Pro \| GPT-5.2 (xhigh) \|
	\| -------------------------------- \| ---------------------- \| --------- \| ------------- \|-----------\| --------------- \| ------------ \| --------------- \|
	\| HLE \| 30.5 \| 24.8 \| 25.1 \| 31.5 \| 28.4 \| 37.2 \| 35.4 \|
	\| HLE (w/ Tools) \| 50.4 \| 42.8 \| 40.8 \| 51.8 \| 43.4* \| 45.8* \| 45.5* \|
	\| AIME 2026 I \| 92.7 \| 92.9 \| 92.7 \| 92.5 \| 93.3 \| 90.6 \| - \|
	\| HMMT Nov. 2025 \| 96.9 \| 93.5 \| 90.2 \| 91.1 \| 91.7 \| 93.0 \| 97.1 \|
	\| IMOAnswerBench \| 82.5 \| 82.0 \| 78.3 \| 81.8 \| 78.5 \| 83.3 \| 86.3 \|
	\| GPQA-Diamond \| 86.0 \| 85.7 \| 82.4 \| 87.6 \| 87.0 \| 91.9 \| 92.4 \|
	\| SWE-bench Verified \| 77.8 \| 73.8 \| 73.1 \| 76.8 \| 80.9 \| 76.2 \| 80.0 \|
	\| SWE-bench Multilingual \| 73.3 \| 66.7 \| 70.2 \| 73.0 \| 77.5 \| 65.0 \| 72.0 \|
	\| Terminal-Bench 2.0 (Terminus 2) \| 56.2 / 60.7 † \| 41.0 \| 39.3 \| 50.8 \| 59.3 \| 54.2 \| 54.0 \|
	\| Terminal-Bench 2.0 (Claude Code) \| 56.2 / 61.1 † \| 32.8 \| 46.4 \| - \| 57.9 \| - \| - \|
	\| CyberGym \| 43.2 \| 23.5 \| 17.3 \| 41.3 \| 50.6 \| 39.9 \| - \|
	\| BrowseComp \| 62.0 \| 52.0 \| 51.4 \| 60.6 \| 37.0 \| 37.8 \| - \|
	\| BrowseComp (w/ Context Manage) \| 75.9 \| 67.5 \| 67.6 \| 74.9 \| 67.8 \| 59.2 \| 65.8 \|
	\| BrowseComp-Zh \| 72.7 \| 66.6 \| 65.0 \| 62.3 \| 62.4 \| 66.8 \| 76.1 \|
	\| τ²-Bench \| 89.7 \| 87.4 \| 85.3 \| 80.2 \| 91.6 \| 90.7 \| 85.5 \|
	\| MCP-Atlas (Public Set) \| 67.8 \| 52.0 \| 62.2 \| 63.8 \| 65.2 \| 66.6 \| 68.0 \|
	\| Tool-Decathlon \| 38.0 \| 23.8 \| 35.2 \| 27.8 \| 43.5 \| 36.4 \| 46.3 \|
	\| Vending Bench 2 \| $4,432.12 \| $2,376.82 \| $1,034.00 \| $1,198.46 \| $4,967.06 \| $5,478.16 \| $3,591.33 \|

	> *: refers to their scores of full set.
	>
	> †: A verified version of Terminal-Bench 2.0 that fixes some ambiguous instructions.
	See footnote for more evaluation details.

	### Footnote

	* Humanity’s Last Exam (HLE) & other reasoning tasks: We evaluate with a maximum generation length of 131,072 tokens (`temperature=1.0, top_p=0.95, max_new_tokens=131072`). By default, we report the text-only subset; results marked with * are from the full set. We use GPT-5.2 (medium) as the judge model. For HLE-with-tools, we use a maximum context length of 202,752 tokens.
	* SWE-bench & SWE-bench Multilingual: We run the SWE-bench suite with OpenHands using a tailored instruction prompt. Settings: `temperature=0.7, top_p=0.95, max_new_tokens=16384`, with a 200K context window.
	* BrowserComp: Without context management, we retain details from the most recent 5 turns. With context management, we use the same discard-all strategy as DeepSeek-v3.2 and Kimi K2.5.
	* Terminal-Bench 2.0 (Terminus 2): We evaluate with the Terminus framework using `timeout=2h, temperature=0.7, top_p=1.0, max_new_tokens=8192`, with a 128K context window. Resource limits are capped at 16 CPUs and 32 GB RAM.
	* Terminal-Bench 2.0 (Claude Code): We evaluate in Claude Code 2.1.14 (think mode, default effort) with `temperature=1.0, top_p=0.95, max_new_tokens=65536`. We remove wall-clock time limits due to generation speed, while preserving per-task CPU and memory constraints. Scores are averaged over 5 runs. We fix environment issues introduced by Claude Code and also report results on a verified Terminal-Bench 2.0 dataset that resolves ambiguous instructions (see: [https://huggingface.co/datasets/zai-org/terminal-bench-2-verified](https://huggingface.co/datasets/zai-org/terminal-bench-2-verified)).
	* CyberGym: We evaluate in Claude Code 2.1.18 (think mode, no web tools) with (`temperature=1.0, top_p=1.0, max_new_tokens=32000`) and a 250-minute timeout per task. Results are single-run Pass@1 over 1,507 tasks.
	* MCP-Atlas: All models are evaluated in think mode on the 500-task public subset with a 10-minute timeout per task. We use Gemini 3 Pro as the judge model.
	* τ²-bench: We add a small prompt adjustment in Retail and Telecom to avoid failures caused by premature user termination. For Airline, we apply the domain fixes proposed in the Claude Opus 4.5 system card.
	* Vending Bench 2: Runs are conducted independently by [Andon Labs](https://andonlabs.com/evals/vending-bench-2).


	## Serve GLM-5 Locally

	### Prepare environment

	vLLM, SGLang, and xLLM all support local deployment of GLM-5. A simple deployment guide is provided here.

	+ vLLM

	Using Docker as:

	```shell
	docker pull vllm/vllm-openai:nightly
	```

	or using pip:

	```shell
	pip install -U vllm --pre --index-url https://pypi.org/simple --extra-index-url https://wheels.vllm.ai/nightly
	```

	then upgrade transformers:

	```
	pip install git+https://github.com/huggingface/transformers.git
	```

	+ SGLang

	Using Docker as:
	```bash
	docker pull lmsysorg/sglang:glm5-hopper # For Hopper GPU
	docker pull lmsysorg/sglang:glm5-blackwell # For Blackwell GPU
	```

	### Deploy

	+ vLLM

	```shell
	vllm serve zai-org/GLM-5-FP8 \
	--tensor-parallel-size 8 \
	--gpu-memory-utilization 0.85 \
	--speculative-config.method mtp \
	--speculative-config.num_speculative_tokens 1 \
	--tool-call-parser glm47 \
	--reasoning-parser glm45 \
	--enable-auto-tool-choice \
	--served-model-name glm-5-fp8
	```

	Check the [recipes](https://github.com/vllm-project/recipes/blob/main/GLM/GLM5.md) for more details.

	+ SGLang

	```shell
	python3 -m sglang.launch_server \
	--model-path zai-org/GLM-5-FP8 \
	--tp-size 8 \
	--tool-call-parser glm47 \
	--reasoning-parser glm45 \
	--speculative-algorithm EAGLE \
	--speculative-num-steps 3 \
	--speculative-eagle-topk 1 \
	--speculative-num-draft-tokens 4 \
	--mem-fraction-static 0.85 \
	--served-model-name glm-5-fp8
	```

	Check the [sglang cookbook](https://cookbook.sglang.io/autoregressive/GLM/GLM-5) for more details.

	+ xLLM and other Ascend NPU

	Please check the deployment guide [here](https://github.com/zai-org/GLM-5/blob/main/example/ascend.md).


	## Citation

	Our technical report is coming soon.