Model Overview

Model Architecture: GLM-5.2
- Input: Text
- Output: Text
Supported Hardware Microarchitecture: AMD MI350/MI355
ROCm: 7.0.0
PyTorch: 2.9.0
Transformers: 5.8.1
Operating System(s): Linux
Inference Engine: SGLang/vLLM
Model Optimizer: AMD-Quark (V0.11)
- Weight quantization: MOE-only (shared experts quantized), OCP MXFP4, Static
- Activation quantization: MOE-only, OCP MXFP4, Dynamic
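As a rough guide to what MXFP4 means for storage: the OCP Microscaling format groups values into blocks of 32, stores each value as a 4-bit FP4 (E2M1) element, and attaches one shared 8-bit (E8M0) scale per block. The sketch below works out the effective bits per quantized value and the compression relative to BF16. It covers only the quantized MoE tensors, not the layers left in higher precision, and the variable names are illustrative.

```shell
# OCP MXFP4 layout: 32-element blocks, 4-bit elements, one 8-bit shared scale per block.
BLOCK_SIZE=32
ELEM_BITS=4
SCALE_BITS=8

# Effective storage cost per quantized value, including the amortized scale.
bits_per_value=$(awk -v b="$BLOCK_SIZE" -v e="$ELEM_BITS" -v s="$SCALE_BITS" \
    'BEGIN { printf "%.2f", e + s / b }')

# Compression of a quantized tensor relative to 16-bit BF16 storage.
ratio=$(awk -v bpv="$bits_per_value" 'BEGIN { printf "%.2f", 16 / bpv }')

echo "bits per value: $bits_per_value"   # 4.25
echo "vs. BF16:       ${ratio}x"         # 3.76x
```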

This model was built from GLM-5.2 by applying AMD-Quark MXFP4 quantization.

Model Quantization

The model was quantized from zai-org/GLM-5.2 using AMD-Quark. Weights and activations of the MoE layers are quantized to MXFP4; all other layers are left unquantized.

Quantization scripts:

cd Quark/examples/torch/language_modeling/llm_ptq/
# The "*layers.78.*" pattern excludes the MTP layer (layer 78).
python quantize_quark.py \
    --model_dir zai-org/GLM-5.2 \
    --output_dir GLM-5.2-MXFP4 \
    --quant_scheme mxfp4 \
    --exclude_layers "*self_attn*" "*mlp.gate" "*lm_head" \
        "*mlp.gate_proj" "*mlp.up_proj" "*mlp.down_proj" \
        "*layers.78.*" \
    --file2file_quantization
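To see which modules the exclusion patterns leave for quantization, the sketch below applies the same patterns with shell glob matching. Two assumptions: that AMD-Quark matches `--exclude_layers` entries as glob-style wildcards against full module names, and that the module names in the loop are representative. The names are hypothetical and not read from the checkpoint.

```shell
# Classify a module name using the same wildcard patterns passed to --exclude_layers.
is_excluded() {
    case "$1" in
        *self_attn*|*mlp.gate|*lm_head|*mlp.gate_proj|*mlp.up_proj|*mlp.down_proj|*layers.78.*)
            echo "excluded" ;;
        *)
            echo "quantized" ;;
    esac
}

# Hypothetical module names, for illustration only.
for name in \
    model.layers.3.self_attn.q_proj \
    model.layers.3.mlp.gate \
    model.layers.0.mlp.up_proj \
    model.layers.3.mlp.experts.0.gate_proj \
    model.layers.3.mlp.shared_experts.down_proj \
    model.layers.78.mlp.experts.0.up_proj \
    lm_head
do
    printf '%-45s %s\n' "$name" "$(is_excluded "$name")"
done
```

Under these assumptions only the routed and shared expert projections fall through to quantization. `*mlp.gate_proj` matches a dense MLP projection but not `mlp.experts.0.gate_proj` or `mlp.shared_experts.gate_proj`, because the pattern requires the name to end in exactly `mlp.gate_proj`.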

Deployment

Use with SGLang/vLLM

This model can be deployed efficiently using the SGLang or vLLM backends.
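A minimal serving sketch, assuming the same settings as the evaluation commands in the Reproduction section (tensor parallelism of 4, Quark quantization, AITER environment variables for vLLM). The function names `serve_sglang` and `serve_vllm` are illustrative wrappers, not part of either project, and the flag values are a starting point rather than a tuned configuration.

```shell
MODEL=amd/GLM-5.2-MXFP4
TP=4   # tensor-parallel degree, taken from the evaluation commands

# Launch an SGLang server; extra arguments are passed through unchanged.
serve_sglang() {
    python3 -m sglang.launch_server \
        --model-path "$MODEL" \
        --tp-size "$TP" \
        "$@"
}

# Launch a vLLM server with the same environment variables used for evaluation.
serve_vllm() {
    VLLM_ROCM_USE_AITER=1 \
    VLLM_ROCM_USE_AITER_FP8BMM=0 \
    VLLM_ROCM_USE_AITER_FP4BMM=0 \
    vllm serve "$MODEL" \
        --tensor-parallel-size "$TP" \
        --quantization quark \
        --gpu-memory-utilization 0.9 \
        --max-model-len 32768 \
        --trust-remote-code \
        "$@"
}
```

Call `serve_sglang` or `serve_vllm` inside the matching Docker image; both servers expose an OpenAI-compatible HTTP API once the model has loaded.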

Evaluation

The model was evaluated on the GSM8K benchmark.

Accuracy

Benchmark	GLM-5.2	GLM-5.2-MXFP4 (this model)	Recovery
GSM8K (flexible-extract)	94.09	93.93	99.8%
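The Recovery column is the quantized score expressed as a percentage of the baseline score. A short check of the arithmetic, using the two GSM8K numbers from the table (variable names are illustrative):

```shell
BASELINE=94.09    # GLM-5.2, GSM8K (flexible-extract)
QUANTIZED=93.93   # GLM-5.2-MXFP4, GSM8K (flexible-extract)

# Recovery = quantized score / baseline score, as a percentage.
recovery=$(awk -v base="$BASELINE" -v quant="$QUANTIZED" \
    'BEGIN { printf "%.1f%%", quant / base * 100 }')

echo "Recovery: $recovery"   # Recovery: 99.8%
```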

Reproduction

The GSM8K results were obtained with lm-evaluation-harness inside the Docker image lmsysorg/sglang:v0.5.13.post1-rocm700-mi35x, which ships with SGLang pre-installed; lm-eval was built and installed from source.

lm_eval --model sglang \
    --model_args pretrained=amd/GLM-5.2-MXFP4,tp_size=4 \
    --tasks gsm8k \
    --batch_size auto

The Docker image rocm/vllm-dev:nightly_main_20260616, which ships with vLLM pre-installed, can be used to reproduce the results with the vLLM backend.

export VLLM_ROCM_USE_AITER=1
export VLLM_ROCM_USE_AITER_FP8BMM=0
export VLLM_ROCM_USE_AITER_FP4BMM=0
lm_eval --model vllm \
    --model_args pretrained=amd/GLM-5.2-MXFP4,tensor_parallel_size=4,dtype=auto,quantization=quark,gpu_memory_utilization=0.9,max_model_len=32768,trust_remote_code=True \
    --tasks gsm8k \
    --batch_size auto

License
