Model Overview

  • Model Architecture: GLM-5.2
    • Input: Text
    • Output: Text
  • Supported Hardware Microarchitecture: AMD MI350/MI355
  • ROCm: 7.0.0
  • PyTorch: 2.9.0
  • Transformers: 5.8.1
  • Operating System(s): Linux
  • Inference Engine: SGLang/vLLM
  • Model Optimizer: AMD-Quark (V0.11)
    • Weight quantization: MOE-only (shared experts quantized), OCP MXFP4, Static
    • Activation quantization: MOE-only, OCP MXFP4, Dynamic

This model was built from GLM-5.2 by applying AMD-Quark MXFP4 quantization.

Model Quantization

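The model was quantized from zai-org/GLM-5.2 using AMD-Quark. Weights and activations of the MoE layers (including shared experts) are quantized to OCP MXFP4, with static weight scales and dynamic activation scales. Layers matched by the --exclude_layers patterns in the script below are left unquantized.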
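For readers unfamiliar with the format, here is a minimal NumPy sketch of MXFP4 "fake" quantization (quantize, then dequantize) following the OCP Microscaling definition: blocks of 32 values share one power-of-two (E8M0) scale, and each element is stored as FP4 E2M1. This is an illustration only and not AMD-Quark's implementation. The function name mxfp4_fake_quantize is hypothetical, and ties are resolved toward the smaller magnitude rather than by round-to-nearest-even.

```python
import numpy as np

# Magnitudes representable by FP4 E2M1 (sign is handled separately).
E2M1_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])
BLOCK = 32       # elements sharing one scale in MXFP4
E2M1_EMAX = 2    # exponent of the largest E2M1 value (6 = 1.5 * 2**2)


def mxfp4_fake_quantize(x):
    """Quantize x to MXFP4 and dequantize it again (illustrative sketch).

    Returns (dequantized array, per-block power-of-two scales).
    """
    x = np.asarray(x, dtype=np.float32)
    assert x.size % BLOCK == 0, "size must be a multiple of 32"
    blocks = x.reshape(-1, BLOCK)

    # Shared exponent: floor(log2(max |x|)) minus the element format's emax.
    amax = np.abs(blocks).max(axis=1, keepdims=True)
    safe_amax = np.where(amax > 0, amax, 1.0)  # avoid log2(0) for all-zero blocks
    shared_exp = np.floor(np.log2(safe_amax)) - E2M1_EMAX
    shared_exp = np.clip(shared_exp, -127, 127)  # E8M0 exponent range
    scale = np.exp2(shared_exp)

    # Snap each scaled magnitude to the nearest E2M1 value.
    # Magnitudes above 6 land on 6, which acts as saturation.
    scaled = blocks / scale
    idx = np.abs(np.abs(scaled)[..., None] - E2M1_GRID).argmin(axis=-1)
    q = np.sign(scaled) * E2M1_GRID[idx]

    return (q * scale).reshape(x.shape), scale.squeeze(1)


# Usage: quantize two blocks of evenly spaced values and report the error.
values = np.linspace(-1.0, 1.0, 64, dtype=np.float32)
dequantized, scales = mxfp4_fake_quantize(values)
print("per-block scales:", scales)
print("max abs error:", np.abs(dequantized - values).max())
```

Static weight quantization computes these scales once at quantization time, while dynamic activation quantization recomputes them for each input at inference time.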

Quantization script:

cd Quark/examples/torch/language_modeling/llm_ptq/
# The "*layers.78.*" pattern excludes the MTP layer (layer 78).
python quantize_quark.py \
    --model_dir zai-org/GLM-5.2 \
    --output_dir GLM-5.2-MXFP4 \
    --quant_scheme mxfp4 \
    --exclude_layers "*self_attn*" "*mlp.gate" "*lm_head" \
        "*mlp.gate_proj" "*mlp.up_proj" "*mlp.down_proj" \
        "*layers.78.*" \
    --file2file_quantization
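To see which modules the --exclude_layers patterns above would skip, the following sketch applies them to a few module names. It assumes the patterns are matched as shell-style globs against full module names (as Python's fnmatch does). The module names are hypothetical examples in the usual Transformers naming style, not names read from the checkpoint.

```python
from fnmatch import fnmatchcase

# Patterns copied from the quantization command.
EXCLUDE = [
    "*self_attn*", "*mlp.gate", "*lm_head",
    "*mlp.gate_proj", "*mlp.up_proj", "*mlp.down_proj",
    "*layers.78.*",
]


def is_excluded(name, patterns=EXCLUDE):
    """Return True if any glob pattern matches the full module name."""
    return any(fnmatchcase(name, p) for p in patterns)


# Hypothetical module names, for illustration only.
names = [
    "model.layers.3.self_attn.q_proj",             # attention
    "model.layers.3.mlp.gate",                     # MoE router gate
    "model.layers.0.mlp.down_proj",                # dense (non-MoE) MLP
    "model.layers.3.mlp.experts.0.gate_proj",      # routed expert
    "model.layers.3.mlp.shared_experts.up_proj",   # shared expert
    "model.layers.78.mlp.experts.0.up_proj",       # MTP layer
    "lm_head",
]

for name in names:
    status = "excluded" if is_excluded(name) else "quantized"
    print(f"{name}: {status}")
```

Under this assumption, only the routed and shared expert projections outside layer 78 remain eligible for quantization, which matches the "MOE-only (shared experts quantized)" description in the overview.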

Deployment

Use with SGLang/vLLM

This model can be deployed efficiently using the SGLang or vLLM backends.
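As a starting point, the sketch below assembles server launch commands for both backends and prints them for use in a shell. The flags come from the public SGLang and vLLM command-line interfaces, and the tensor-parallel size and Quark setting mirror the evaluation commands in the Reproduction section. The helper names sglang_cmd and vllm_cmd are hypothetical, and the commands are a suggestion under those assumptions rather than a validated configuration.

```python
import shlex

MODEL = "amd/GLM-5.2-MXFP4"
TP = 4  # tensor-parallel size used in the evaluation runs


def sglang_cmd(model=MODEL, tp=TP):
    """Build an SGLang server launch command as an argv list."""
    return [
        "python3", "-m", "sglang.launch_server",
        "--model-path", model,
        "--tp-size", str(tp),
    ]


def vllm_cmd(model=MODEL, tp=TP):
    """Build a vLLM server launch command as an argv list."""
    return [
        "vllm", "serve", model,
        "--tensor-parallel-size", str(tp),
        "--quantization", "quark",
        "--trust-remote-code",
    ]


# Print shell-ready commands; run them inside the matching Docker image.
print(shlex.join(sglang_cmd()))
print(shlex.join(vllm_cmd()))
```

For vLLM, the environment variables exported in the Reproduction section (for example VLLM_ROCM_USE_AITER=1) would presumably need to be set before launching the server as well.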

Evaluation

The model was evaluated on the GSM8K benchmark.

Accuracy

Benchmark                  GLM-5.2   GLM-5.2-MXFP4 (this model)   Recovery
GSM8K (flexible-extract)   94.09     93.93                        99.8%
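The Recovery column appears to be the quantized score expressed as a percentage of the baseline score. That reading is an assumption, but it reproduces the reported figure:

```python
baseline = 94.09    # GLM-5.2, GSM8K (flexible-extract)
quantized = 93.93   # GLM-5.2-MXFP4, same metric

# Recovery = quantized accuracy as a percentage of baseline accuracy.
recovery = quantized / baseline * 100
print(f"Recovery: {recovery:.1f}%")  # prints "Recovery: 99.8%"
```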

Reproduction

The GSM8K results were obtained with the lm-evaluation-harness framework inside the Docker image lmsysorg/sglang:v0.5.13.post1-rocm700-mi35x. SGLang comes pre-installed in that image; lm-eval was built and installed from source.

lm_eval --model sglang \
    --model_args pretrained=amd/GLM-5.2-MXFP4,tp_size=4 \
    --tasks gsm8k \
    --batch_size auto

The Docker image rocm/vllm-dev:nightly_main_20260616, which has vLLM pre-installed, can also be used to reproduce the results with the vLLM backend.

export VLLM_ROCM_USE_AITER=1
export VLLM_ROCM_USE_AITER_FP8BMM=0
export VLLM_ROCM_USE_AITER_FP4BMM=0
lm_eval --model vllm \
    --model_args 'pretrained=amd/GLM-5.2-MXFP4,tensor_parallel_size=4,dtype=auto,quantization=quark,gpu_memory_utilization=0.9,max_model_len=32768,trust_remote_code=True' \
    --tasks gsm8k \
    --batch_size auto

License

Modifications Copyright(c) 2026 Advanced Micro Devices, Inc. All rights reserved.
