GLM-4.6-REAP-218B-A32B W4A16 (AutoRound Quantization)

This is a 4-bit (W4A16) quantization of cerebras/GLM-4.6-REAP-218B-A32B, produced with Intel AutoRound.

Model Details

  • Base Model: cerebras/GLM-4.6-REAP-218B-A32B
  • Quantization: W4A16 (4-bit weights, 16-bit activations)
  • Method: Intel AutoRound
  • Format: auto_round (compatible with vLLM and SGLang)
  • Architecture: GLM-4 Mixture of Experts
  • Total Parameters: 218B
  • Active Parameters: 32B (A32B)
  • Original Size: ~436 GB (BF16)
  • Quantized Size: ~116 GB
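
The size figures follow directly from the parameter count; a quick sanity check (the gap between the raw 4-bit figure and ~116 GB is quantization metadata such as scales and zero points, plus layers kept in higher precision):

total_params = 218e9

bf16_gb = total_params * 2 / 1e9   # 2 bytes per BF16 weight   -> ~436 GB
w4_gb = total_params * 0.5 / 1e9   # 0.5 bytes per 4-bit weight -> ~109 GB

print(f"BF16: ~{bf16_gb:.0f} GB, W4: ~{w4_gb:.0f} GB before overhead")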

Performance Benchmarks

Tested on 8x NVIDIA RTX 3090 (24GB each) with vLLM:

Speed Test (~20k context)

  • Prompt Tokens: ~21,178
  • Completion Tokens: 393
  • Time to First Token (TTFT): 23.82 s
  • Total Generation Time: 36.45 s
  • Prefill Speed: ~889 tok/s
  • Generation Speed: ~31 tok/s
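
Both throughput figures derive directly from the timings above: prefill speed is prompt tokens over TTFT, and generation speed is completion tokens over the remaining decode window:

prompt_tokens, completion_tokens = 21_178, 393
ttft_s, total_s = 23.82, 36.45

prefill = prompt_tokens / ttft_s                 # ~889 tok/s
decode = completion_tokens / (total_s - ttft_s)  # ~31 tok/s
print(f"prefill ~{prefill:.0f} tok/s, decode ~{decode:.0f} tok/s")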

Coherence Test

The model correctly recalled all embedded facts from a long context (a minimal harness for reproducing this kind of check is sketched after the list):

  • Character name: Aurelia
  • Product code: ZX-42-ALPHA
  • Transaction amount: 7,530,000 credits
  • Scientist name: Dr. Linh Tran
  • Date: 2025-12-15

Usage

vLLM (Recommended)

  vllm serve 0xSero/GLM-4.6-REAP-218B-A32B-W4A16-AutoRound \
      --host 0.0.0.0 --port 8000 \
      --tensor-parallel-size 4 --pipeline-parallel-size 2 \
      --quantization auto-round \
      --kv-cache-dtype fp8 \
      --max-model-len 200000 \
      --gpu-memory-utilization 0.88 \
      --cpu-offload-gb 4 \
      --block-size 32 \
      --max-num-seqs 8 \
      --max-num-batched-tokens 8192 \
      --swap-space 32 \
      --enable-expert-parallel \
      --enable-prefix-caching \
      --enable-chunked-prefill \
      --disable-custom-all-reduce \
      --disable-log-requests \
      --trust-remote-code

SGLang

python3 -m sglang.launch_server \
    --model-path 0xSero/GLM-4.6-REAP-218B-A32B-W4A16-AutoRound \
    --tp-size 8 \
    --trust-remote-code
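
Both servers expose an OpenAI-compatible API; a minimal smoke test, assuming the vLLM server above on port 8000 (for SGLang, swap in its default port 30000):

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
resp = client.chat.completions.create(
    model="0xSero/GLM-4.6-REAP-218B-A32B-W4A16-AutoRound",  # must match the served model name/path
    messages=[{"role": "user", "content": "Reply with one short sentence."}],
    max_tokens=32,
)
print(resp.choices[0].message.content)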

Hardware Requirements

  • 8x 24GB GPUs (~192 GB total): TP=4, PP=2; recommended
  • 4x 48GB GPUs (~192 GB total): TP=4, no PP needed
  • 8x 48GB GPUs (~384 GB total): full speed, larger batches

Minimum: 8x 24GB GPUs (RTX 3090/4090) or equivalent, totaling ~192 GB of VRAM.
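
As a rough per-GPU budget for the 8x 24 GB configuration (illustrative; real overhead varies with vLLM version, context length, and batch size):

weights_gb, num_gpus, vram_gb = 116, 8, 24

weights_per_gpu = weights_gb / num_gpus   # ~14.5 GB of quantized weights per GPU
usable = vram_gb * 0.88                   # matches --gpu-memory-utilization 0.88
headroom = usable - weights_per_gpu       # ~6.6 GB per GPU
print(f"~{headroom:.1f} GB/GPU left for KV cache and activations")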

Quantization Details

Method

Quantized using Intel AutoRound with the following configuration:

  • Scheme: W4A16 (4-bit weights, 16-bit activations)
  • Calibration samples: 64
  • Sequence length: 512
  • Batch size: 1

Quantization Script

#!/usr/bin/env python3
"""
GLM-4.6-REAP-218B W4A16 Quantization using Intel AutoRound

Produces SGLang/vLLM-compatible 4-bit quantized model.
"""

import logging
from datetime import datetime
from pathlib import Path

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s | %(levelname)s - %(message)s",
)
logger = logging.getLogger(__name__)

MODEL_ID = "/mnt/llm_models/GLM-4.6-REAP-218B-A32B"  # or "cerebras/GLM-4.6-REAP-218B-A32B"
OUTPUT_DIR = "/mnt/llm_models/GLM-4.6-REAP-218B-A32B-W4A16-AutoRound"


def main():
    logger.info("=" * 80)
    logger.info("GLM-4.6-REAP-218B W4A16 Quantization (Intel AutoRound)")
    logger.info("=" * 80)

    start = datetime.now()

    from auto_round import AutoRound

    logger.info(f"Model: {MODEL_ID}")
    logger.info(f"Output: {OUTPUT_DIR}")
    logger.info(f"Scheme: W4A16 (4-bit weights, 16-bit activations)")

    Path(OUTPUT_DIR).mkdir(parents=True, exist_ok=True)

    logger.info("Initializing AutoRound (CPU mode)...")
    autoround = AutoRound(
        MODEL_ID,
        scheme="W4A16",
        device="cpu",
        device_map="cpu",
        trust_remote_code=True,
        batch_size=1,
        seqlen=512,
        nsamples=64,
    )

    logger.info("Starting quantization...")
    autoround.quantize_and_save(OUTPUT_DIR, format="auto_round")

    elapsed = datetime.now() - start
    logger.info("=" * 80)
    logger.info(f"Done in {elapsed}")
    logger.info(f"Output: {OUTPUT_DIR}")
    logger.info("=" * 80)


if __name__ == "__main__":
    main()
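
A minimal sketch for sanity-checking the saved checkpoint with vLLM's offline Python API, assuming the same 8-GPU layout as the serving command above (the prompt is illustrative):

from vllm import LLM, SamplingParams

llm = LLM(
    model="/mnt/llm_models/GLM-4.6-REAP-218B-A32B-W4A16-AutoRound",
    quantization="auto-round",
    tensor_parallel_size=4,
    pipeline_parallel_size=2,
    trust_remote_code=True,
)
out = llm.generate(["The capital of France is"], SamplingParams(max_tokens=8))
print(out[0].outputs[0].text)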

About the Base Model

GLM-4.6-REAP-218B-A32B is Cerebras' expert-pruned (REAP) variant of GLM-4.6, a Mixture of Experts (MoE) model, with:

  • 218 billion total parameters
  • 32 billion active parameters per forward pass
  • Strong performance on reasoning and long-context tasks
  • Native support for 128k+ context windows (served here with a 200k-token window via --max-model-len)

For more details, see the base model card.

Limitations

  • Quantization may slightly reduce quality compared to BF16
  • Requires significant VRAM (~192GB minimum across GPUs)
  • Best results with tensor parallelism across 4-8 GPUs

License

This quantized model inherits the license from the base model. See cerebras/GLM-4.6-REAP-218B-A32B for licensing details.

Acknowledgments

  • Cerebras for the base GLM-4.6-REAP model
  • Intel for the AutoRound quantization toolkit
  • vLLM and SGLang teams for inference support

Citation

If you use this model, please cite the original:

@misc{glm46reap,
  title={GLM-4.6-REAP-218B-A32B},
  author={Cerebras},
  year={2025},
  url={https://huggingface.co/cerebras/GLM-4.6-REAP-218B-A32B}
}