---
license: apache-2.0
base_model: cerebras/GLM-4.6-REAP-218B-A32B
library_name: transformers
tags:
  - glm
  - moe
  - mixture-of-experts
  - autoround
  - quantized
  - 4-bit
  - w4a16
  - vllm
  - sglang
  - cerebras
model_type: glm4
pipeline_tag: text-generation
quantized_by: 0xSero
inference: false
---

# GLM-4.6-REAP-218B-A32B W4A16 (AutoRound Quantization)

This is a 4-bit (W4A16) quantized version of [cerebras/GLM-4.6-REAP-218B-A32B](https://huggingface.co/cerebras/GLM-4.6-REAP-218B-A32B), produced with [Intel AutoRound](https://github.com/intel/auto-round).

## Model Details

| Property | Value |
|---|---|
| Base Model | [cerebras/GLM-4.6-REAP-218B-A32B](https://huggingface.co/cerebras/GLM-4.6-REAP-218B-A32B) |
| Quantization | W4A16 (4-bit weights, 16-bit activations) |
| Method | [Intel AutoRound](https://github.com/intel/auto-round) |
| Format | `auto_round` (compatible with vLLM, SGLang) |
| Architecture | GLM-4 Mixture of Experts |
| Total Parameters | 218B |
| Active Parameters | 32B (A32B) |
| Original Size | ~436 GB (BF16) |
| Quantized Size | ~116 GB |
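
The size figures follow from simple parameter arithmetic. The sketch below is a back-of-the-envelope estimate; the exact split between packed 4-bit weights, quantization scales, and layers left in higher precision is an assumption, not read from the checkpoint:

```python
# Rough size estimate: 218B parameters at BF16 (2 bytes each) vs. packed 4-bit.
# The gap between 109 GB and the ~116 GB on disk is assumed to come from
# quantization scales plus layers typically kept in higher precision
# (embeddings, norms); the exact breakdown is not verified here.
params = 218e9
bf16_gb = params * 2 / 1e9    # ~436 GB
w4_gb = params * 0.5 / 1e9    # ~109 GB of packed 4-bit weights
print(f"BF16: ~{bf16_gb:.0f} GB, W4: ~{w4_gb:.0f} GB (+ scales/unquantized layers ~= 116 GB)")
```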

## Performance Benchmarks

Tested on 8x NVIDIA RTX 3090 (24GB each) with vLLM:

### Speed Test (~20k context)

| Metric | Value |
|---|---|
| Prompt Tokens | ~21,178 |
| Completion Tokens | 393 |
| Time to First Token (TTFT) | 23.82 s |
| Total Generation Time | 36.45 s |
| Prefill Speed | ~889 tok/s |
| Generation Speed | ~31 tok/s |
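
The derived speeds are consistent with the raw timings, assuming the prompt is fully prefilled by the time the first token arrives:

```python
# Derive the throughput rows from the raw measurements above.
prompt_tokens, completion_tokens = 21178, 393
ttft, total_time = 23.82, 36.45  # seconds

prefill_speed = prompt_tokens / ttft                    # ~889 tok/s
decode_speed = completion_tokens / (total_time - ttft)  # ~31 tok/s
print(f"prefill ~= {prefill_speed:.0f} tok/s, decode ~= {decode_speed:.0f} tok/s")
```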

### Coherence Test

The model correctly recalled all of the following facts embedded in a long context (a minimal sketch of how such a recall check can be constructed follows the list):

- Character name: Aurelia
- Product code: ZX-42-ALPHA
- Transaction amount: 7,530,000 credits
- Scientist name: Dr. Linh Tran
- Date: 2025-12-15
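
The exact harness behind these results is not published here; the sketch below is a hypothetical reconstruction of the general needle-in-a-haystack recipe: plant known facts inside roughly 20k tokens of filler, then check the model's answer for them. The filler text and phrasing are illustrative.

```python
# Hypothetical reconstruction of a long-context recall check (the actual
# harness used for the results above is not published here).
facts = {
    "character name": "Aurelia",
    "product code": "ZX-42-ALPHA",
    "transaction amount": "7,530,000 credits",
    "scientist name": "Dr. Linh Tran",
    "date": "2025-12-15",
}

# Two filler blocks of roughly 10k tokens each put the needles mid-context.
filler = "Routine log entry with no notable events. " * 1200
needles = " ".join(f"For the record, the {k} is {v}." for k, v in facts.items())
prompt = filler + needles + filler + "\n\nList every fact stated 'for the record' above."


def score(answer: str) -> dict[str, bool]:
    """Report which planted facts appear verbatim in the model's answer."""
    return {k: v in answer for k, v in facts.items()}
```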

## Usage

### vLLM (Recommended)

```bash
vllm serve 0xSero/GLM-4.6-REAP-218B-A32B-W4A16-AutoRound \
    --host 0.0.0.0 --port 8000 \
    --tensor-parallel-size 4 --pipeline-parallel-size 2 \
    --quantization auto-round \
    --kv-cache-dtype fp8 \
    --max-model-len 200000 \
    --gpu-memory-utilization 0.88 \
    --cpu-offload-gb 4 \
    --block-size 32 \
    --max-num-seqs 8 \
    --max-num-batched-tokens 8192 \
    --swap-space 32 \
    --enable-expert-parallel \
    --enable-prefix-caching \
    --enable-chunked-prefill \
    --disable-custom-all-reduce \
    --disable-log-requests \
    --trust-remote-code
```
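
Once the server is up, it exposes an OpenAI-compatible API. A minimal smoke test with the `openai` Python client; the model name defaults to the path/repo passed to `vllm serve`, so adjust it if you set `--served-model-name`:

```python
from openai import OpenAI

# vLLM exposes an OpenAI-compatible endpoint at the host/port given above.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="0xSero/GLM-4.6-REAP-218B-A32B-W4A16-AutoRound",  # matches the serve argument
    messages=[{"role": "user", "content": "In one paragraph, what is a mixture-of-experts model?"}],
    max_tokens=256,
)
print(resp.choices[0].message.content)
```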

### SGLang

```bash
python3 -m sglang.launch_server \
    --model-path 0xSero/GLM-4.6-REAP-218B-A32B-W4A16-AutoRound \
    --tp-size 8 \
    --trust-remote-code
```

## Hardware Requirements

| Configuration | VRAM Required | Notes |
|---|---|---|
| 8x 24GB GPUs | ~192GB total | TP=4, PP=2, recommended |
| 4x 48GB GPUs | ~192GB total | TP=4, no PP needed |
| 8x 48GB GPUs | ~384GB total | Full speed, larger batches |

**Minimum:** 8x 24GB GPUs (RTX 3090/4090) or equivalent, ~192 GB total VRAM.
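
As a rough per-GPU budget on the 8x 24GB setup (an estimate assuming weights shard evenly, not a measured breakdown; real placement with TP=4/PP=2 and expert parallelism is less uniform):

```python
# Back-of-the-envelope VRAM budget for the 8x 24GB configuration.
gpus, vram_per_gpu = 8, 24.0
weights_gb = 116.0
per_gpu_weights = weights_gb / gpus   # ~14.5 GB of weights per GPU
usable = vram_per_gpu * 0.88          # --gpu-memory-utilization 0.88
kv_budget = usable - per_gpu_weights  # ~6.6 GB/GPU left for KV cache etc.
print(f"~{per_gpu_weights:.1f} GB weights/GPU, ~{kv_budget:.1f} GB for KV cache")
```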

## Quantization Details

### Method

Quantized using Intel AutoRound with the following configuration:

- Scheme: W4A16 (4-bit weights, 16-bit activations)
- Calibration samples: 64
- Sequence length: 512
- Batch size: 1

### Quantization Script

```python
#!/usr/bin/env python3
"""
GLM-4.6-REAP-218B W4A16 Quantization using Intel AutoRound

Produces SGLang/vLLM-compatible 4-bit quantized model.
"""

import logging
from datetime import datetime
from pathlib import Path

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s | %(levelname)s - %(message)s",
)
logger = logging.getLogger(__name__)

MODEL_ID = "/mnt/llm_models/GLM-4.6-REAP-218B-A32B"  # or "cerebras/GLM-4.6-REAP-218B-A32B"
OUTPUT_DIR = "/mnt/llm_models/GLM-4.6-REAP-218B-A32B-W4A16-AutoRound"


def main():
    logger.info("=" * 80)
    logger.info("GLM-4.6-REAP-218B W4A16 Quantization (Intel AutoRound)")
    logger.info("=" * 80)

    start = datetime.now()

    from auto_round import AutoRound

    logger.info(f"Model: {MODEL_ID}")
    logger.info(f"Output: {OUTPUT_DIR}")
    logger.info("Scheme: W4A16 (4-bit weights, 16-bit activations)")

    Path(OUTPUT_DIR).mkdir(parents=True, exist_ok=True)

    logger.info("Initializing AutoRound (CPU mode)...")
    autoround = AutoRound(
        MODEL_ID,
        scheme="W4A16",
        device="cpu",
        device_map="cpu",
        trust_remote_code=True,
        batch_size=1,
        seqlen=512,
        nsamples=64,
    )

    logger.info("Starting quantization...")
    autoround.quantize_and_save(OUTPUT_DIR, format="auto_round")

    elapsed = datetime.now() - start
    logger.info("=" * 80)
    logger.info(f"Done in {elapsed}")
    logger.info(f"Output: {OUTPUT_DIR}")
    logger.info("=" * 80)


if __name__ == "__main__":
    main()
```

## About the Base Model

GLM-4.6-REAP-218B-A32B is a Mixture of Experts (MoE) model from Cerebras with:

- 218 billion total parameters
- 32 billion active parameters per forward pass (see the routing sketch below)
- Strong performance on reasoning and long-context tasks
- Native support for 128k+ context windows
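
The total-vs-active split comes from sparse expert routing: each token activates only a few experts per MoE layer. Below is a generic top-k routing sketch in NumPy, purely illustrative of the mechanism; it is not GLM-4's actual router, and the expert count, dimensions, and k are made-up numbers:

```python
import numpy as np


def topk_route(x, gate_w, experts, k=2):
    """Generic top-k MoE routing sketch (illustrative, not GLM-4's code).

    Only the k highest-scoring experts run per token, which is why a model
    with 218B total parameters can use only ~32B on each forward pass.
    """
    scores = x @ gate_w                # router logits, one per expert
    top = np.argsort(scores)[-k:]      # indices of the k best experts
    weights = np.exp(scores[top])
    weights /= weights.sum()           # softmax over the chosen experts only
    return sum(w * experts[i](x) for w, i in zip(weights, top))


# Toy usage: 8 experts, route one token vector through the top 2.
rng = np.random.default_rng(0)
dim, n_experts = 16, 8
experts = [lambda x, W=rng.normal(size=(dim, dim)): x @ W for _ in range(n_experts)]
gate = rng.normal(size=(dim, n_experts))
y = topk_route(rng.normal(size=dim), gate, experts, k=2)
print(y.shape)  # (16,)
```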

For more details, see the [base model card](https://huggingface.co/cerebras/GLM-4.6-REAP-218B-A32B).

## Limitations

- Quantization may slightly reduce quality compared to the BF16 original
- Requires significant VRAM (~192 GB minimum across GPUs)
- Best results with tensor parallelism across 4-8 GPUs

## License

This quantized model inherits the license of the base model. See [cerebras/GLM-4.6-REAP-218B-A32B](https://huggingface.co/cerebras/GLM-4.6-REAP-218B-A32B) for licensing details.

## Acknowledgments

- Cerebras for the base GLM-4.6-REAP model
- Intel for the AutoRound quantization toolkit
- The vLLM and SGLang teams for inference support

## Citation

If you use this model, please cite the original:

```bibtex
@misc{glm46reap,
  title={GLM-4.6-REAP-218B-A32B},
  author={Cerebras},
  year={2025},
  url={https://huggingface.co/cerebras/GLM-4.6-REAP-218B-A32B}
}
```