0xSero committed on
Commit 5463b14 · verified · 1 Parent(s): 9105958

Update README.md

Files changed (1):
  1. README.md (+18 −14)
README.md CHANGED
@@ -66,20 +66,24 @@ The model correctly recalled all embedded facts from a long context:
 ### vLLM (Recommended)

 ```bash
-CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 vllm serve 0xSero/GLM-4.6-REAP-218B-A32B-W4A16-AutoRound \
-  --host 0.0.0.0 --port 8000 \
-  --tensor-parallel-size 4 --pipeline-parallel-size 2 \
-  --quantization auto-round \
-  --kv-cache-dtype fp8 \
-  --max-model-len 180000 \
-  --gpu-memory-utilization 0.82 \
-  --block-size 32 \
-  --max-num-seqs 12 \
-  --max-num-batched-tokens 8192 \
-  --swap-space 32 \
-  --enable-expert-parallel \
-  --disable-custom-all-reduce \
-  --disable-log-requests
+vllm serve /GLM-4.6-REAP-218B-A32B-W4A16-AutoRound \
+  --host 0.0.0.0 --port 8000 \
+  --tensor-parallel-size 4 --pipeline-parallel-size 2 \
+  --quantization auto-round \
+  --kv-cache-dtype fp8 \
+  --max-model-len 200000 \
+  --gpu-memory-utilization 0.88 \
+  --cpu-offload-gb 4 \
+  --block-size 32 \
+  --max-num-seqs 8 \
+  --max-num-batched-tokens 8192 \
+  --swap-space 32 \
+  --enable-expert-parallel \
+  --enable-prefix-caching \
+  --enable-chunked-prefill \
+  --disable-custom-all-reduce \
+  --disable-log-requests \
+  --trust-remote-code
 ```

 ### SGLang
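
For reference, the updated serve command exposes vLLM's standard OpenAI-compatible API on port 8000. A minimal smoke-test sketch, assuming the server from the new command is running locally and the model is registered under the path passed to `vllm serve` (no `--served-model-name` override; the prompt is a placeholder):

```bash
# Confirm the server is up and see the registered model name
curl http://localhost:8000/v1/models

# Send a short chat completion request; "model" must match the name from /v1/models
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "/GLM-4.6-REAP-218B-A32B-W4A16-AutoRound",
        "messages": [{"role": "user", "content": "Say hello."}],
        "max_tokens": 32
      }'
```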