Update README.md

README.md CHANGED

@@ -66,20 +66,24 @@ The model correctly recalled all embedded facts from a long context:
### vLLM (Recommended)

```bash
vllm serve /GLM-4.6-REAP-218B-A32B-W4A16-AutoRound \
  --host 0.0.0.0 --port 8000 \
  --tensor-parallel-size 4 --pipeline-parallel-size 2 \
  --quantization auto-round \
  --kv-cache-dtype fp8 \
  --max-model-len 200000 \
  --gpu-memory-utilization 0.88 \
  --cpu-offload-gb 4 \
  --block-size 32 \
  --max-num-seqs 8 \
  --max-num-batched-tokens 8192 \
  --swap-space 32 \
  --enable-expert-parallel \
  --enable-prefix-caching \
  --enable-chunked-prefill \
  --disable-custom-all-reduce \
  --disable-log-requests \
  --trust-remote-code
```
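Once the server is up, vLLM exposes an OpenAI-compatible API on the configured host and port. A minimal client sketch using only the standard library (the model id passed below mirrors the served path and is an assumption; query `GET /v1/models` to confirm the exact id the server registered):

```python
import json
import urllib.request

# Chat-completion payload for the vLLM OpenAI-compatible endpoint started
# above. The "model" value is assumed to match the served path; verify it
# against GET /v1/models on your deployment.
PAYLOAD = {
    "model": "/GLM-4.6-REAP-218B-A32B-W4A16-AutoRound",
    "messages": [{"role": "user", "content": "Say hello in one word."}],
    "max_tokens": 16,
}


def build_request(base_url="http://localhost:8000"):
    """Return a ready-to-send urllib Request for /v1/chat/completions."""
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(PAYLOAD).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# To actually send it (requires the server above to be running):
# with urllib.request.urlopen(build_request()) as resp:
#     print(json.load(resp)["choices"][0]["message"]["content"])
```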
### SGLang