# GPTQ 4bit Inference

FastChat supports GPTQ 4-bit inference with [GPTQ-for-LLaMa](https://github.com/qwopqwop200/GPTQ-for-LLaMa).

1. Windows users: use the `old-cuda` branch.
2. Linux users: the `fastest-inference-4bit` branch is recommended.
## Install

Set up the environment:

```bash
# cd /path/to/FastChat
git clone https://github.com/qwopqwop200/GPTQ-for-LLaMa.git repositories/GPTQ-for-LLaMa
cd repositories/GPTQ-for-LLaMa
# Windows users should use the `old-cuda` branch
git switch fastest-inference-4bit
# Install the `quant-cuda` package into FastChat's virtualenv
python3 setup_cuda.py install
pip3 install texttable
```
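If the build succeeds, you can optionally verify that the compiled CUDA kernels are importable from FastChat's virtualenv. The check below is a small sketch; it assumes the extension is registered under the module name `quant_cuda`, the name used by `setup_cuda.py` on this branch.

```bash
# Optional sanity check: confirm the compiled CUDA extension can be imported.
# The module name `quant_cuda` is an assumption based on setup_cuda.py.
python3 -c "import quant_cuda; print('quant_cuda extension loaded')"
```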
Chat with the CLI:

```bash
python3 -m fastchat.serve.cli \
    --model-path models/vicuna-7B-1.1-GPTQ-4bit-128g \
    --gptq-wbits 4 \
    --gptq-groupsize 128
```
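The `--gptq-wbits` and `--gptq-groupsize` flags must match the settings used when the checkpoint was quantized. As a hypothetical example, a checkpoint quantized without grouping (compare the `group-size -1` row in the benchmark below) would be loaded with `--gptq-groupsize -1`; the model path here is illustrative only.

```bash
# Hypothetical example: loading a checkpoint quantized with no group size.
# The flag values must mirror the quantization settings of the checkpoint.
python3 -m fastchat.serve.cli \
    --model-path models/vicuna-7B-1.1-GPTQ-4bit \
    --gptq-wbits 4 \
    --gptq-groupsize -1
```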
Start model worker:

```bash
# Download the quantized model from Hugging Face
# Make sure you have git-lfs installed (https://git-lfs.com)
git lfs install
git clone https://huggingface.co/TheBloke/vicuna-7B-1.1-GPTQ-4bit-128g models/vicuna-7B-1.1-GPTQ-4bit-128g

python3 -m fastchat.serve.model_worker \
    --model-path models/vicuna-7B-1.1-GPTQ-4bit-128g \
    --gptq-wbits 4 \
    --gptq-groupsize 128

# You can specify which quantized checkpoint to use
python3 -m fastchat.serve.model_worker \
    --model-path models/vicuna-7B-1.1-GPTQ-4bit-128g \
    --gptq-ckpt models/vicuna-7B-1.1-GPTQ-4bit-128g/vicuna-7B-1.1-GPTQ-4bit-128g.safetensors \
    --gptq-wbits 4 \
    --gptq-groupsize 128 \
    --gptq-act-order
```
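Like any other FastChat model worker, the GPTQ worker registers itself with a controller. A minimal sketch of the standard FastChat serving flow follows: controller first, then the worker command above, then a test message. The model name used below assumes it defaults to the basename of the model directory.

```bash
# Standard FastChat serving flow (sketch):
# 1. start the controller, 2. start the model worker shown above,
# 3. send a test message to the registered model.
python3 -m fastchat.serve.controller &
# ... launch the model worker as shown above ...
python3 -m fastchat.serve.test_message --model-name vicuna-7B-1.1-GPTQ-4bit-128g
```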
## Benchmark

| LLaMA-13B | branch                 | Bits | group-size | memory (MiB) | PPL (c4) | Median (s/token) | act-order | speed up |
| --------- | ---------------------- | ---- | ---------- | ------------ | -------- | ---------------- | --------- | -------- |
| FP16      | fastest-inference-4bit | 16   | -          | 26634        | 6.96     | 0.0383           | -         | 1x       |
| GPTQ      | triton                 | 4    | 128        | 8590         | 6.97     | 0.0551           | -         | 0.69x    |
| GPTQ      | fastest-inference-4bit | 4    | 128        | 8699         | 6.97     | 0.0429           | true      | 0.89x    |
| GPTQ      | fastest-inference-4bit | 4    | 128        | 8699         | 7.03     | 0.0287           | false     | 1.33x    |
| GPTQ      | fastest-inference-4bit | 4    | -1         | 8448         | 7.12     | 0.0284           | false     | 1.44x    |