Instructions to use Qwen/Qwen-1_8B-Chat-Int8 with libraries, inference providers, notebooks, and local apps.

- Libraries
- Transformers

How to use Qwen/Qwen-1_8B-Chat-Int8 with Transformers:

```python
# Use a pipeline as a high-level helper
from transformers import pipeline
pipe = pipeline("text-generation", model="Qwen/Qwen-1_8B-Chat-Int8", trust_remote_code=True)

# Load model directly
from transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-1_8B-Chat-Int8", trust_remote_code=True, dtype="auto")
```

- Notebooks
- Google Colab
- Kaggle
- Local Apps
- vLLM
How to use Qwen/Qwen-1_8B-Chat-Int8 with vLLM:

Install from pip and serve the model:

```shell
# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "Qwen/Qwen-1_8B-Chat-Int8"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "Qwen/Qwen-1_8B-Chat-Int8",
    "prompt": "Once upon a time,",
    "max_tokens": 512,
    "temperature": 0.5
  }'
```

Use Docker:

```shell
docker model run hf.co/Qwen/Qwen-1_8B-Chat-Int8
```
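The curl call above can also be issued from Python. Below is a minimal sketch using only the standard library; the endpoint and payload mirror the curl example, and actually sending the request assumes a vLLM server is already running on localhost:8000:

```python
import json
import urllib.request

# Same OpenAI-compatible payload as the curl example above.
payload = {
    "model": "Qwen/Qwen-1_8B-Chat-Int8",
    "prompt": "Once upon a time,",
    "max_tokens": 512,
    "temperature": 0.5,
}

req = urllib.request.Request(
    "http://localhost:8000/v1/completions",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
    method="POST",
)
# To actually send the request (requires a running server):
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp)["choices"][0]["text"])
```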
- SGLang
How to use Qwen/Qwen-1_8B-Chat-Int8 with SGLang:

Install from pip and serve the model:

```shell
# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
  --model-path "Qwen/Qwen-1_8B-Chat-Int8" \
  --host 0.0.0.0 \
  --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "Qwen/Qwen-1_8B-Chat-Int8",
    "prompt": "Once upon a time,",
    "max_tokens": 512,
    "temperature": 0.5
  }'
```

Use Docker images:

```shell
docker run --gpus all \
  --shm-size 32g \
  -p 30000:30000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  --env "HF_TOKEN=<secret>" \
  --ipc=host \
  lmsysorg/sglang:latest \
  python3 -m sglang.launch_server \
    --model-path "Qwen/Qwen-1_8B-Chat-Int8" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "Qwen/Qwen-1_8B-Chat-Int8",
    "prompt": "Once upon a time,",
    "max_tokens": 512,
    "temperature": 0.5
  }'
```

- Docker Model Runner
How to use Qwen/Qwen-1_8B-Chat-Int8 with Docker Model Runner:

```shell
docker model run hf.co/Qwen/Qwen-1_8B-Chat-Int8
```
- Qwen-1.8B-Chat-Int8
- Introduction
- Requirements
- Dependency
- Quickstart
- Tokenizer
- Quantization
- Model
- Evaluation
- Reproduction
- FAQ
- Citation
- License Agreement
- Contact Us
Qwen-1.8B-Chat-Int8
🤗 Hugging Face | 🤗 ModelScope | 📑 Paper | 🖥️ Demo
WeChat (微信) | Discord | API
Introduction
Qwen-1.8B is the 1.8B-parameter version of the large language model series, Qwen (abbr. Tongyi Qianwen), proposed by Alibaba Cloud. Qwen-1.8B is a Transformer-based large language model, which is pretrained on a large volume of data, including web texts, books, code, etc. Additionally, based on the pretrained Qwen-1.8B, we release Qwen-1.8B-Chat, a large-model-based AI assistant, which is trained with alignment techniques. This repository is for Qwen-1.8B-Chat-Int8.
The features of Qwen-1.8B include:
- Low-cost deployment: We provide int4 and int8 quantized versions; the minimum memory requirement for inference is less than 2GB, and generating 2048 tokens uses only 3GB of memory. The minimum memory requirement for finetuning is only 6GB.
- Large-scale high-quality training corpora: It is pretrained on over 2.2 trillion tokens, including Chinese, English, multilingual texts, code, and mathematics, covering general and professional fields. The distribution of the pre-training corpus has been optimized through a large number of ablation experiments.
- Good performance: It supports 8192 context length and significantly surpasses existing open-source models of similar scale on multiple Chinese and English downstream evaluation tasks (including commonsense, reasoning, code, mathematics, etc.), and even surpasses some larger-scale models in several benchmarks. See below for specific evaluation results.
- More comprehensive vocabulary coverage: Compared with other open-source models based on Chinese and English vocabularies, Qwen-1.8B uses a vocabulary of over 150K tokens. This vocabulary is more friendly to multiple languages, enabling users to directly further enhance the capability for certain languages without expanding the vocabulary.
- System prompt: Qwen-1.8B-Chat can realize role playing, language style transfer, task setting, and behavior setting by using a system prompt.
For more details about the open-source Qwen-1.8B-Chat-Int8 model, please refer to the GitHub code repository.
Requirements

- Python 3.8 and above
- PyTorch 2.0 and above
- CUDA 11.4 and above is recommended (for GPU users, flash-attention users, etc.)
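The version floors above can be checked at runtime. A minimal convenience sketch (not part of the official repo; the function name is illustrative):

```python
import sys

def meets_requirements(py_version=sys.version_info, torch_version=None):
    """Return True if the interpreter (and, optionally, a torch version string)
    meet the stated minimums: Python >= 3.8, PyTorch >= 2.0."""
    if tuple(py_version[:2]) < (3, 8):
        return False
    if torch_version is not None:
        major, minor = (int(x) for x in torch_version.split(".")[:2])
        if (major, minor) < (2, 0):
            return False
    return True

# Example: check the running interpreter plus an installed torch version.
# import torch; print(meets_requirements(torch_version=torch.__version__))
```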
Dependency

To run Qwen-1.8B-Chat-Int8, please make sure you meet the above requirements and execute the following pip commands to install the dependent libraries. If you run into problems installing auto-gptq, we advise you to check the official repo for a pre-built wheel.
```shell
pip install transformers==4.32.0 accelerate tiktoken einops scipy transformers_stream_generator==0.0.4 peft deepspeed
pip install auto-gptq optimum
```
In addition, it is recommended to install the flash-attention library (flash attention 2 is now supported) for higher efficiency and lower memory usage.
```shell
git clone https://github.com/Dao-AILab/flash-attention
cd flash-attention && pip install .
# The installations below are optional and can be slow to build.
# pip install csrc/layer_norm
# pip install csrc/rotary
```
Quickstart

We show an example of multi-turn interaction with Qwen-1.8B-Chat-Int8 in the following code:
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-1_8B-Chat-Int8", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen-1_8B-Chat-Int8",
    device_map="auto",
    trust_remote_code=True
).eval()

response, history = model.chat(tokenizer, "你好", history=None)
print(response)
# 你好！很高兴为你提供帮助。

# Qwen-1.8B-Chat can realize role playing, language style transfer, task setting,
# and behavior setting via the system prompt.
response, _ = model.chat(tokenizer, "你好哦", history=None, system="请用二次元可爱语气和我说话")
print(response)
# 你好哦！我是一只可爱的二次元猫咪哦，不知道你有什么问题需要我帮忙解答吗？

response, _ = model.chat(tokenizer, "My colleague works diligently", history=None, system="You will write beautiful compliments according to needs")
print(response)
# Your colleague is an outstanding worker! Their dedication and hard work are truly inspiring. They always go above and beyond to ensure that
# their tasks are completed on time and to the highest standard. I am lucky to have them as a colleague, and I know I can count on them to handle any challenge that comes their way.
```
For more information, please refer to our GitHub repo.
Tokenizer
Our tokenizer based on tiktoken is different from other tokenizers, e.g., the sentencepiece tokenizer. You need to pay attention to special tokens, especially during finetuning. For more detailed information on the tokenizer and its use in fine-tuning, please refer to the documentation.
Quantization
Usage
Note: we provide a new quantization solution based on AutoGPTQ and release the Int8 quantized model for Qwen-1.8B-Chat (click here). Compared with the previous solution, it achieves nearly lossless model quality with lower memory cost and faster inference speed.
Here we demonstrate how to use our provided quantized model for inference. Before you start, make sure you meet the requirements of auto-gptq (e.g., torch 2.0 and above, transformers 4.32.0 and above, etc.) and install the required packages:
```shell
pip install auto-gptq optimum
```
If you run into problems installing auto-gptq, we advise you to check the official repo for a pre-built wheel.

Then you can load the quantized model easily and run inference just as usual:
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-1_8B-Chat-Int8", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen-1_8B-Chat-Int8",
    device_map="auto",
    trust_remote_code=True
).eval()
response, history = model.chat(tokenizer, "你好", history=None)
print(response)
```
Performance

We illustrate the performance of the FP32, BF16, Int8, and Int4 models on the benchmarks. Results are shown below:
| Quantization | MMLU | CEval (val) | GSM8K | HumanEval |
|---|---|---|---|---|
| FP32 | 43.4 | 57.0 | 33.0 | 26.8 |
| BF16 | 43.3 | 55.6 | 33.7 | 26.2 |
| Int8 | 43.1 | 55.8 | 33.0 | 27.4 |
| Int4 | 42.9 | 52.8 | 31.2 | 25.0 |
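The table bears out the "nearly lossless" claim for Int8: relative to FP32, its largest drop is 1.2 points (CEval), GSM8K is unchanged, and HumanEval even improves slightly. A quick check over the table's numbers (function name is illustrative):

```python
# Benchmark scores copied from the table above.
scores = {
    "FP32": {"MMLU": 43.4, "CEval": 57.0, "GSM8K": 33.0, "HumanEval": 26.8},
    "Int8": {"MMLU": 43.1, "CEval": 55.8, "GSM8K": 33.0, "HumanEval": 27.4},
    "Int4": {"MMLU": 42.9, "CEval": 52.8, "GSM8K": 31.2, "HumanEval": 25.0},
}

def max_drop(quantized, reference="FP32"):
    """Largest score drop (positive = worse) of a quantized model vs. the reference."""
    return max(
        round(scores[reference][task] - scores[quantized][task], 1)
        for task in scores[reference]
    )

print(max_drop("Int8"))  # CEval drops the most: 57.0 - 55.8 = 1.2
print(max_drop("Int4"))  # 57.0 - 52.8 = 4.2
```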
Inference Speed

We measured the average inference speed (tokens/s) of generating 2048 and 8192 tokens under FP32 and BF16 precision and under Int8 and Int4 quantization, respectively.
| Quantization | FlashAttn | Speed (2048 tokens) | Speed (8192 tokens) |
|---|---|---|---|
| FP32 | v2 | 52.96 | 47.35 |
| BF16 | v2 | 54.09 | 54.04 |
| Int8 | v2 | 55.56 | 55.62 |
| Int4 | v2 | 71.07 | 76.45 |
| FP32 | v1 | 52.00 | 45.80 |
| BF16 | v1 | 51.70 | 55.04 |
| Int8 | v1 | 53.16 | 53.33 |
| Int4 | v1 | 69.82 | 67.44 |
| FP32 | Disabled | 52.28 | 44.95 |
| BF16 | Disabled | 48.17 | 45.01 |
| Int8 | Disabled | 52.16 | 52.99 |
| Int4 | Disabled | 68.37 | 65.94 |
In detail, the profiling setting is generating 8192 new tokens with 1 context token. The profiling runs on a single A100-SXM4-80G GPU with PyTorch 2.0.1 and CUDA 11.4. The inference speed is averaged over the 8192 generated tokens.
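Tokens-per-second figures like those above come from dividing the number of generated tokens by wall-clock time. A minimal timing harness sketch (the `fake_generate` stub stands in for a real `model.generate` call; all names here are illustrative):

```python
import time

def measure_speed(generate, n_tokens):
    """Average generation speed in tokens/s for producing n_tokens new tokens."""
    start = time.perf_counter()
    generate(n_tokens)
    elapsed = time.perf_counter() - start
    return n_tokens / elapsed

# Stub standing in for model.generate(...): sleeps ~1 ms per "token".
def fake_generate(n_tokens):
    time.sleep(n_tokens * 0.001)

speed = measure_speed(fake_generate, 100)  # roughly 1000 tokens/s for this stub
```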
GPU Memory Usage

We also profiled the peak GPU memory usage for encoding 2048 tokens and for generating 8192 tokens (with a single token as context) under the FP32, BF16, Int8, and Int4 quantization levels, respectively. The results are shown below.
| Quantization Level | Peak Usage for Encoding 2048 Tokens | Peak Usage for Generating 8192 Tokens |
|---|---|---|
| FP32 | 8.45GB | 13.06GB |
| BF16 | 4.23GB | 6.48GB |
| Int8 | 3.48GB | 5.34GB |
| Int4 | 2.91GB | 4.80GB |
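The table translates into substantial savings from quantization: for the 8192-token generation case, Int8 needs less than half the peak memory of FP32. A quick computation over the table's numbers (function name is illustrative):

```python
# Peak memory in GB for generating 8192 tokens, copied from the table above.
peak_8192 = {"FP32": 13.06, "BF16": 6.48, "Int8": 5.34, "Int4": 4.80}

def savings_vs(reference, quantized):
    """Fraction of peak memory saved by the quantized model vs. the reference."""
    return 1 - peak_8192[quantized] / peak_8192[reference]

print(f"Int8 vs FP32: {savings_vs('FP32', 'Int8'):.0%}")  # 59%
print(f"Int4 vs BF16: {savings_vs('BF16', 'Int4'):.0%}")  # 26%
```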
The above speed and memory profiling were conducted using this script.
Model

Qwen-1.8B-Chat shares the architecture of the Qwen-1.8B pretrained model. The details of the model architecture are listed as follows:
| Hyperparameter | Value |
|---|---|
| n_layers | 24 |
| n_heads | 16 |
| d_model | 2048 |
| vocab size | 151851 |
| sequence length | 8192 |
For position encoding, FFN activation function, and normalization, we adopt the prevalent practices, i.e., RoPE relative position encoding, SwiGLU as the activation function, and RMSNorm for normalization (with optional installation of flash-attention for acceleration).

For tokenization, compared with the current mainstream open-source models based on Chinese and English vocabularies, Qwen-1.8B-Chat uses a vocabulary of over 150K tokens, built on the cl100k_base BPE vocabulary used by GPT-4. It prioritizes efficient encoding of Chinese, English, and code data, is friendlier to other languages, and enables users to directly enhance the capability for certain languages without expanding the vocabulary. It splits numbers into single digits and uses the efficient tiktoken tokenizer library.
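The digit-splitting behaviour mentioned above can be illustrated with a small pre-tokenization sketch. This is not the actual Qwen tokenizer (which is built on tiktoken); it only demonstrates what "splits numbers into single digits" means:

```python
import re

def split_digits(text):
    """Illustrative pre-tokenization step: break each run of digits into single digits."""
    pieces = []
    for chunk in re.split(r"(\d+)", text):
        if chunk.isdigit():
            pieces.extend(chunk)       # "8192" -> "8", "1", "9", "2"
        elif chunk:
            pieces.append(chunk)
    return pieces

print(split_digits("supports 8192 context length"))
# ['supports ', '8', '1', '9', '2', ' context length']
```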
Evaluation

For Qwen-1.8B-Chat, we also evaluate the model on standard downstream tasks, including Chinese understanding (C-Eval), English understanding (MMLU), code (HumanEval), and mathematics (GSM8K), as well as on long-context understanding and tool usage. Since the aligned Qwen-1.8B-Chat model gained a strong ability to call external systems, we also evaluated its tool-use capability.

Note: Due to rounding errors caused by hardware and frameworks, differences in reproduced results are possible.
Chinese Evaluation

C-Eval

We demonstrate the accuracy of Qwen-1.8B-Chat on the C-Eval validation set:
| Model | Acc. |
|---|---|
| RedPajama-INCITE-Chat-3B | 18.3 |
| OpenBuddy-3B | 23.5 |
| Firefly-Bloom-1B4 | 23.6 |
| OpenLLaMA-Chinese-3B | 24.4 |
| LLaMA2-7B-Chat | 31.9 |
| ChatGLM2-6B-Chat | 52.6 |
| InternLM-7B-Chat | 53.6 |
| Qwen-1.8B-Chat (0-shot) | 55.6 |
| Qwen-7B-Chat (0-shot) | 59.7 |
| Qwen-7B-Chat (5-shot) | 59.3 |
The zero-shot accuracy of Qwen-1.8B-Chat on the C-Eval test set is provided below:
| Model | Avg. | STEM | Social Sciences | Humanities | Others |
|---|---|---|---|---|---|
| Chinese-Alpaca-Plus-13B | 41.5 | 36.6 | 49.7 | 43.1 | 41.2 |
| Chinese-Alpaca-2-7B | 40.3 | - | - | - | - |
| ChatGLM2-6B-Chat | 50.1 | 46.4 | 60.4 | 50.6 | 46.9 |
| Baichuan-13B-Chat | 51.5 | 43.7 | 64.6 | 56.2 | 49.2 |
| Qwen-1.8B-Chat | 53.8 | 48.4 | 68.0 | 56.5 | 48.3 |
| Qwen-7B-Chat | 58.6 | 53.3 | 72.1 | 62.8 | 52.0 |
English Evaluation

MMLU

The accuracy of Qwen-1.8B-Chat on MMLU is provided below. The performance of Qwen-1.8B-Chat remains near the top among aligned models of comparable size.
| Model | Acc. |
|---|---|
| Firefly-Bloom-1B4 | 23.8 |
| OpenBuddy-3B | 25.5 |
| RedPajama-INCITE-Chat-3B | 25.5 |
| OpenLLaMA-Chinese-3B | 25.7 |
| ChatGLM2-6B-Chat | 46.0 |
| LLaMA2-7B-Chat | 46.2 |
| InternLM-7B-Chat | 51.1 |
| Baichuan2-7B-Chat | 52.9 |
| Qwen-1.8B-Chat (0-shot) | 43.3 |
| Qwen-7B-Chat (0-shot) | 55.8 |
| Qwen-7B-Chat (5-shot) | 57.0 |
Coding Evaluation

The zero-shot Pass@1 of Qwen-1.8B-Chat on HumanEval is shown below:
| Model | Pass@1 |
|---|---|
| Firefly-Bloom-1B4 | 0.6 |
| OpenLLaMA-Chinese-3B | 4.9 |
| RedPajama-INCITE-Chat-3B | 6.1 |
| OpenBuddy-3B | 10.4 |
| ChatGLM2-6B-Chat | 11.0 |
| LLaMA2-7B-Chat | 12.2 |
| Baichuan2-7B-Chat | 13.4 |
| InternLM-7B-Chat | 14.6 |
| Qwen-1.8B-Chat | 26.2 |
| Qwen-7B-Chat | 37.2 |
Mathematics Evaluation

The accuracy of Qwen-1.8B-Chat on GSM8K is shown below:
| Model | Acc. |
|---|---|
| Firefly-Bloom-1B4 | 2.4 |
| RedPajama-INCITE-Chat-3B | 2.5 |
| OpenLLaMA-Chinese-3B | 3.0 |
| OpenBuddy-3B | 12.6 |
| LLaMA2-7B-Chat | 26.3 |
| ChatGLM2-6B-Chat | 28.8 |
| Baichuan2-7B-Chat | 32.8 |
| InternLM-7B-Chat | 33.0 |
| Qwen-1.8B-Chat (0-shot) | 33.7 |
| Qwen-7B-Chat (0-shot) | 50.3 |
| Qwen-7B-Chat (8-shot) | 54.1 |
Reproduction

We provide evaluation scripts to reproduce the performance of our model; see this link for details. Note: due to rounding errors caused by hardware and frameworks, small differences in reproduced results are normal.
FAQ
If you run into problems, please consult the FAQ and existing issues for a solution before opening a new issue.
Citation

If you find our work helpful, feel free to cite us:
```bibtex
@article{qwen,
  title={Qwen Technical Report},
  author={Jinze Bai and Shuai Bai and Yunfei Chu and Zeyu Cui and Kai Dang and Xiaodong Deng and Yang Fan and Wenbin Ge and Yu Han and Fei Huang and Binyuan Hui and Luo Ji and Mei Li and Junyang Lin and Runji Lin and Dayiheng Liu and Gao Liu and Chengqiang Lu and Keming Lu and Jianxin Ma and Rui Men and Xingzhang Ren and Xuancheng Ren and Chuanqi Tan and Sinan Tan and Jianhong Tu and Peng Wang and Shijie Wang and Wei Wang and Shengguang Wu and Benfeng Xu and Jin Xu and An Yang and Hao Yang and Jian Yang and Shusheng Yang and Yang Yao and Bowen Yu and Hongyi Yuan and Zheng Yuan and Jianwei Zhang and Xingxuan Zhang and Yichang Zhang and Zhenru Zhang and Chang Zhou and Jingren Zhou and Xiaohuan Zhou and Tianhang Zhu},
  journal={arXiv preprint arXiv:2309.16609},
  year={2023}
}
```
License Agreement

Our code and model checkpoints are open for research purposes. Check the LICENSE file for details about the license. For commercial use, please contact us.
Contact Us

If you are interested in leaving a message for our research team or product team, join our Discord or WeChat groups! Also, feel free to send an email to qianwen_opensource@alibabacloud.com.