Qwen-1.8B-Chat-Int8


๐Ÿค— Hugging Face | ๐Ÿค– ModelScope | ๐Ÿ“‘ Paper | ๐Ÿ–ฅ๏ธ Demo
WeChat (ๅพฎไฟก) | Discord | API


ไป‹็ป๏ผˆIntroduction๏ผ‰

้€šไน‰ๅƒ้—ฎ-1.8B๏ผˆQwen-1.8B๏ผ‰ๆ˜ฏ้˜ฟ้‡Œไบ‘็ ”ๅ‘็š„้€šไน‰ๅƒ้—ฎๅคงๆจกๅž‹็ณปๅˆ—็š„18ไบฟๅ‚ๆ•ฐ่ง„ๆจก็š„ๆจกๅž‹ใ€‚Qwen-1.8Bๆ˜ฏๅŸบไบŽTransformer็š„ๅคง่ฏญ่จ€ๆจกๅž‹, ๅœจ่ถ…ๅคง่ง„ๆจก็š„้ข„่ฎญ็ปƒๆ•ฐๆฎไธŠ่ฟ›่กŒ่ฎญ็ปƒๅพ—ๅˆฐใ€‚้ข„่ฎญ็ปƒๆ•ฐๆฎ็ฑปๅž‹ๅคšๆ ท๏ผŒ่ฆ†็›–ๅนฟๆณ›๏ผŒๅŒ…ๆ‹ฌๅคง้‡็ฝ‘็ปœๆ–‡ๆœฌใ€ไธ“ไธšไนฆ็ฑใ€ไปฃ็ ็ญ‰ใ€‚ๅŒๆ—ถ๏ผŒๅœจQwen-1.8B็š„ๅŸบ็ก€ไธŠ๏ผŒๆˆ‘ไปฌไฝฟ็”จๅฏน้ฝๆœบๅˆถๆ‰“้€ ไบ†ๅŸบไบŽๅคง่ฏญ่จ€ๆจกๅž‹็š„AIๅŠฉๆ‰‹Qwen-1.8B-Chatใ€‚ๆœฌไป“ๅบ“ไธบQwen-1.8B-Chat็š„Int8้‡ๅŒ–ๆจกๅž‹็š„ไป“ๅบ“ใ€‚

้€šไน‰ๅƒ้—ฎ-1.8B๏ผˆQwen-1.8B๏ผ‰ไธป่ฆๆœ‰ไปฅไธ‹็‰น็‚น๏ผš

  1. ไฝŽๆˆๆœฌ้ƒจ็ฝฒ๏ผšๆไพ›int8ๅ’Œint4้‡ๅŒ–็‰ˆๆœฌ๏ผŒๆŽจ็†ๆœ€ไฝŽไป…้œ€ไธๅˆฐ2GBๆ˜พๅญ˜๏ผŒ็”Ÿๆˆ2048 tokensไป…้œ€3GBๆ˜พๅญ˜ๅ ็”จใ€‚ๅพฎ่ฐƒๆœ€ไฝŽไป…้œ€6GBใ€‚
  2. ๅคง่ง„ๆจก้ซ˜่ดจ้‡่ฎญ็ปƒ่ฏญๆ–™๏ผšไฝฟ็”จ่ถ…่ฟ‡2.2ไธ‡ไบฟtokens็š„ๆ•ฐๆฎ่ฟ›่กŒ้ข„่ฎญ็ปƒ๏ผŒๅŒ…ๅซ้ซ˜่ดจ้‡ไธญใ€่‹ฑใ€ๅคš่ฏญ่จ€ใ€ไปฃ็ ใ€ๆ•ฐๅญฆ็ญ‰ๆ•ฐๆฎ๏ผŒๆถต็›–้€š็”จๅŠไธ“ไธš้ข†ๅŸŸ็š„่ฎญ็ปƒ่ฏญๆ–™ใ€‚้€š่ฟ‡ๅคง้‡ๅฏนๆฏ”ๅฎž้ชŒๅฏน้ข„่ฎญ็ปƒ่ฏญๆ–™ๅˆ†ๅธƒ่ฟ›่กŒไบ†ไผ˜ๅŒ–ใ€‚
  3. ไผ˜็ง€็š„ๆ€ง่ƒฝ๏ผšQwen-1.8Bๆ”ฏๆŒ8192ไธŠไธ‹ๆ–‡้•ฟๅบฆ๏ผŒๅœจๅคšไธชไธญ่‹ฑๆ–‡ไธ‹ๆธธ่ฏ„ๆต‹ไปปๅŠกไธŠ๏ผˆๆถต็›–ๅธธ่ฏ†ๆŽจ็†ใ€ไปฃ็ ใ€ๆ•ฐๅญฆใ€็ฟป่ฏ‘็ญ‰๏ผ‰๏ผŒๆ•ˆๆžœๆ˜พ่‘—่ถ…่ถŠ็Žฐๆœ‰็š„็›ธ่ฟ‘่ง„ๆจกๅผ€ๆบๆจกๅž‹๏ผŒๅ…ทไฝ“่ฏ„ๆต‹็ป“ๆžœ่ฏท่ฏฆ่งไธ‹ๆ–‡ใ€‚
  4. ่ฆ†็›–ๆ›ดๅ…จ้ข็š„่ฏ่กจ๏ผš็›ธๆฏ”็›ฎๅ‰ไปฅไธญ่‹ฑ่ฏ่กจไธบไธป็š„ๅผ€ๆบๆจกๅž‹๏ผŒQwen-1.8Bไฝฟ็”จไบ†็บฆ15ไธ‡ๅคงๅฐ็š„่ฏ่กจใ€‚่ฏฅ่ฏ่กจๅฏนๅคš่ฏญ่จ€ๆ›ดๅŠ ๅ‹ๅฅฝ๏ผŒๆ–นไพฟ็”จๆˆทๅœจไธๆ‰ฉๅฑ•่ฏ่กจ็š„ๆƒ…ๅ†ตไธ‹ๅฏน้ƒจๅˆ†่ฏญ็ง่ฟ›่กŒ่ƒฝๅŠ›ๅขžๅผบๅ’Œๆ‰ฉๅฑ•ใ€‚
  5. ็ณป็ปŸๆŒ‡ไปค่ทŸ้š๏ผšQwen-1.8B-Chatๅฏไปฅ้€š่ฟ‡่ฐƒๆ•ด็ณป็ปŸๆŒ‡ไปค๏ผŒๅฎž็Žฐ่ง’่‰ฒๆ‰ฎๆผ”๏ผŒ่ฏญ่จ€้ฃŽๆ ผ่ฟ็งป๏ผŒไปปๅŠก่ฎพๅฎš๏ผŒๅ’Œ่กŒไธบ่ฎพๅฎš็ญ‰่ƒฝๅŠ›ใ€‚

ๅฆ‚ๆžœๆ‚จๆƒณไบ†่งฃๆ›ดๅคšๅ…ณไบŽ้€šไน‰ๅƒ้—ฎ1.8Bๅผ€ๆบๆจกๅž‹็š„็ป†่Š‚๏ผŒๆˆ‘ไปฌๅปบ่ฎฎๆ‚จๅ‚้˜…GitHubไปฃ็ ๅบ“ใ€‚

Qwen-1.8B is the 1.8B-parameter version of the large language model series, Qwen (abbr. Tongyi Qianwen), proposed by Aibaba Cloud. Qwen-1.8B is a Transformer-based large language model, which is pretrained on a large volume of data, including web texts, books, codes, etc. Additionally, based on the pretrained Qwen-1.8B, we release Qwen-1.8B-Chat, a large-model-based AI assistant, which is trained with alignment techniques. This repository is the one for Qwen-1.8B-Chat-int8.

The features of Qwen-1.8B include:

  1. Low-cost deployment: We provide int4 and int8 quantized versions, the minimum memory requirment for inference is less than 2GB, generating 2048 tokens only 3GB of memory usage. The minimum memory requirment of finetuning is only 6GB.
  2. Large-scale high-quality training corpora: It is pretrained on over 2.2 trillion tokens, including Chinese, English, multilingual texts, code, and mathematics, covering general and professional fields. The distribution of the pre-training corpus has been optimized through a large number of ablation experiments.
  3. Good performance: It supports 8192 context length and significantly surpasses existing open-source models of similar scale on multiple Chinese and English downstream evaluation tasks (including commonsense, reasoning, code, mathematics, etc.), and even surpasses some larger-scale models in several benchmarks. See below for specific evaluation results.
  4. More comprehensive vocabulary coverage: Compared with other open-source models based on Chinese and English vocabularies, Qwen-1.8B uses a vocabulary of over 150K tokens. This vocabulary is more friendly to multiple languages, enabling users to directly further enhance the capability for certain languages without expanding the vocabulary.
  5. System prompt: Qwen-1.8B-Chat can realize roly playing, language style transfer, task setting, and behavior setting by using system prompt.

For more details about the open-source model of Qwen-1.8B-chat int8, please refer to the GitHub code repository.


่ฆๆฑ‚๏ผˆRequirements๏ผ‰

  • python 3.8ๅŠไปฅไธŠ็‰ˆๆœฌ
  • pytorch 2.0ๅŠไปฅไธŠ็‰ˆๆœฌ
  • ๅปบ่ฎฎไฝฟ็”จCUDA 11.4ๅŠไปฅไธŠ๏ผˆGPU็”จๆˆทใ€flash-attention็”จๆˆท็ญ‰้œ€่€ƒ่™‘ๆญค้€‰้กน๏ผ‰
  • python 3.8 and above
  • pytorch 2.0 and above
  • CUDA 11.4 and above are recommended (this is for GPU users, flash-attention users, etc.)

ไพ่ต–้กน๏ผˆDependency๏ผ‰

่ฟ่กŒQwen-1.8B-Chat-Int8๏ผŒ่ฏท็กฎไฟๆปก่ถณไธŠ่ฟฐ่ฆๆฑ‚๏ผŒๅ†ๆ‰ง่กŒไปฅไธ‹pipๅ‘ฝไปคๅฎ‰่ฃ…ไพ่ต–ๅบ“ใ€‚ๅฆ‚ๅฎ‰่ฃ…auto-gptq้‡ๅˆฐ้—ฎ้ข˜๏ผŒๆˆ‘ไปฌๅปบ่ฎฎๆ‚จๅˆฐๅฎ˜ๆ–นrepoๆœ็ดขๅˆ้€‚็š„้ข„็ผ–่ฏ‘wheelใ€‚

To run Qwen-1.8B-Chat-Int8, please make sure you meet the above requirements, and then execute the following pip commands to install the dependent libraries. If you meet problems installing auto-gptq, we advise you to check out the official repo to find a pre-build wheel.

pip install transformers==4.32.0 accelerate tiktoken einops scipy transformers_stream_generator==0.0.4 peft deepspeed
pip install auto-gptq optimum

ๅฆๅค–๏ผŒๆŽจ่ๅฎ‰่ฃ…flash-attentionๅบ“๏ผˆๅฝ“ๅ‰ๅทฒๆ”ฏๆŒflash attention 2๏ผ‰๏ผŒไปฅๅฎž็Žฐๆ›ด้ซ˜็š„ๆ•ˆ็އๅ’Œๆ›ดไฝŽ็š„ๆ˜พๅญ˜ๅ ็”จใ€‚

In addition, it is recommended to install the flash-attention library (we support flash attention 2 now.) for higher efficiency and lower memory usage.

git clone https://github.com/Dao-AILab/flash-attention
cd flash-attention && pip install .
# ไธ‹ๆ–นๅฎ‰่ฃ…ๅฏ้€‰๏ผŒๅฎ‰่ฃ…ๅฏ่ƒฝๆฏ”่พƒ็ผ“ๆ…ขใ€‚
# pip install csrc/layer_norm
# pip install csrc/rotary

Quickstart

Below we show an example of multi-turn interaction with Qwen-1.8B-Chat-Int8:

from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-1_8B-Chat-Int8", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen-1_8B-Chat-Int8",
    device_map="auto",
    trust_remote_code=True
).eval()
response, history = model.chat(tokenizer, "ไฝ ๅฅฝ", history=None)
print(response)
# ไฝ ๅฅฝ๏ผๅพˆ้ซ˜ๅ…ดไธบไฝ ๆไพ›ๅธฎๅŠฉใ€‚

# By adjusting the system prompt, Qwen-1.8B-Chat can realize role playing,
# language style transfer, task setting, behavior setting, and more.
response, _ = model.chat(tokenizer, "ไฝ ๅฅฝๅ‘€", history=None, system="่ฏท็”จไบŒๆฌกๅ…ƒๅฏ็ˆฑ่ฏญๆฐ”ๅ’Œๆˆ‘่ฏด่ฏ")
print(response)
# ไฝ ๅฅฝๅ•Š๏ผๆˆ‘ๆ˜ฏไธ€ๅชๅฏ็ˆฑ็š„ไบŒๆฌกๅ…ƒ็Œซๅ’ชๅ“ฆ๏ผŒไธ็Ÿฅ้“ไฝ ๆœ‰ไป€ไนˆ้—ฎ้ข˜้œ€่ฆๆˆ‘ๅธฎๅฟ™่งฃ็ญ”ๅ—๏ผŸ

response, _ = model.chat(tokenizer, "My colleague works diligently", history=None, system="You will write beautiful compliments according to needs")
print(response)
# Your colleague is an outstanding worker! Their dedication and hard work are truly inspiring. They always go above and beyond to ensure that 
# their tasks are completed on time and to the highest standard. I am lucky to have them as a colleague, and I know I can count on them to handle any challenge that comes their way.
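Under the hood, `chat` threads the conversation through the `history` argument, a list of (query, response) pairs, and serializes it together with the system prompt into a ChatML-style prompt. A minimal sketch of that serialization, assuming the standard `<|im_start|>`/`<|im_end|>` ChatML markers that Qwen uses; `build_chatml_prompt` is a hypothetical helper for illustration, not part of the released code:

```python
def build_chatml_prompt(query, history=None, system="You are a helpful assistant."):
    """Illustrative only: serialize a chat history into a ChatML-style prompt,
    similar to what Qwen's `chat` method builds internally."""
    parts = [f"<|im_start|>system\n{system}<|im_end|>"]
    for old_query, old_response in history or []:
        parts.append(f"<|im_start|>user\n{old_query}<|im_end|>")
        parts.append(f"<|im_start|>assistant\n{old_response}<|im_end|>")
    parts.append(f"<|im_start|>user\n{query}<|im_end|>")
    parts.append("<|im_start|>assistant\n")  # the model continues from here
    return "\n".join(parts)

prompt = build_chatml_prompt("ไฝ ๅฅฝ", history=[("ไฝ ๆ˜ฏ่ฐ๏ผŸ", "ๆˆ‘ๆ˜ฏ้€šไน‰ๅƒ้—ฎใ€‚")])
print(prompt)
```

To continue a conversation, pass the `history` returned by one `chat` call into the next call instead of `history=None`.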

For more usage instructions, please refer to our GitHub repo.

Tokenizer

Our tokenizer, based on tiktoken, is different from other tokenizers, e.g., the sentencepiece tokenizer. You need to pay attention to special tokens, especially during finetuning. For more detailed information on the tokenizer and its use in finetuning, please refer to the documentation.
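One reason special tokens need care: a tiktoken-style BPE maps a control string such as `<|endoftext|>` to a single id only when it is registered as special; otherwise the same characters are byte-pair encoded as ordinary text, which silently corrupts finetuning data. A toy sketch of that dispatch (illustrative only, not Qwen's actual implementation):

```python
import re

def encode_with_specials(text, special_tokens, encode_ordinary):
    """Toy sketch: split `text` on registered special tokens so each maps to a
    single id, and encode the ordinary spans in between with the BPE."""
    special_ids = {tok: i for i, tok in enumerate(special_tokens)}
    pattern = "(" + "|".join(re.escape(t) for t in special_tokens) + ")"
    ids = []
    for chunk in re.split(pattern, text):
        if chunk in special_ids:
            ids.append(("special", special_ids[chunk]))
        elif chunk:
            ids.extend(("ordinary", t) for t in encode_ordinary(chunk))
    return ids

# A fake ordinary encoder for demonstration: one "id" per character.
ids = encode_with_specials("hi<|endoftext|>bye", ["<|endoftext|>"], list)
print(ids)
```

Without the special-token split, the marker would be shredded into ordinary byte-pair pieces, which is exactly the failure mode to avoid in finetuning data.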

Quantization

Usage

่ฏทๆณจๆ„๏ผšๆˆ‘ไปฌๆ›ดๆ–ฐ้‡ๅŒ–ๆ–นๆกˆไธบๅŸบไบŽAutoGPTQ็š„้‡ๅŒ–๏ผŒๆไพ›Qwen-1.8B-Chat็š„Int8ๅŒ–ๆจกๅž‹็‚นๅ‡ป่ฟ™้‡Œใ€‚็›ธๆฏ”ๆญคๅ‰ๆ–นๆกˆ๏ผŒ่ฏฅๆ–นๆกˆๅœจๆจกๅž‹่ฏ„ๆต‹ๆ•ˆๆžœๅ‡ ไนŽๆ— ๆŸ๏ผŒไธ”ๅญ˜ๅ‚จ้œ€ๆฑ‚ๆ›ดไฝŽ๏ผŒๆŽจ็†้€Ÿๅบฆๆ›ดไผ˜ใ€‚

Note: we provide a new solution based on AutoGPTQ, and release an Int8 quantized model for Qwen-1.8B-Chat Click here, which achieves nearly lossless model effects but improved performance on both memory costs and inference speed, in comparison with the previous solution.

ไปฅไธ‹ๆˆ‘ไปฌๆไพ›็คบไพ‹่ฏดๆ˜Žๅฆ‚ไฝ•ไฝฟ็”จInt8้‡ๅŒ–ๆจกๅž‹ใ€‚ๅœจๅผ€ๅง‹ไฝฟ็”จๅ‰๏ผŒ่ฏทๅ…ˆไฟ่ฏๆปก่ถณ่ฆๆฑ‚๏ผˆๅฆ‚torch 2.0ๅŠไปฅไธŠ๏ผŒtransformers็‰ˆๆœฌไธบ4.32.0ๅŠไปฅไธŠ๏ผŒ็ญ‰็ญ‰๏ผ‰๏ผŒๅนถๅฎ‰่ฃ…ๆ‰€้œ€ๅฎ‰่ฃ…ๅŒ…๏ผš

Here we demonstrate how to use our provided quantized models for inference. Before you start, make sure you meet the requirements of auto-gptq (e.g., torch 2.0 and above, transformers 4.32.0 and above, etc.) and install the required packages:

pip install auto-gptq optimum

If you run into problems installing auto-gptq, we advise you to check out the official repo for a suitable pre-built wheel.

Then you can load the quantized model easily and run inference just as usual:

from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-1_8B-Chat-Int8", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen-1_8B-Chat-Int8",
    device_map="auto",
    trust_remote_code=True
).eval()
response, history = model.chat(tokenizer, "ไฝ ๅฅฝ", history=None)
print(response)

Performance

We evaluated the FP32 and BF16 versions of the original model, as well as the quantized Int8 and Int4 models, on benchmark datasets. The results are shown below:

| Quantization | MMLU | C-Eval (val) | GSM8K | HumanEval |
|:------------:|:----:|:------------:|:-----:|:---------:|
| FP32         | 43.4 | 57.0         | 33.0  | 26.8      |
| BF16         | 43.3 | 55.6         | 33.7  | 26.2      |
| Int8         | 43.1 | 55.8         | 33.0  | 27.4      |
| Int4         | 42.9 | 52.8         | 31.2  | 25.0      |

ๆŽจ็†้€Ÿๅบฆ (Inference Speed)

ๆˆ‘ไปฌๆต‹็ฎ—ไบ†FP32ใ€BF16็ฒพๅบฆๅ’ŒInt8ใ€Int4้‡ๅŒ–ๆจกๅž‹็”Ÿๆˆ2048ๅ’Œ8192ไธชtoken็š„ๅนณๅ‡ๆŽจ็†้€Ÿๅบฆใ€‚ๅฆ‚ๅ›พๆ‰€็คบ๏ผš

We measured the average inference speed of generating 2048 and 8192 tokens under FP32, BF16 precision and Int8, Int4 quantization level, respectively.

| Quantization | FlashAttn | Speed (2048 tokens) | Speed (8192 tokens) |
|:------------:|:---------:|:-------------------:|:-------------------:|
| FP32         | v2        | 52.96               | 47.35               |
| BF16         | v2        | 54.09               | 54.04               |
| Int8         | v2        | 55.56               | 55.62               |
| Int4         | v2        | 71.07               | 76.45               |
| FP32         | v1        | 52.00               | 45.80               |
| BF16         | v1        | 51.70               | 55.04               |
| Int8         | v1        | 53.16               | 53.33               |
| Int4         | v1        | 69.82               | 67.44               |
| FP32         | Disabled  | 52.28               | 44.95               |
| BF16         | Disabled  | 48.17               | 45.01               |
| Int8         | Disabled  | 52.16               | 52.99               |
| Int4         | Disabled  | 68.37               | 65.94               |

In detail, the profiling setting is generating 8192 new tokens with 1 context token. Profiling runs on a single A100-SXM4-80G GPU with PyTorch 2.0.1 and CUDA 11.4. The inference speed is averaged over the 8192 generated tokens.
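The tokens/s figures above are simply the number of newly generated tokens divided by the wall-clock generation time. A minimal sketch of that measurement; `generate_fn` is a hypothetical stand-in that would wrap `model.generate` in a real profiling run:

```python
import time

def tokens_per_second(generate_fn, n_tokens):
    """Time one generation call and return the average speed in tokens/s."""
    start = time.perf_counter()
    generate_fn(n_tokens)
    elapsed = time.perf_counter() - start
    return n_tokens / elapsed

# Dummy generator standing in for model.generate: ~1 ms per "token".
speed = tokens_per_second(lambda n: time.sleep(n * 0.001), 100)
print(f"{speed:.1f} tokens/s")
```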

GPU Memory Usage

We also profiled the peak GPU memory usage for generating 2048 tokens and 8192 tokens (with a single token as context) under FP32, BF16, Int8, and Int4, respectively. The results are shown below:

| Quantization Level | Peak Usage for Encoding 2048 Tokens | Peak Usage for Generating 8192 Tokens |
|:------------------:|:-----------------------------------:|:-------------------------------------:|
| FP32               | 8.45GB                              | 13.06GB                               |
| BF16               | 4.23GB                              | 6.48GB                                |
| Int8               | 3.48GB                              | 5.34GB                                |
| Int4               | 2.91GB                              | 4.80GB                                |

The above speed and memory profiling was conducted using this script.

Model

Qwen-1.8B-Chat shares the same architecture as the Qwen-1.8B pretrained model. The details are listed as follows:

| Hyperparameter  | Value  |
|:----------------|-------:|
| n_layers        | 24     |
| n_heads         | 16     |
| d_model         | 2048   |
| vocab size      | 151851 |
| sequence length | 8192   |

ๅœจไฝ็ฝฎ็ผ–็ ใ€FFNๆฟ€ๆดปๅ‡ฝๆ•ฐๅ’Œnormalization็š„ๅฎž็Žฐๆ–นๅผไธŠ๏ผŒๆˆ‘ไปฌไนŸ้‡‡็”จไบ†็›ฎๅ‰ๆœ€ๆต่กŒ็š„ๅšๆณ•๏ผŒ ๅณRoPE็›ธๅฏนไฝ็ฝฎ็ผ–็ ใ€SwiGLUๆฟ€ๆดปๅ‡ฝๆ•ฐใ€RMSNorm๏ผˆๅฏ้€‰ๅฎ‰่ฃ…flash-attentionๅŠ ้€Ÿ๏ผ‰ใ€‚

ๅœจๅˆ†่ฏๅ™จๆ–น้ข๏ผŒ็›ธๆฏ”็›ฎๅ‰ไธปๆตๅผ€ๆบๆจกๅž‹ไปฅไธญ่‹ฑ่ฏ่กจไธบไธป๏ผŒQwen-1.8B-Chatไฝฟ็”จไบ†็บฆ15ไธ‡tokenๅคงๅฐ็š„่ฏ่กจใ€‚ ่ฏฅ่ฏ่กจๅœจGPT-4ไฝฟ็”จ็š„BPE่ฏ่กจcl100k_baseๅŸบ็ก€ไธŠ๏ผŒๅฏนไธญๆ–‡ใ€ๅคš่ฏญ่จ€่ฟ›่กŒไบ†ไผ˜ๅŒ–๏ผŒๅœจๅฏนไธญใ€่‹ฑใ€ไปฃ็ ๆ•ฐๆฎ็š„้ซ˜ๆ•ˆ็ผ–่งฃ็ ็š„ๅŸบ็ก€ไธŠ๏ผŒๅฏน้ƒจๅˆ†ๅคš่ฏญ่จ€ๆ›ดๅŠ ๅ‹ๅฅฝ๏ผŒๆ–นไพฟ็”จๆˆทๅœจไธๆ‰ฉๅฑ•่ฏ่กจ็š„ๆƒ…ๅ†ตไธ‹ๅฏน้ƒจๅˆ†่ฏญ็ง่ฟ›่กŒ่ƒฝๅŠ›ๅขžๅผบใ€‚ ่ฏ่กจๅฏนๆ•ฐๅญ—ๆŒ‰ๅ•ไธชๆ•ฐๅญ—ไฝๅˆ‡ๅˆ†ใ€‚่ฐƒ็”จ่พƒไธบ้ซ˜ๆ•ˆ็š„tiktokenๅˆ†่ฏๅบ“่ฟ›่กŒๅˆ†่ฏใ€‚

For position encoding, FFN activation function, and normalization calculation methods, we adopt the prevalent practices, i.e., RoPE relative position encoding, SwiGLU for activation function, and RMSNorm for normalization (optional installation of flash-attention for acceleration).

For tokenization, compared to the current mainstream open-source models based on Chinese and English vocabularies, Qwen-1.8B-Chat uses a vocabulary of over 150K tokens. It first considers efficient encoding of Chinese, English, and code data, and is also more friendly to multilingual languages, enabling users to directly enhance the capability of some languages without expanding the vocabulary. It segments numbers by single digit, and calls the tiktoken tokenizer library for efficient tokenization.
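The per-digit number splitting mentioned above is a pre-tokenization rule: runs of digits are broken into single characters before BPE merges apply, so every number is spelled out of the ten digit tokens. A sketch of such a rule, assuming a simple regex-based pre-tokenizer (illustration only, not Qwen's actual code):

```python
import re

def split_digits(text):
    """Pre-tokenization sketch: break each run of digits into single-digit
    pieces, leaving all other spans intact."""
    pieces = []
    for span in re.split(r"(\d+)", text):
        if span.isdigit():
            pieces.extend(span)      # "2048" -> "2", "0", "4", "8"
        elif span:
            pieces.append(span)
    return pieces

print(split_digits("generate 2048 tokens"))
# ['generate ', '2', '0', '4', '8', ' tokens']
```

One motivation for this design choice is that numbers then have a single canonical segmentation, rather than depending on which multi-digit chunks happen to exist as BPE merges.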

Evaluation

For Qwen-1.8B-Chat, we evaluate the model on standard benchmarks covering Chinese understanding (C-Eval), English understanding (MMLU), coding (HumanEval), and mathematics (GSM8K), as well as long-context tasks. Since alignment gives Qwen-1.8B-Chat a strong ability to call external systems, we also evaluate its tool-use capability.

Note: Due to rounding errors caused by hardware and frameworks, small differences in reproduced results are normal.

Chinese Evaluation

C-Eval

We report the accuracy of Qwen-1.8B-Chat on the C-Eval validation set:

| Model                    | Acc. |
|:-------------------------|-----:|
| RedPajama-INCITE-Chat-3B | 18.3 |
| OpenBuddy-3B             | 23.5 |
| Firefly-Bloom-1B4        | 23.6 |
| OpenLLaMA-Chinese-3B     | 24.4 |
| LLaMA2-7B-Chat           | 31.9 |
| ChatGLM2-6B-Chat         | 52.6 |
| InternLM-7B-Chat         | 53.6 |
| Qwen-1.8B-Chat (0-shot)  | 55.6 |
| Qwen-7B-Chat (0-shot)    | 59.7 |
| Qwen-7B-Chat (5-shot)    | 59.3 |

The zero-shot accuracy of Qwen-1.8B-Chat on the C-Eval test set is provided below:

| Model                   | Avg. | STEM | Social Sciences | Humanities | Others |
|:------------------------|-----:|-----:|----------------:|-----------:|-------:|
| Chinese-Alpaca-Plus-13B | 41.5 | 36.6 | 49.7            | 43.1       | 41.2   |
| Chinese-Alpaca-2-7B     | 40.3 | -    | -               | -          | -      |
| ChatGLM2-6B-Chat        | 50.1 | 46.4 | 60.4            | 50.6       | 46.9   |
| Baichuan-13B-Chat       | 51.5 | 43.7 | 64.6            | 56.2       | 49.2   |
| Qwen-1.8B-Chat          | 53.8 | 48.4 | 68.0            | 56.5       | 48.3   |
| Qwen-7B-Chat            | 58.6 | 53.3 | 72.1            | 62.8       | 52.0   |

English Evaluation

MMLU

The accuracy of Qwen-1.8B-Chat on MMLU is provided below. Its performance remains near the top among human-aligned models of comparable size:

| Model                    | Acc. |
|:-------------------------|-----:|
| Firefly-Bloom-1B4        | 23.8 |
| OpenBuddy-3B             | 25.5 |
| RedPajama-INCITE-Chat-3B | 25.5 |
| OpenLLaMA-Chinese-3B     | 25.7 |
| ChatGLM2-6B-Chat         | 46.0 |
| LLaMA2-7B-Chat           | 46.2 |
| InternLM-7B-Chat         | 51.1 |
| Baichuan2-7B-Chat        | 52.9 |
| Qwen-1.8B-Chat (0-shot)  | 43.3 |
| Qwen-7B-Chat (0-shot)    | 55.8 |
| Qwen-7B-Chat (5-shot)    | 57.0 |

ไปฃ็ ่ฏ„ๆต‹๏ผˆCoding Evaluation๏ผ‰

Qwen-1.8B-ChatๅœจHumanEval็š„zero-shot Pass@1ๆ•ˆๆžœๅฆ‚ไธ‹

The zero-shot Pass@1 of Qwen-1.8B-Chat on HumanEval is demonstrated below

| Model                    | Pass@1 |
|:-------------------------|-------:|
| Firefly-Bloom-1B4        | 0.6    |
| OpenLLaMA-Chinese-3B     | 4.9    |
| RedPajama-INCITE-Chat-3B | 6.1    |
| OpenBuddy-3B             | 10.4   |
| ChatGLM2-6B-Chat         | 11.0   |
| LLaMA2-7B-Chat           | 12.2   |
| Baichuan2-7B-Chat        | 13.4   |
| InternLM-7B-Chat         | 14.6   |
| Qwen-1.8B-Chat           | 26.2   |
| Qwen-7B-Chat             | 37.2   |

Mathematics Evaluation

The accuracy of Qwen-1.8B-Chat on GSM8K, which evaluates mathematical ability, is shown below:

| Model                    | Acc. |
|:-------------------------|-----:|
| Firefly-Bloom-1B4        | 2.4  |
| RedPajama-INCITE-Chat-3B | 2.5  |
| OpenLLaMA-Chinese-3B     | 3.0  |
| OpenBuddy-3B             | 12.6 |
| LLaMA2-7B-Chat           | 26.3 |
| ChatGLM2-6B-Chat         | 28.8 |
| Baichuan2-7B-Chat        | 32.8 |
| InternLM-7B-Chat         | 33.0 |
| Qwen-1.8B-Chat (0-shot)  | 33.7 |
| Qwen-7B-Chat (0-shot)    | 50.3 |
| Qwen-7B-Chat (8-shot)    | 54.1 |

่ฏ„ๆต‹ๅค็Žฐ๏ผˆReproduction๏ผ‰

ๆˆ‘ไปฌๆไพ›ไบ†่ฏ„ๆต‹่„šๆœฌ๏ผŒๆ–นไพฟๅคงๅฎถๅค็Žฐๆจกๅž‹ๆ•ˆๆžœ๏ผŒ่ฏฆ่ง้“พๆŽฅใ€‚ๆ็คบ๏ผš็”ฑไบŽ็กฌไปถๅ’Œๆก†ๆžถ้€ ๆˆ็š„่ˆๅ…ฅ่ฏฏๅทฎ๏ผŒๅค็Žฐ็ป“ๆžœๅฆ‚ๆœ‰ๅฐๅน…ๆณขๅŠจๅฑžไบŽๆญฃๅธธ็Žฐ่ฑกใ€‚

We have provided evaluation scripts to reproduce the performance of our model, details as link.

FAQ

If you run into problems, please consult the FAQ and existing issues for a solution before opening a new issue.

Citation

If you find our work helpful, feel free to cite us!

@article{qwen,
  title={Qwen Technical Report},
  author={Jinze Bai and Shuai Bai and Yunfei Chu and Zeyu Cui and Kai Dang and Xiaodong Deng and Yang Fan and Wenbin Ge and Yu Han and Fei Huang and Binyuan Hui and Luo Ji and Mei Li and Junyang Lin and Runji Lin and Dayiheng Liu and Gao Liu and Chengqiang Lu and Keming Lu and Jianxin Ma and Rui Men and Xingzhang Ren and Xuancheng Ren and Chuanqi Tan and Sinan Tan and Jianhong Tu and Peng Wang and Shijie Wang and Wei Wang and Shengguang Wu and Benfeng Xu and Jin Xu and An Yang and Hao Yang and Jian Yang and Shusheng Yang and Yang Yao and Bowen Yu and Hongyi Yuan and Zheng Yuan and Jianwei Zhang and Xingxuan Zhang and Yichang Zhang and Zhenru Zhang and Chang Zhou and Jingren Zhou and Xiaohuan Zhou and Tianhang Zhu},
  journal={arXiv preprint arXiv:2309.16609},
  year={2023}
}

ไฝฟ็”จๅ่ฎฎ๏ผˆLicense Agreement๏ผ‰

ๆˆ‘ไปฌ็š„ไปฃ็ ๅ’Œๆจกๅž‹ๆƒ้‡ๅฏนๅญฆๆœฏ็ ”็ฉถๅฎŒๅ…จๅผ€ๆ”พใ€‚่ฏทๆŸฅ็œ‹LICENSEๆ–‡ไปถไบ†่งฃๅ…ทไฝ“็š„ๅผ€ๆบๅ่ฎฎ็ป†่Š‚ใ€‚ๅฆ‚้œ€ๅ•†็”จ๏ผŒ่ฏท่”็ณปๆˆ‘ไปฌใ€‚

Our code and checkpoints are open to research purpose. Check the LICENSE for more details about the license. For commercial use, please contact us.

Contact Us

If you would like to leave a message for our research or product teams, you are welcome to join our WeChat, DingTalk, or Discord groups! You can also reach us by email at qianwen_opensource@alibabacloud.com.
