yukiwayx committed
Commit 92ae4d5 (verified)
1 Parent(s): f3883ee

Update README.md

Files changed (1):
README.md (+1 -1)
README.md CHANGED
@@ -10,10 +10,10 @@ language:
 <p align="left">
 <strong>Technical report (coming soon)</strong> ·
 <a href="https://github.com/Tencent-BAC/FastMTP"><strong>Github</strong></a> ·
+<a href="https://huggingface.co/TencentBAC/FastMTP"><strong>HuggingFace</strong></a> ·
 <a href="https://modelscope.cn/models/TencentBAC/FastMTP"><strong>ModelScope</strong></a>
 </p>
 
-
 ## Overview
 
 FastMTP is a simple yet effective method that enhances Multi-Token Prediction (MTP) for speculative decoding during inference. Our approach fine-tunes a single MTP head with shared weights across multiple causal draft steps, enabling it to capture longer-range dependencies and achieve higher acceptance rates in speculative decoding. By incorporating language-aware vocabulary compression, we further reduce computational overhead during draft generation. Experimental results across diverse benchmarks demonstrate that FastMTP achieves an average of 2.03× speedup over vanilla next token prediction while maintaining lossless output quality. With low training cost and seamless integration into existing inference frameworks, FastMTP offers a practical and rapidly deployable solution for accelerating LLM inference.
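
For readers skimming this commit, the Overview maps to a short draft-and-verify loop: a single shared draft head proposes several future tokens, the target model verifies them, and only the longest agreeing prefix is kept, which is what keeps the output lossless. Below is a minimal, self-contained sketch of that loop. Everything in it (`base_model_next`, `mtp_head_next`, the toy vocabulary, `k`) is a hypothetical stand-in for illustration, not FastMTP's actual API.

```python
# A minimal sketch of greedy draft-and-verify speculative decoding with a
# single shared draft head, in the spirit of the Overview above. All names
# here are hypothetical stand-ins, not FastMTP's actual implementation.

VOCAB_SIZE = 16  # toy vocabulary


def base_model_next(ctx):
    """Stand-in for the target model's greedy next-token choice."""
    return (sum(ctx) * 7 + 3) % VOCAB_SIZE


def mtp_head_next(ctx):
    """Stand-in for the single MTP head with shared weights, applied
    recursively so the same head drafts step t+1, t+2, ..., t+k.
    Deliberately imperfect so some drafts get rejected."""
    bump = 1 if len(ctx) % 5 == 0 else 0
    return (sum(ctx) * 7 + 3 + bump) % VOCAB_SIZE


def speculative_step(ctx, k=4):
    """Draft k tokens with the shared head, then verify against the
    target model and keep the longest agreeing prefix. Every kept token
    is the target model's own greedy choice, so the output matches
    vanilla next-token prediction exactly (lossless); the speedup comes
    from a high acceptance rate."""
    draft, d_ctx = [], list(ctx)
    for _ in range(k):            # causal draft steps with one shared head
        t = mtp_head_next(d_ctx)
        draft.append(t)
        d_ctx.append(t)

    accepted, v_ctx = [], list(ctx)
    for t in draft:               # verification pass by the target model
        target = base_model_next(v_ctx)
        accepted.append(target)   # always emit the target's token
        v_ctx.append(target)
        if t != target:           # first mismatch: discard remaining drafts
            break
    return accepted


ctx = [1, 2, 3]
for step in range(5):
    out = speculative_step(ctx)
    print(f"step {step}: accepted {len(out)} token(s): {out}")
    ctx.extend(out)
```

In the real system, the draft head would additionally restrict its output distribution to a compressed, language-aware vocabulary to cut the cost of each draft step; that detail is omitted from this sketch.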