What is DeepSeek-V4-Flash-4Expert?
DeepSeek-V4-Flash is a 284B-parameter Mixture-of-Experts (MoE) language model with 13B activated parameters, supporting a context length of one million tokens. The original model uses num_experts_per_tok=6 by default.
cloudyu/DeepSeek-V4-Flash-4Expert is the same model with the number of activated experts per token reduced from 6 → 4, while keeping all other weights identical. This change:
- Reduces routed-expert compute per token by ~33% (4 active experts per forward pass instead of 6)
- Cuts generation time by ~8–11% in the benchmarks below
- Matches or improves overall accuracy on the code generation and knowledge benchmarks below
- Uses the same FP4 + FP8 mixed precision format as the original
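The ~33% figure follows from the expert count alone, while the activated-parameter count quoted in this card drops by a smaller fraction. The short calculation below is only arithmetic on the numbers stated here (6 and 4 experts, roughly 13B and 11B activated parameters); it is not a measurement of actual FLOPs.

```python
# Figures quoted in this model card (parameter counts are approximate, in billions).
experts_before, experts_after = 6, 4
active_params_before_b, active_params_after_b = 13.0, 11.0


def relative_reduction(before, after):
    """Fractional reduction when going from `before` to `after`."""
    return (before - after) / before


routed_expert_reduction = relative_reduction(experts_before, experts_after)
active_param_reduction = relative_reduction(active_params_before_b, active_params_after_b)

print(f"routed experts per token: -{routed_expert_reduction:.1%}")
print(f"activated parameters:     -{active_param_reduction:.1%}")
```

Under these numbers the routed-expert count falls by about a third, but the activated parameters fall by about 15%, so an end-to-end saving well below 33% is to be expected.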
Key Changes from Original
| Configuration | Original (top_k=6) | This Model (top_k=4) |
|---|---|---|
| num_experts_per_tok | 6 | 4 |
| Activated params | ~13B | ~11B |
| Total params | 284B | 284B |
| Routing method | noaux_tc | noaux_tc |
| All other weights | identical | identical |
The tid2eid (expert routing) weight tensors have been truncated from [vocab_size, 6] to [vocab_size, 4]: only the first 4 columns are retained, in the same order as in the original checkpoint. No additional training or fine-tuning was performed; this is purely an inference-time configuration change.
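As an illustration of the column truncation described above, here is a minimal sketch on a toy table. The helper name `keep_first_columns` and the example values are hypothetical, and plain Python lists stand in for the real tensors; the actual conversion script may differ.

```python
def keep_first_columns(table, k):
    """Return a copy of a 2-D table (list of rows) with only the first k columns."""
    if any(len(row) < k for row in table):
        raise ValueError("every row must have at least k columns")
    return [row[:k] for row in table]


# Toy stand-in for tid2eid: 3 token ids, 6 expert ids each (values are made up).
tid2eid = [
    [7, 2, 9, 4, 1, 5],
    [0, 3, 8, 6, 2, 7],
    [5, 1, 4, 9, 3, 0],
]

truncated = keep_first_columns(tid2eid, 4)
print(truncated)  # each row keeps its first 4 entries, in the original order
```

The original table is left untouched, which mirrors the point that no weights are retrained: the change only drops the last two columns.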
Independent Evaluation Results
We evaluated the model against the original top_k=6 configuration on two benchmarks: HumanEval (code generation) and MMLU-Pro (multi-domain knowledge).
HumanEval (Pass@1)
| Configuration | Pass@1 | Generation Time |
|---|---|---|
| Top_k=4 (this model) | 95.73% (157/164) | 56.83s |
| Top_k=6 (original) | 95.73% (157/164) | 64.06s |
- Identical accuracy on code generation
- 11.3% shorter generation time (a 1.127× speedup)
MMLU-Pro (Accuracy)
| Configuration | Accuracy | Generation Time |
|---|---|---|
| Top_k=4 (this model) | 41.46% (4988/12032) | 78.24s |
| Top_k=6 (original) | 37.77% (4545/12032) | 85.16s |
- +3.69 percentage points higher accuracy
- 8.1% shorter generation time (a 1.088× speedup)
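"Faster" can be read in two ways: as the fraction of time saved, or as the ratio of the two times. The sketch below computes both from the generation times in the two tables above, so the percentages can be compared on the same footing. The function names are illustrative only.

```python
def time_saved(baseline_s, new_s):
    """Fraction of the baseline generation time that is saved."""
    return (baseline_s - new_s) / baseline_s


def speedup(baseline_s, new_s):
    """Ratio of baseline time to new time (1.0 means no change)."""
    return baseline_s / new_s


# (top_k=6 seconds, top_k=4 seconds), taken from the tables above.
runs = {"HumanEval": (64.06, 56.83), "MMLU-Pro": (85.16, 78.24)}

for name, (t6, t4) in runs.items():
    print(f"{name}: {time_saved(t6, t4):.1%} less time, {speedup(t6, t4):.3f}x speedup")
```

For HumanEval this gives 11.3% less time and a 1.127x speedup; for MMLU-Pro, 8.1% less time and a 1.088x speedup.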
Category Breakdown (MMLU-Pro)
| Category | top_k=4 | top_k=6 | Delta |
|---|---|---|---|
| biology | 68.62% | 72.66% | −4.04pp |
| business | 39.04% | 21.67% | +17.36pp |
| chemistry | 14.58% | 7.16% | +7.42pp |
| computer science | 47.80% | 44.63% | +3.17pp |
| economics | 66.35% | 65.05% | +1.30pp |
| engineering | 25.39% | 13.21% | +12.18pp |
| health | 59.54% | 63.08% | −3.55pp |
| history | 50.13% | 59.58% | −9.45pp |
| law | 33.51% | 35.88% | −2.36pp |
| math | 28.13% | 15.47% | +12.66pp |
| other | 55.09% | 56.71% | −1.62pp |
| philosophy | 53.91% | 55.71% | −1.80pp |
| physics | 20.32% | 14.55% | +5.77pp |
| psychology | 69.17% | 71.93% | −2.76pp |
The STEM and business categories (math, engineering, business, chemistry, physics, computer science) improve by 3.17 to 17.36 percentage points with top_k=4, while the humanities and life sciences categories regress by 1.80 to 9.45 percentage points.
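The per-category picture can be recomputed directly from the table above. This sketch uses the rounded percentages as printed, so a recomputed delta may differ from the table's Delta column by 0.01 where the table was derived from unrounded scores.

```python
# category: (top_k=4 accuracy %, top_k=6 accuracy %), copied from the table above.
scores = {
    "biology": (68.62, 72.66),
    "business": (39.04, 21.67),
    "chemistry": (14.58, 7.16),
    "computer science": (47.80, 44.63),
    "economics": (66.35, 65.05),
    "engineering": (25.39, 13.21),
    "health": (59.54, 63.08),
    "history": (50.13, 59.58),
    "law": (33.51, 35.88),
    "math": (28.13, 15.47),
    "other": (55.09, 56.71),
    "philosophy": (53.91, 55.71),
    "physics": (20.32, 14.55),
    "psychology": (69.17, 71.93),
}

# Delta in percentage points: positive means top_k=4 scored higher.
deltas = {cat: round(k4 - k6, 2) for cat, (k4, k6) in scores.items()}

# Largest gain first, then largest regression first.
gains = sorted((c for c, d in deltas.items() if d > 0), key=deltas.get, reverse=True)
losses = sorted((c for c, d in deltas.items() if d < 0), key=deltas.get)

print("improved with top_k=4: ", gains)
print("regressed with top_k=4:", losses)
```

The split is even, with seven categories improving and seven regressing; the overall accuracy gain comes from the improvements being larger in magnitude.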
Summary
- Top_k=4 matches or beats top_k=6 on every aggregate metric reported here: equal HumanEval Pass@1, higher overall MMLU-Pro accuracy, shorter generation time, and, with fewer experts read per token, lower expected memory bandwidth usage
- The gains are largest in business (+17.36pp), math (+12.66pp), and engineering (+12.18pp)
- The original top_k=6 configuration scores higher in seven categories, mainly humanities and life sciences, by 1.6 to 9.5 percentage points
- For production deployment, top_k=4 is the recommended configuration
Full evaluation reports, scripts, and raw results are available in the eval/ directory of this repository.
Model Downloads
| Model | #Total Params | #Activated Params | Context Length | Precision | Download |
|---|---|---|---|---|---|
| DeepSeek-V4-Flash (original) | 284B | 13B (top_k=6) | 1M | FP4 + FP8 Mixed | HuggingFace |
| DeepSeek-V4-Flash-4Expert (this) | 284B | ~11B (top_k=4) | 1M | FP4 + FP8 Mixed | HuggingFace |
Chat Template
This release does not include a Jinja-format chat template. Instead, we provide a dedicated encoding folder with Python scripts and test cases demonstrating how to encode messages in OpenAI-compatible format into input strings for the model, and how to parse the model's text output. Please refer to the encoding folder for full documentation.
A brief example:
```python
import transformers

# encoding_dsv4 is provided in the encoding folder of this repository.
from encoding_dsv4 import encode_messages, parse_message_from_completion_text

messages = [
    {"role": "user", "content": "hello"},
    {"role": "assistant", "content": "Hello! I am DeepSeek.", "reasoning_content": "thinking..."},
    {"role": "user", "content": "1+1=?"},
]

# messages -> string
prompt = encode_messages(messages, thinking_mode="thinking")

# string -> tokens
tokenizer = transformers.AutoTokenizer.from_pretrained("cloudyu/DeepSeek-V4-Flash-4Expert")
tokens = tokenizer.encode(prompt)
```
How to Run Locally
Please refer to the inference folder for detailed instructions on running DeepSeek-V4 locally, including model weight conversion and interactive chat demos.
For local deployment, we recommend setting the sampling parameters to temperature = 1.0, top_p = 1.0. For the Think Max reasoning mode, we recommend setting the context window to at least 384K tokens.
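To make the recommended sampling parameters concrete, here is a generic sketch of how temperature and top_p are commonly defined. It is not the code from the inference folder, and the function name and toy logits are hypothetical. It shows that temperature = 1.0 with top_p = 1.0 samples from the model's unmodified softmax distribution, whereas a lower top_p discards the low-probability tail.

```python
import math


def sampling_distribution(logits, temperature=1.0, top_p=1.0):
    """Token probabilities after temperature scaling and nucleus (top-p) filtering."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Keep the smallest set of most likely tokens whose cumulative mass reaches top_p.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = set(), 0.0
    for i in order:
        kept.add(i)
        mass += probs[i]
        if mass >= top_p:
            break

    # Renormalize over the kept tokens; dropped tokens get probability zero.
    kept_mass = sum(probs[i] for i in kept)
    return [probs[i] / kept_mass if i in kept else 0.0 for i in range(len(probs))]


logits = [2.0, 1.0, 0.1]  # toy values
recommended = sampling_distribution(logits, temperature=1.0, top_p=1.0)
truncated = sampling_distribution(logits, temperature=1.0, top_p=0.5)

print(recommended)  # every token keeps a nonzero probability
print(truncated)    # only the most likely token survives
```

With the recommended settings no token is filtered out, so all of the sampling diversity comes from the model itself.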
License
This repository and the model weights are licensed under the MIT License.
Contact
If you have any questions, please raise an issue or contact cloudyu on HuggingFace.