What is DeepSeek-V4-Flash-4Expert?

DeepSeek-V4-Flash is a 284B-parameter Mixture-of-Experts (MoE) language model with 13B activated parameters and a context length of one million tokens. The original model uses num_experts_per_tok=6 by default.

cloudyu/DeepSeek-V4-Flash-4Expert is the same model with the number of activated experts per token reduced from 6 to 4, with all other weights kept identical. This change:

  • Reduces routed-expert compute by ~33% (4 instead of 6 active experts per token)
  • Shortens generation time by ~8–11% in the benchmark runs reported in this README
  • Matches the original on HumanEval and improves overall MMLU-Pro accuracy, with mixed per-category results
  • Uses the same FP4 + FP8 mixed precision format as the original
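
The ~33% figure counts active experts per token only. As a quick check, the sketch below recomputes it from the numbers quoted in this README and compares it with the change in activated parameters (~13B to ~11B). The helper name is illustrative, and nothing here is measured.

```python
# Numbers quoted in this README; nothing here is measured.
experts_original, experts_reduced = 6, 4
active_params_original_b, active_params_reduced_b = 13.0, 11.0  # billions, approximate


def relative_reduction(before: float, after: float) -> float:
    """Fraction by which `after` is smaller than `before`."""
    return (before - after) / before


expert_cut = relative_reduction(experts_original, experts_reduced)
param_cut = relative_reduction(active_params_original_b, active_params_reduced_b)

print(f"active experts per token: -{expert_cut:.1%}")
print(f"activated parameters:     -{param_cut:.1%}")
```

The activated-parameter count falls by a smaller fraction than the expert count, so the 33% figure should be read as a reduction in expert compute rather than in total per-token compute.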

Key Changes from Original

| Configuration | Original (top_k=6) | This Model (top_k=4) |
|---|---|---|
| num_experts_per_tok | 6 | 4 |
| Activated params | ~13B | ~11B |
| Total params | 284B | 284B |
| Routing method | noaux_tc | noaux_tc |
| All other weights | identical | identical |
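
To make the num_experts_per_tok setting concrete, here is a minimal, generic top-k gating sketch in plain Python. It is not the noaux_tc router used by the model: the function name `route_top_k`, the toy scores, and the simple renormalization are all illustrative assumptions. It only shows that lowering k selects fewer experts per token.

```python
def route_top_k(scores: list[float], k: int) -> list[tuple[int, float]]:
    """Pick the k highest-scoring experts and renormalize their weights.

    Generic illustration only; the real noaux_tc router differs.
    """
    if not 0 < k <= len(scores):
        raise ValueError("k must be between 1 and the number of experts")
    # Indices of the k largest scores, highest first.
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    total = sum(scores[i] for i in top)
    return [(i, scores[i] / total) for i in top]


# Hypothetical router scores for one token over 8 experts.
toy_scores = [0.05, 0.30, 0.11, 0.20, 0.02, 0.15, 0.08, 0.09]

routed_6 = route_top_k(toy_scores, k=6)
routed_4 = route_top_k(toy_scores, k=4)
print([i for i, _ in routed_6])
print([i for i, _ in routed_4])
```

In this toy setup the four experts chosen with k=4 are the first four of the six chosen with k=6, so the smaller setting drops the two lowest-ranked experts and spreads the weight over the rest.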

The tid2eid (expert routing) tensors have been truncated from [vocab_size, 6] to [vocab_size, 4]: only the first 4 columns are retained, in the same order as in the original model. No additional training or fine-tuning was performed; apart from this truncation, the change is purely an inference-time configuration change.
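
The column truncation can be sketched with a toy table. This is an illustration in plain Python with made-up expert IDs, not the conversion script used for this release; with a tensor library the same step would typically be a column slice such as `table[:, :4]`.

```python
def keep_first_columns(table: list[list[int]], k: int) -> list[list[int]]:
    """Return a copy of `table` with only the first k columns of each row."""
    if any(len(row) < k for row in table):
        raise ValueError("every row must have at least k columns")
    return [row[:k] for row in table]


# Toy stand-in for a [vocab_size, 6] table: 3 "tokens", 6 expert IDs each.
# The IDs are made up for illustration.
toy_tid2eid = [
    [12, 7, 40, 3, 25, 18],
    [5, 33, 9, 21, 14, 2],
    [30, 1, 16, 8, 27, 11],
]

truncated = keep_first_columns(toy_tid2eid, k=4)
print(truncated[0])  # [12, 7, 40, 3]
```

The original table is left untouched, and the order of the retained columns is unchanged.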

Independent Evaluation Results

We evaluated the model against the original top_k=6 configuration on two benchmarks: HumanEval (code generation) and MMLU-Pro (multi-domain knowledge).

HumanEval (Pass@1)

| Configuration | Pass@1 | Generation Time |
|---|---|---|
| Top_k=4 (this model) | 95.73% (157/164) | 56.83s |
| Top_k=6 (original) | 95.73% (157/164) | 64.06s |

  • Identical accuracy on code generation
  • 11.3% shorter generation time (12.7% higher throughput)

MMLU-Pro (Accuracy)

| Configuration | Accuracy | Generation Time |
|---|---|---|
| Top_k=4 (this model) | 41.46% (4988/12032) | 78.24s |
| Top_k=6 (original) | 37.77% (4545/12032) | 85.16s |

  • +3.69 percentage points higher accuracy
  • 8.1% shorter generation time (8.8% higher throughput)
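
Two different timing ratios can be read off these results, so it helps to compute both. The sketch below uses only the counts and times reported for the two benchmarks; the helper names are illustrative.

```python
def accuracy(correct: int, total: int) -> float:
    """Fraction of items answered correctly."""
    return correct / total


def time_reduction(t_new: float, t_old: float) -> float:
    """Fraction of wall-clock time saved by the new configuration."""
    return 1.0 - t_new / t_old


def speedup(t_new: float, t_old: float) -> float:
    """Relative throughput gain for the same workload."""
    return t_old / t_new - 1.0


# (correct, total, seconds) as reported in this README.
runs = {
    "HumanEval": {"top_k=4": (157, 164, 56.83), "top_k=6": (157, 164, 64.06)},
    "MMLU-Pro": {"top_k=4": (4988, 12032, 78.24), "top_k=6": (4545, 12032, 85.16)},
}

for name, cfg in runs.items():
    c4, n4, t4 = cfg["top_k=4"]
    c6, n6, t6 = cfg["top_k=6"]
    delta_pp = 100 * (accuracy(c4, n4) - accuracy(c6, n6))
    print(f"{name}: accuracy delta {delta_pp:+.2f} pp, "
          f"time saved {time_reduction(t4, t6):.1%}, "
          f"speedup {speedup(t4, t6):.1%}")
```

Time saved and speedup are not interchangeable: for HumanEval the same pair of timings gives about 11.3% less time but about 12.7% more throughput. From the raw counts the MMLU-Pro gap is 443 questions, or about 3.68 points; subtracting the rounded percentages gives 3.69.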

Category Breakdown (MMLU-Pro)

| Category | top_k=4 | top_k=6 | Delta |
|---|---|---|---|
| biology | 68.62% | 72.66% | −4.04pp |
| business | 39.04% | 21.67% | +17.36pp |
| chemistry | 14.58% | 7.16% | +7.42pp |
| computer science | 47.80% | 44.63% | +3.17pp |
| economics | 66.35% | 65.05% | +1.30pp |
| engineering | 25.39% | 13.21% | +12.18pp |
| health | 59.54% | 63.08% | −3.55pp |
| history | 50.13% | 59.58% | −9.45pp |
| law | 33.51% | 35.88% | −2.36pp |
| math | 28.13% | 15.47% | +12.66pp |
| other | 55.09% | 56.71% | −1.62pp |
| philosophy | 53.91% | 55.71% | −1.80pp |
| physics | 20.32% | 14.55% | +5.77pp |
| psychology | 69.17% | 71.93% | −2.76pp |

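STEM and business categories (math, engineering, business, chemistry, physics, computer science) gain between 3.17 and 17.36 percentage points with top_k=4, and economics gains 1.30 points. The remaining seven categories, mostly humanities, health, and life sciences, regress by 1.62 to 9.45 points, with history showing the largest drop.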
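
The per-category picture can be summarized with a few lines of Python. The sketch below copies the rounded percentages from the category table and recomputes the deltas, so individual values may differ from the table by 0.01 where the table was computed from unrounded scores.

```python
# Per-category MMLU-Pro accuracy (%), copied from the table: (top_k=4, top_k=6).
categories = {
    "biology": (68.62, 72.66),
    "business": (39.04, 21.67),
    "chemistry": (14.58, 7.16),
    "computer science": (47.80, 44.63),
    "economics": (66.35, 65.05),
    "engineering": (25.39, 13.21),
    "health": (59.54, 63.08),
    "history": (50.13, 59.58),
    "law": (33.51, 35.88),
    "math": (28.13, 15.47),
    "other": (55.09, 56.71),
    "philosophy": (53.91, 55.71),
    "physics": (20.32, 14.55),
    "psychology": (69.17, 71.93),
}

# Positive delta means top_k=4 scored higher than top_k=6.
deltas = {name: k4 - k6 for name, (k4, k6) in categories.items()}
improved = sorted(name for name, d in deltas.items() if d > 0)
regressed = sorted(name for name, d in deltas.items() if d < 0)
largest_gain = max(deltas, key=deltas.get)
largest_drop = min(deltas, key=deltas.get)

print(f"improved ({len(improved)}): {', '.join(improved)}")
print(f"regressed ({len(regressed)}): {', '.join(regressed)}")
print(f"largest gain: {largest_gain} ({deltas[largest_gain]:+.2f} pp)")
print(f"largest drop: {largest_drop} ({deltas[largest_drop]:+.2f} pp)")
```

The split is even, seven categories up and seven down, which is worth keeping in mind when reading the overall accuracy figure.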

Summary

  • In these runs, top_k=4 matches or beats top_k=6 on overall accuracy, generates faster, and reads fewer expert weights per token (lower memory bandwidth use)
  • The improvement is most pronounced on math, engineering, and business reasoning tasks
  • The original top_k=6 configuration keeps an advantage of 1.6 to 9.5 points in humanities, health, and life sciences categories
  • For production deployment, top_k=4 is the recommended configuration

Full evaluation reports, scripts, and raw results are available in the eval/ directory of this repository.

Model Downloads

| Model | #Total Params | #Activated Params | Context Length | Precision | Download |
|---|---|---|---|---|---|
| DeepSeek-V4-Flash (original) | 284B | 13B (top_k=6) | 1M | FP4 + FP8 Mixed | HuggingFace |
| DeepSeek-V4-Flash-4Expert (this) | 284B | ~11B (top_k=4) | 1M | FP4 + FP8 Mixed | HuggingFace |

Chat Template

This release does not include a Jinja-format chat template. Instead, we provide a dedicated encoding folder with Python scripts and test cases demonstrating how to encode messages in OpenAI-compatible format into input strings for the model, and how to parse the model's text output. Please refer to the encoding folder for full documentation.

A brief example:

import transformers

# Helpers shipped in the encoding folder of this repository.
# parse_message_from_completion_text is the counterpart for parsing model output.
from encoding_dsv4 import encode_messages, parse_message_from_completion_text

# Conversation in OpenAI-compatible message format
messages = [
    {"role": "user", "content": "hello"},
    {"role": "assistant", "content": "Hello! I am DeepSeek.", "reasoning_content": "thinking..."},
    {"role": "user", "content": "1+1=?"}
]

# messages -> prompt string
prompt = encode_messages(messages, thinking_mode="thinking")

# prompt string -> token ids
tokenizer = transformers.AutoTokenizer.from_pretrained("cloudyu/DeepSeek-V4-Flash-4E")
tokens = tokenizer.encode(prompt)

How to Run Locally

Please refer to the inference folder for detailed instructions on running DeepSeek-V4 locally, including model weight conversion and interactive chat demos.

For local deployment, we recommend setting the sampling parameters to temperature = 1.0, top_p = 1.0. For the Think Max reasoning mode, we recommend setting the context window to at least 384K tokens.
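
For readers unfamiliar with these settings, the sketch below shows what temperature = 1.0 and top_p = 1.0 mean in a generic sampler: logits are left unscaled and no tokens are removed by nucleus filtering. It is a plain-Python illustration with made-up logits and a hypothetical function name, not the sampling code from the inference folder.

```python
import math


def next_token_probs(logits: list[float], temperature: float = 1.0,
                     top_p: float = 1.0) -> list[float]:
    """Temperature-scaled softmax followed by nucleus (top-p) filtering."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Keep the smallest set of most-likely tokens whose mass reaches top_p.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = set(), 0.0
    for i in order:
        kept.add(i)
        mass += probs[i]
        if mass >= top_p:
            break
    filtered = [p if i in kept else 0.0 for i, p in enumerate(probs)]
    norm = sum(filtered)
    return [p / norm for p in filtered]


toy_logits = [2.0, 1.0, 0.5, -1.0]  # made-up values
recommended = next_token_probs(toy_logits, temperature=1.0, top_p=1.0)
nucleus_08 = next_token_probs(toy_logits, temperature=1.0, top_p=0.8)
print(recommended)
print(nucleus_08)
```

With the recommended settings every token keeps a nonzero probability, whereas a lower top_p (0.8 in the second call) zeroes out the least likely tokens before sampling.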

License

This repository and the model weights are licensed under the MIT License.

Contact

If you have any questions, please raise an issue or contact cloudyu on HuggingFace.



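DeepSeek-V4-Flash-4E Chinese README (translated)

A variant of DeepSeek-V4-Flash that lowers top_k from 6 to 4 for more efficient inference.

HuggingFace: cloudyu/DeepSeek-V4-Flash-4E

Overview

DeepSeek-V4-Flash is a 284B-parameter Mixture-of-Experts (MoE) language model with 13B activated parameters and a context length of one million tokens. The original model uses num_experts_per_tok=6 by default.

DeepSeek-V4-Flash-4E reduces the number of activated experts per token from 6 to 4 and keeps all other weights unchanged. This change:

  • Reduces routed-expert compute by ~33% (fewer activated experts)
  • Shortens generation time by ~8–11%
  • Keeps accuracy the same or higher overall
  • Keeps the original FP4 + FP8 mixed precision format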

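Key Differences from the Original Model

| Configuration | Original (top_k=6) | This Model (top_k=4) |
|---|---|---|
| num_experts_per_tok | 6 | 4 |
| Activated params | ~13B | ~11B |
| Total params | 284B | 284B |
| Routing method | noaux_tc | noaux_tc |
| All other weights | identical | identical |

The tid2eid (expert routing) tensors have been truncated from [vocab_size, 6] to [vocab_size, 4]: only the first 4 columns are retained, in the same order as in the original model. No additional training or fine-tuning was performed; this is purely an inference-time configuration change.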

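Independent Evaluation Results

We compared the two configurations on two benchmarks: HumanEval (code generation) and MMLU-Pro (multi-domain knowledge question answering).

HumanEval (Pass@1)

| Configuration | Pass@1 | Generation Time |
|---|---|---|
| Top_k=4 (this model) | 95.73% (157/164) | 56.83s |
| Top_k=6 (original) | 95.73% (157/164) | 64.06s |

  • Identical accuracy on code generation
  • 11.3% shorter generation time (12.7% higher throughput)

MMLU-Pro (Accuracy)

| Configuration | Accuracy | Generation Time |
|---|---|---|
| Top_k=4 (this model) | 41.46% (4988/12032) | 78.24s |
| Top_k=6 (original) | 37.77% (4545/12032) | 85.16s |

  • 3.69 percentage points higher accuracy
  • 8.1% shorter generation time (8.8% higher throughput)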

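MMLU-Pro Category Comparison

| Category | top_k=4 | top_k=6 | Delta |
|---|---|---|---|
| biology | 68.62% | 72.66% | −4.04pp |
| business | 39.04% | 21.67% | +17.36pp |
| chemistry | 14.58% | 7.16% | +7.42pp |
| computer science | 47.80% | 44.63% | +3.17pp |
| economics | 66.35% | 65.05% | +1.30pp |
| engineering | 25.39% | 13.21% | +12.18pp |
| health | 59.54% | 63.08% | −3.55pp |
| history | 50.13% | 59.58% | −9.45pp |
| law | 33.51% | 35.88% | −2.36pp |
| math | 28.13% | 15.47% | +12.66pp |
| other | 55.09% | 56.71% | −1.62pp |
| philosophy | 53.91% | 55.71% | −1.80pp |
| physics | 20.32% | 14.55% | +5.77pp |
| psychology | 69.17% | 71.93% | −2.76pp |

STEM and business categories (math, engineering, business, chemistry, physics, computer science) improve markedly with top_k=4, while humanities and life sciences categories regress, by up to 9.45 points for history.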

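Summary

  • In these runs, top_k=4 matches or beats top_k=6 on overall accuracy, generates faster, and uses less memory bandwidth
  • The advantage is most pronounced on math, engineering, and business reasoning tasks
  • The original top_k=6 keeps an advantage only in humanities and life sciences categories
  • For production deployment, top_k=4 is the recommended configuration

Full evaluation reports, scripts, and raw data are in the eval/ directory of this repository.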

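Model Downloads

| Model | Total Params | Activated Params | Context Length | Precision | Download |
|---|---|---|---|---|---|
| DeepSeek-V4-Flash (original) | 284B | 13B (top_k=6) | 1M | FP4 + FP8 Mixed | HuggingFace |
| DeepSeek-V4-Flash-4E (this model) | 284B | ~11B (top_k=4) | 1M | FP4 + FP8 Mixed | HuggingFace |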

Chat Template

This repository does not include a Jinja-format chat template. Instead, it provides a dedicated encoding folder with Python scripts and test cases showing how to encode OpenAI-compatible messages into model input and how to parse model output. See the encoding folder for full documentation.

A brief example:

import transformers

from encoding_dsv4 import encode_messages, parse_message_from_completion_text

messages = [
    {"role": "user", "content": "hello"},
    {"role": "assistant", "content": "Hello! I am DeepSeek.", "reasoning_content": "thinking..."},
    {"role": "user", "content": "1+1=?"}
]

# messages -> prompt string
prompt = encode_messages(messages, thinking_mode="thinking")

# prompt string -> token ids
tokenizer = transformers.AutoTokenizer.from_pretrained("cloudyu/DeepSeek-V4-Flash-4E")
tokens = tokenizer.encode(prompt)

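How to Run Locally

See the inference folder for detailed instructions on running DeepSeek-V4 locally, including model weight conversion and interactive chat demos.

For local deployment, we recommend the sampling parameters temperature = 1.0 and top_p = 1.0. For the Think Max reasoning mode, we recommend a context window of at least 384K tokens.

License

This repository and the model weights are licensed under the MIT License.

Contact

If you have any questions, please open an issue on HuggingFace or contact cloudyu.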
