How to use cloudyu/DeepSeek-V4-Flash-4Expert with libraries and local inference apps.

Libraries

How to use cloudyu/DeepSeek-V4-Flash-4Expert with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="cloudyu/DeepSeek-V4-Flash-4Expert")

# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("cloudyu/DeepSeek-V4-Flash-4Expert")
model = AutoModelForCausalLM.from_pretrained("cloudyu/DeepSeek-V4-Flash-4Expert")

Local Apps

vLLM

How to use cloudyu/DeepSeek-V4-Flash-4Expert with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "cloudyu/DeepSeek-V4-Flash-4Expert"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "cloudyu/DeepSeek-V4-Flash-4Expert",
		"prompt": "Once upon a time,",
		"max_tokens": 512,
		"temperature": 0.5
	}'


SGLang

How to use cloudyu/DeepSeek-V4-Flash-4Expert with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "cloudyu/DeepSeek-V4-Flash-4Expert" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "cloudyu/DeepSeek-V4-Flash-4Expert",
		"prompt": "Once upon a time,",
		"max_tokens": 512,
		"temperature": 0.5
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "cloudyu/DeepSeek-V4-Flash-4Expert" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "cloudyu/DeepSeek-V4-Flash-4Expert",
		"prompt": "Once upon a time,",
		"max_tokens": 512,
		"temperature": 0.5
	}'

Docker Model Runner
How to use cloudyu/DeepSeek-V4-Flash-4Expert with Docker Model Runner:
```
docker model run hf.co/cloudyu/DeepSeek-V4-Flash-4Expert
```

What is DeepSeek-V4-Flash-4Expert?

DeepSeek-V4-Flash is a 284B-parameter Mixture-of-Experts (MoE) language model with 13B activated parameters and a context length of one million tokens. The original model activates 6 experts per token by default (num_experts_per_tok=6, written top_k=6 below).

cloudyu/DeepSeek-V4-Flash-4Expert is the same model with the number of activated experts per token reduced from 6 to 4; all other weights are identical. This change:

Reduces routed-expert compute by ~33% (4 instead of 6 active experts per token)
Cuts generation time by ~8–11% on the benchmarks below
Matches or improves aggregate accuracy on the code generation and knowledge benchmarks below
Uses the same FP4 + FP8 mixed precision format as the original
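
As a rough check on the ~33% figure, the sketch below computes the relative reduction in routed experts per token and compares it with the approximate activated-parameter totals this card reports (~13B and ~11B). The helper name `relative_reduction` is illustrative only. The smaller drop in activated parameters is presumably because weights that are active for every token regardless of top_k are unaffected.

```python
def relative_reduction(before: float, after: float) -> float:
    """Fraction by which a quantity shrinks going from `before` to `after`."""
    return (before - after) / before


# Routed experts per token: 6 in the original, 4 in this model.
experts = relative_reduction(6, 4)

# Approximate activated parameters as reported in this card (~13B vs ~11B).
params = relative_reduction(13e9, 11e9)

print(f"routed experts per token: {experts:.1%} fewer")
print(f"activated parameters:     {params:.1%} fewer")
```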

Key Changes from Original

Configuration	Original (top_k=6)	This Model (top_k=4)
`num_experts_per_tok`	6	4
Activated params	~13B	~11B
Total params	284B	284B
Routing method	`noaux_tc`	`noaux_tc`
All other weights	identical	identical

The tid2eid (expert routing) weight tensors have been cut down from [vocab_size, 6] to [vocab_size, 4]: only the first 4 columns are kept, in their original order. No additional training or fine-tuning was performed; this is purely an inference-time configuration change.
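
The change described above can be sketched on toy data. This is a minimal illustration, not the script used to produce this checkpoint: the function `reduce_top_k` is hypothetical, the table is a small nested list standing in for a real [vocab_size, 6] tensor, and the only configuration key assumed is `num_experts_per_tok`, which comes from the table above.

```python
import copy


def reduce_top_k(config: dict, tid2eid: list[list[int]], new_top_k: int = 4):
    """Return a copy of `config` with fewer experts per token, plus a
    routing table that keeps only the first `new_top_k` columns of each row."""
    old_top_k = config["num_experts_per_tok"]
    if not 0 < new_top_k <= old_top_k:
        raise ValueError(f"new_top_k must be in 1..{old_top_k}, got {new_top_k}")

    new_config = copy.deepcopy(config)
    new_config["num_experts_per_tok"] = new_top_k

    # Keep the leading columns in their original order; nothing is re-sorted.
    new_table = [row[:new_top_k] for row in tid2eid]
    return new_config, new_table


# Toy stand-ins: 3 "token ids", each mapped to 6 expert ids.
config = {"num_experts_per_tok": 6}
tid2eid = [
    [10, 11, 12, 13, 14, 15],
    [20, 21, 22, 23, 24, 25],
    [30, 31, 32, 33, 34, 35],
]

new_config, new_table = reduce_top_k(config, tid2eid, new_top_k=4)
print(new_config)    # the copy now carries num_experts_per_tok = 4
print(new_table[0])  # first four expert ids of the first row
```

With a real tensor library the same slice would typically be written as `tensor[:, :4]`; the nested-list form is used here only to keep the sketch dependency-free.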

Independent Evaluation Results

We evaluated the model against the original top_k=6 configuration on two benchmarks: HumanEval (code generation) and MMLU-Pro (multi-domain knowledge).

HumanEval (Pass@1)

Configuration	Pass@1	Generation Time
Top_k=4 (this model)	95.73% (157/164)	56.83s
Top_k=6 (original)	95.73% (157/164)	64.06s

Identical accuracy on code generation
11.3% less generation time (12.7% higher throughput)

MMLU-Pro (Accuracy)

Configuration	Accuracy	Generation Time
Top_k=4 (this model)	41.46% (4988/12032)	78.24s
Top_k=6 (original)	37.77% (4545/12032)	85.16s

+3.69 percentage points higher accuracy
8.1% less generation time (8.8% higher throughput)
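
A timing difference can be stated either as a reduction in wall-clock time or as a gain in throughput, and the two give different percentages. The sketch below computes both from the generation times in the two tables above. The helper `compare_times` is illustrative, and the throughput figure assumes both configurations generated a comparable amount of output, which this card does not state.

```python
def compare_times(baseline_s: float, new_s: float) -> tuple[float, float]:
    """Return (time_reduction, throughput_gain) as fractions.

    time_reduction  = share of the baseline wall-clock time that was saved
    throughput_gain = relative increase in work per second, assuming equal work
    """
    time_reduction = (baseline_s - new_s) / baseline_s
    throughput_gain = baseline_s / new_s - 1.0
    return time_reduction, throughput_gain


# (top_k=6 seconds, top_k=4 seconds), copied from the tables above.
runs = {"HumanEval": (64.06, 56.83), "MMLU-Pro": (85.16, 78.24)}

for name, (top6_s, top4_s) in runs.items():
    reduction, gain = compare_times(top6_s, top4_s)
    print(f"{name}: {reduction:.1%} less time, {gain:.1%} higher throughput")
```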

Category Breakdown (MMLU-Pro)

Category	top_k=4	top_k=6	Delta
biology	68.62%	72.66%	−4.04pp
business	39.04%	21.67%	+17.36pp
chemistry	14.58%	7.16%	+7.42pp
computer science	47.80%	44.63%	+3.17pp
economics	66.35%	65.05%	+1.30pp
engineering	25.39%	13.21%	+12.18pp
health	59.54%	63.08%	−3.55pp
history	50.13%	59.58%	−9.45pp
law	33.51%	35.88%	−2.36pp
math	28.13%	15.47%	+12.66pp
other	55.09%	56.71%	−1.62pp
philosophy	53.91%	55.71%	−1.80pp
physics	20.32%	14.55%	+5.77pp
psychology	69.17%	71.93%	−2.76pp

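The top_k=4 configuration scores higher in seven categories (business, math, engineering, chemistry, physics, computer science, economics), with gains of up to 17.36pp, and lower in the other seven (history, biology, health, psychology, law, philosophy, other), with losses ranging from 1.62pp to 9.45pp.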
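
To make the split easier to inspect, the sketch below recomputes the per-category deltas from the rounded accuracies in the table and sorts the categories into improved and regressed groups. The function name `split_by_delta` is illustrative. Because it works from the rounded percentages, a few deltas can differ from the table's Delta column by 0.01pp.

```python
# (top_k=4 accuracy, top_k=6 accuracy) in percent, copied from the table above.
SCORES = {
    "biology": (68.62, 72.66),
    "business": (39.04, 21.67),
    "chemistry": (14.58, 7.16),
    "computer science": (47.80, 44.63),
    "economics": (66.35, 65.05),
    "engineering": (25.39, 13.21),
    "health": (59.54, 63.08),
    "history": (50.13, 59.58),
    "law": (33.51, 35.88),
    "math": (28.13, 15.47),
    "other": (55.09, 56.71),
    "philosophy": (53.91, 55.71),
    "physics": (20.32, 14.55),
    "psychology": (69.17, 71.93),
}


def split_by_delta(scores: dict):
    """Return (deltas, improved, regressed), each list sorted by effect size."""
    deltas = {name: round(top4 - top6, 2) for name, (top4, top6) in scores.items()}
    improved = sorted((n for n, d in deltas.items() if d > 0), key=deltas.get, reverse=True)
    regressed = sorted((n for n, d in deltas.items() if d < 0), key=deltas.get)
    return deltas, improved, regressed


deltas, improved, regressed = split_by_delta(SCORES)
print("improved: ", [(n, deltas[n]) for n in improved])
print("regressed:", [(n, deltas[n]) for n in regressed])
```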

Summary

Top_k=4 comes out ahead on the aggregate metrics: equal HumanEval Pass@1, higher overall MMLU-Pro accuracy, and shorter generation time; with fewer experts read per token, memory bandwidth use should also be lower
The improvement is most pronounced on business, math, and engineering questions (+12pp to +17pp)
The original top_k=6 configuration scores higher in the humanities, life sciences, and social science categories, most notably history (+9.45pp)
For production deployment, top_k=4 is the recommended configuration

Full evaluation reports, scripts, and raw results are available in the eval/ directory of this repository.

Model Downloads

Model	#Total Params	#Activated Params	Context Length	Precision	Download
DeepSeek-V4-Flash (original)	284B	13B (top_k=6)	1M	FP4 + FP8 Mixed	HuggingFace
DeepSeek-V4-Flash-4Expert (this)	284B	~11B (top_k=4)	1M	FP4 + FP8 Mixed	HuggingFace

Chat Template

This release does not include a Jinja-format chat template. Instead, we provide a dedicated encoding folder with Python scripts and test cases demonstrating how to encode messages in OpenAI-compatible format into input strings for the model, and how to parse the model's text output. Please refer to the encoding folder for full documentation.

A brief example:

from encoding_dsv4 import encode_messages, parse_message_from_completion_text

messages = [
    {"role": "user", "content": "hello"},
    {"role": "assistant", "content": "Hello! I am DeepSeek.", "reasoning_content": "thinking..."},
    {"role": "user", "content": "1+1=?"}
]

# messages -> string
prompt = encode_messages(messages, thinking_mode="thinking")

# string -> tokens
import transformers
tokenizer = transformers.AutoTokenizer.from_pretrained("cloudyu/DeepSeek-V4-Flash-4Expert")
tokens = tokenizer.encode(prompt)

How to Run Locally

Please refer to the inference folder for detailed instructions on running DeepSeek-V4 locally, including model weight conversion and interactive chat demos.

For local deployment, we recommend setting the sampling parameters to temperature = 1.0, top_p = 1.0. For the Think Max reasoning mode, we recommend setting the context window to at least 384K tokens.
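
As one way to apply the recommended sampling parameters, the sketch below builds a request for the OpenAI-compatible `/v1/completions` endpoint shown in the vLLM section, with `temperature` and `top_p` both set to 1.0. The helper `build_completion_request`, the default port, and the `max_tokens` value are assumptions for illustration. The request is only constructed, not sent, so the snippet runs without a server.

```python
import json
import urllib.request


def build_completion_request(
    prompt: str,
    base_url: str = "http://localhost:8000",  # assumed: default vLLM port from the section above
    max_tokens: int = 512,
) -> urllib.request.Request:
    """Build (but do not send) a completions request with the recommended sampling settings."""
    payload = {
        "model": "cloudyu/DeepSeek-V4-Flash-4Expert",
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": 1.0,  # recommended in this card
        "top_p": 1.0,        # recommended in this card
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_completion_request("Once upon a time,")
print(request.full_url)
print(json.loads(request.data))

# To send it against a running server:
# with urllib.request.urlopen(request) as response:
#     print(json.load(response)["choices"][0]["text"])
```

For chat-style use, the `prompt` argument would be the string returned by `encode_messages` from the encoding folder rather than raw text.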

License

This repository and the model weights are licensed under the MIT License.

Contact

If you have any questions, please raise an issue or contact cloudyu on HuggingFace.

DeepSeek-V4-Flash-4E: Chinese README (English translation)

A variant of DeepSeek-V4-Flash that lowers top_k from 6 to 4 to improve inference efficiency.

HuggingFace: cloudyu/DeepSeek-V4-Flash-4Expert

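Overview

DeepSeek-V4-Flash is a 284B-parameter Mixture-of-Experts (MoE) language model with 13B activated parameters and a context length of one million tokens. The original model uses num_experts_per_tok=6 by default.

DeepSeek-V4-Flash-4E reduces the number of experts activated per token from 6 to 4 and leaves all other weights unchanged. This change:

Reduces routed-expert compute by ~33% (fewer active experts)
Cuts generation time by ~8–11%
Keeps aggregate accuracy unchanged or higher
Keeps the original FP4 + FP8 mixed precision format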

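Key Differences from the Original Model

Configuration	Original (top_k=6)	This Model (top_k=4)
`num_experts_per_tok`	6	4
Activated params	~13B	~11B
Total params	284B	284B
Routing method	`noaux_tc`	`noaux_tc`
All other weights	identical	identical

The tid2eid (expert routing) weight tensors have been cut down from [vocab_size, 6] to [vocab_size, 4]: only the first 4 columns are kept, in their original order. No additional training or fine-tuning was performed; this is purely an inference-time configuration change.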

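Independent Evaluation Results

We compared the two configurations on two benchmarks: HumanEval (code generation) and MMLU-Pro (multi-domain knowledge question answering).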

HumanEval (Pass@1)

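Configuration	Pass@1	Generation Time
Top_k=4 (this model)	95.73% (157/164)	56.83s
Top_k=6 (original)	95.73% (157/164)	64.06s

Identical accuracy on code generation
11.3% less generation time (12.7% higher throughput)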

MMLU-Pro (Accuracy)

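Configuration	Accuracy	Generation Time
Top_k=4 (this model)	41.46% (4988/12032)	78.24s
Top_k=6 (original)	37.77% (4545/12032)	85.16s

+3.69 percentage points higher accuracy
8.1% less generation time (8.8% higher throughput)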

Category Breakdown (MMLU-Pro)

Category	top_k=4	top_k=6	Delta
biology	68.62%	72.66%	−4.04pp
business	39.04%	21.67%	+17.36pp
chemistry	14.58%	7.16%	+7.42pp
computer science	47.80%	44.63%	+3.17pp
economics	66.35%	65.05%	+1.30pp
engineering	25.39%	13.21%	+12.18pp
health	59.54%	63.08%	−3.55pp
history	50.13%	59.58%	−9.45pp
law	33.51%	35.88%	−2.36pp
math	28.13%	15.47%	+12.66pp
other	55.09%	56.71%	−1.62pp
philosophy	53.91%	55.71%	−1.80pp
physics	20.32%	14.55%	+5.77pp
psychology	69.17%	71.93%	−2.76pp

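STEM and business categories (math, engineering, business, chemistry, physics, computer science) improve markedly with top_k=4, while the humanities and life sciences categories regress, by up to 9.45pp in history.

Summary

Top_k=4 comes out ahead on the aggregate metrics: equal or higher accuracy, faster inference, and lower memory bandwidth use
The advantage is most pronounced on math, engineering, and business reasoning tasks
The original top_k=6 configuration scores higher only in the humanities and life sciences categories
For production deployment, top_k=4 is the recommended configuration

Full evaluation reports, scripts, and raw data are in the eval/ directory of this repository.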

Model Downloads

Model	#Total Params	#Activated Params	Context Length	Precision	Download
DeepSeek-V4-Flash (original)	284B	13B (top_k=6)	1M	FP4 + FP8 Mixed	HuggingFace
DeepSeek-V4-Flash-4E (this model)	284B	~11B (top_k=4)	1M	FP4 + FP8 Mixed	HuggingFace

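Chat Template

This repository does not include a Jinja-format chat template. Instead, it provides a dedicated encoding folder with Python scripts and test cases showing how to encode OpenAI-compatible messages into model input and how to parse the model's output. Please refer to the encoding folder for full documentation.

A brief example: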

from encoding_dsv4 import encode_messages, parse_message_from_completion_text

messages = [
    {"role": "user", "content": "你好"},
    {"role": "assistant", "content": "你好！我是 DeepSeek。", "reasoning_content": "思考中..."},
    {"role": "user", "content": "1+1=?"}
]

# messages -> string
prompt = encode_messages(messages, thinking_mode="thinking")

# string -> tokens
import transformers
tokenizer = transformers.AutoTokenizer.from_pretrained("cloudyu/DeepSeek-V4-Flash-4Expert")
tokens = tokenizer.encode(prompt)

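How to Run Locally

Please refer to the inference folder for detailed instructions on running DeepSeek-V4 locally, including model weight conversion and interactive chat demos.

For local deployment, we recommend setting the sampling parameters to temperature = 1.0, top_p = 1.0. For the Think Max reasoning mode, we recommend a context window of at least 384K tokens.

License

This repository and the model weights are licensed under the MIT License.

Contact

If you have any questions, please raise an issue on HuggingFace or contact cloudyu.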

Safetensors

Model size: 158B params
Tensor types: BF16, I64, F32, F8_E8M0, F8_E4M3