DeepSeek-V4-Flash MXFP4 Experts + INT8 Dense

Experimental requantized DeepSeek-V4-Flash checkpoint for serving on Ampere GPUs with the AppMana vLLM fork. Routed experts are stored in MXFP4 and dense linear layers in INT8.

Source checkpoint: deepseek-ai/DeepSeek-V4-Flash@fd53f944496234770ba80e15004f9b6d269a71f5

Conversion:

CUDA_VISIBLE_DEVICES=1 .venv/bin/python tools/ampere/dsv4_requant_checkpoint.py \
  --src /home/administrator/inference/.cache/huggingface/models--deepseek-ai--DeepSeek-V4-Flash/snapshots/fd53f944496234770ba80e15004f9b6d269a71f5 \
  --dst /home/administrator/inference/deepseek-v4-flash-dsv4-mxfp4-int8-channel-vllm \
  --device cuda:0 \
  --dense-int8-strategy channel \
  --expert-format mxfp4 \
  --num-output-shards 72 \
  --overwrite
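
For readers unfamiliar with the two formats selected by --dense-int8-strategy channel and --expert-format mxfp4, the following is a minimal pure-Python sketch of the general techniques, not the conversion script's actual implementation. It assumes MXFP4 as defined in the OCP Microscaling spec (4-bit E2M1 elements sharing one E8M0 power-of-two scale per block) and symmetric INT8 with one scale per output channel, taking each weight row as an output channel. The function names are hypothetical, and nibble packing order is deliberately left out because it is implementation-specific.

```python
# E2M1 magnitudes indexed by the low 3 bits of a 4-bit code (OCP MX spec).
E2M1_MAGNITUDES = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)


def decode_e8m0(scale_byte):
    """Decode an E8M0 scale: an unsigned exponent with bias 127 (0xFF is NaN)."""
    if scale_byte == 0xFF:
        return float("nan")
    return 2.0 ** (scale_byte - 127)


def decode_mxfp4_block(codes, scale_byte):
    """Dequantize unpacked 4-bit E2M1 codes that share one E8M0 block scale."""
    scale = decode_e8m0(scale_byte)
    values = []
    for code in codes:
        sign = -1.0 if code & 0x8 else 1.0  # bit 3 is the sign bit
        values.append(sign * E2M1_MAGNITUDES[code & 0x7] * scale)
    return values


def quantize_int8_channelwise(weight):
    """Symmetric INT8 quantization with one scale per row.

    Assumption: each row of `weight` is one output channel.
    Returns (quantized_rows, scales) with row ~= q * scale.
    """
    quantized_rows, scales = [], []
    for row in weight:
        max_abs = max(abs(x) for x in row)
        scale = max_abs / 127.0 if max_abs > 0 else 1.0
        quantized_rows.append(
            [max(-127, min(127, round(x / scale))) for x in row]
        )
        scales.append(scale)
    return quantized_rows, scales
```

Because an E8M0 scale is a pure power of two, applying it only shifts the exponent of each decoded element, whereas the INT8 channel scales are ordinary floating-point multipliers.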

Expected canary values for a correctly converted checkpoint:

quantization_config.quant_method: dsv4_mxfp4_int8
quantization_config.format: mxfp4_int8_packed
Dense linear weights: signed INT8, channelwise scales
Routed experts: native MXFP4 with native E8M0 scales
expert_dtype: fp4
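
The metadata canaries above can be checked programmatically. The sketch below assumes they are stored in the checkpoint's config.json and that expert_dtype sits either at the top level or inside quantization_config; neither location is confirmed here, and the helper names are hypothetical. The dense and routed-expert weight formats are not checked, since that would require inspecting the tensors themselves.

```python
import json
from pathlib import Path

# Expected values taken from the canary list in this model card.
EXPECTED_QUANT = {
    "quant_method": "dsv4_mxfp4_int8",
    "format": "mxfp4_int8_packed",
}
EXPECTED_EXPERT_DTYPE = "fp4"


def check_canaries(config):
    """Return a list of mismatch messages; an empty list means all canaries match."""
    problems = []
    quant = config.get("quantization_config") or {}
    for key, expected in EXPECTED_QUANT.items():
        actual = quant.get(key)
        if actual != expected:
            problems.append(
                f"quantization_config.{key}: expected {expected!r}, got {actual!r}"
            )
    # Assumption: expert_dtype may be top-level or nested; accept either.
    expert_dtype = config.get("expert_dtype", quant.get("expert_dtype"))
    if expert_dtype != EXPECTED_EXPERT_DTYPE:
        problems.append(
            f"expert_dtype: expected {EXPECTED_EXPERT_DTYPE!r}, got {expert_dtype!r}"
        )
    return problems


def check_checkpoint(checkpoint_dir):
    """Load config.json from a checkpoint directory and check its canaries."""
    config_path = Path(checkpoint_dir) / "config.json"
    config = json.loads(config_path.read_text())
    return check_canaries(config)
```

Calling check_checkpoint on the --dst directory from the conversion command should return an empty list if the assumptions about config.json hold.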

This artifact is still under active development and should be evaluated for quality before production use.

Model size (Safetensors): 158B params

Tensor types: BF16, F32, F8_E8M0, I64