DeepSeek-V4-Flash-TQ3-native

Native TQ3 conversion of deepseek-ai/DeepSeek-V4-Flash using turboquant-vllm.

This is a Gate C conversion artifact: structure and checkpoint packaging are validated, while quality and serving benchmarks are tracked separately.

Compression

Source checkpoint: 159.62 GB of FP4 expert + FP8 dense/attention weights
TQ3 safetensors output: 131.28 GB
Compression vs FP4+FP8 source: 1.2159x
Approximate compression vs BF16-equivalent: 2.4318x

The FP4+FP8 source ratio is the apples-to-apples number for comparisons with other downstream conversions from the same DeepSeek release.
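The ratios above can be reproduced with a few lines of arithmetic. This is a minimal sketch using only the sizes quoted in this section; the helper name compression_ratio is illustrative, and the BF16-equivalent size is not stated in this card, so it is backed out of the stated approximate ratio rather than measured.

```python
# Sizes in GB, copied from the figures quoted above.
SOURCE_GB = 159.62          # FP4 expert + FP8 dense/attention weights
TQ3_GB = 131.28             # TQ3 safetensors output
STATED_BF16_RATIO = 2.4318  # approximate ratio quoted above


def compression_ratio(before_gb: float, after_gb: float) -> float:
    """Return how many times smaller the output is than the input."""
    return before_gb / after_gb


source_ratio = compression_ratio(SOURCE_GB, TQ3_GB)

# The card gives no BF16-equivalent size, so derive the size that the
# stated ratio implies instead of assuming one.
implied_bf16_gb = STATED_BF16_RATIO * TQ3_GB

print(f"vs FP4+FP8 source: {source_ratio:.4f}x")
print(f"BF16-equivalent size implied by stated ratio: {implied_bf16_gb:.2f} GB")
```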

Conversion

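from turboquant_vllm.checkpoint import save_tq3_checkpoint

# Convert the source checkpoint and write the TQ3 output to a local directory.
save_tq3_checkpoint(
    "deepseek-ai/DeepSeek-V4-Flash",  # source model repo
    "./deepseek-v4-flash-tq3",        # output directory
    bits=3,
    group_size=128,
    trust_remote_code=True,
)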

The converter removes the source quantization_config from the output config.json and uses tq_config.json as the TurboQuant marker.
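A quick way to confirm this layout on a converted directory is to check both conditions directly. The sketch below is not part of turboquant-vllm: the helper name looks_like_tq_checkpoint is hypothetical, and it only tests that tq_config.json exists, since its contents are not described here.

```python
import json
from pathlib import Path


def looks_like_tq_checkpoint(checkpoint_dir) -> bool:
    """Check the two packaging properties described above.

    Hypothetical helper: returns True only if config.json exists without a
    quantization_config key and tq_config.json is present next to it.
    """
    root = Path(checkpoint_dir)
    config_path = root / "config.json"
    marker_path = root / "tq_config.json"
    if not config_path.is_file() or not marker_path.is_file():
        return False
    config = json.loads(config_path.read_text(encoding="utf-8"))
    return "quantization_config" not in config
```

For the conversion shown earlier, the call would be looks_like_tq_checkpoint("./deepseek-v4-flash-tq3").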

Serving status on vLLM 0.21 (May 2026)

This checkpoint cannot currently be served end-to-end on stock vLLM 0.21 + transformers 4.57. Five distinct blockers in vllm/model_executor/models/deepseek_v4.py surface when loading a non-canonical (non-FP8-native) DeepSeek-V4 (DSV4) checkpoint. Three are real upstream bugs, filed with proposed fixes:

vllm-project/vllm#42741 — config.compress_ratios access fails when transformers ≥ 4.57 normalizes that legacy field into layer_types + compress_rates.
vllm-project/vllm#42769 — UnboundLocalError: name_mapped in expert weight loader when no entry in expert_mapping matches the incoming tensor name.
vllm-project/vllm#42777 — WeightsMapper suffix rule head.weight → lm_head.weight is non-idempotent; canonical lm_head.weight becomes lm_lm_head.weight.
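Two of these failure modes are easy to reproduce in isolation. The sketch below is standalone Python, not vLLM's code: apply_suffix_rule, apply_suffix_rule_guarded, and map_expert_name are simplified stand-ins written for illustration, and the guard shown is one possible way to make the rule idempotent, not necessarily the fix proposed upstream.

```python
def apply_suffix_rule(name: str, old_suffix: str, new_suffix: str) -> str:
    """Naive suffix rewrite: swap old_suffix for new_suffix at the end of name."""
    if name.endswith(old_suffix):
        return name[: len(name) - len(old_suffix)] + new_suffix
    return name


def apply_suffix_rule_guarded(name: str, old_suffix: str, new_suffix: str) -> str:
    """Same rewrite, but skip names that already carry the new suffix."""
    if name.endswith(new_suffix):
        return name
    return apply_suffix_rule(name, old_suffix, new_suffix)


def map_expert_name(name: str, expert_mapping) -> str:
    """Simplified pattern behind the UnboundLocalError.

    name_mapped is assigned only inside the loop, so it is never bound
    when no (param_name, weight_name) pair matches the tensor name.
    """
    for param_name, weight_name in expert_mapping:
        if weight_name in name:
            name_mapped = name.replace(weight_name, param_name)
            break
    return name_mapped


# A source-style name is rewritten once, as intended.
once = apply_suffix_rule("head.weight", "head.weight", "lm_head.weight")
# A canonical name also ends in "head.weight", so the naive rule fires again.
twice = apply_suffix_rule("lm_head.weight", "head.weight", "lm_head.weight")
# The guarded variant leaves the canonical name alone.
guarded = apply_suffix_rule_guarded("lm_head.weight", "head.weight", "lm_head.weight")

print(once, twice, guarded)
```

Calling map_expert_name with a mapping that contains no matching entry raises UnboundLocalError, which mirrors the second bug in the list.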

The other two are not upstream bugs:

kv_cache_dtype defaults to auto, but DSV4 only supports fp8. Pass kv_cache_dtype="fp8" to the LLM(...) constructor.
config.quantization_config["scale_fmt"] is read unconditionally at attention-init even though self.scale_fmt is never used downstream. Workaround: monkey-patch DeepseekV4Config.__init__ to inject quantization_config = {"scale_fmt": "ue8m0"} if missing.
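The config shim described above can be sketched as a small wrapper around a config class's __init__. This is an illustration under stated assumptions: the import path of DeepseekV4Config is not given in this card, so the example patches a stand-in class (StandInConfig), and the helper name inject_scale_fmt is hypothetical.

```python
import functools


def inject_scale_fmt(config_cls, scale_fmt="ue8m0"):
    """Wrap config_cls.__init__ so quantization_config is filled in if missing."""
    original_init = config_cls.__init__

    @functools.wraps(original_init)
    def patched_init(self, *args, **kwargs):
        original_init(self, *args, **kwargs)
        # Only inject when the checkpoint's config carries no quantization_config.
        if getattr(self, "quantization_config", None) is None:
            self.quantization_config = {"scale_fmt": scale_fmt}

    config_cls.__init__ = patched_init
    return config_cls


class StandInConfig:
    """Stand-in for the real config class; stores keyword arguments as attributes."""

    def __init__(self, **kwargs):
        for key, value in kwargs.items():
            setattr(self, key, value)


inject_scale_fmt(StandInConfig)
patched = StandInConfig(hidden_size=8)
untouched = StandInConfig(quantization_config={"scale_fmt": "other"})

print(patched.quantization_config, untouched.quantization_config)
```

With the real class, the same call would be made on DeepseekV4Config before constructing LLM(...) with kv_cache_dtype="fp8". Wrapping __init__ rather than editing the checkpoint keeps config.json free of a quantization_config entry.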

Status: the converter (save_tq3_checkpoint) is fully validated end-to-end, and the published checkpoint is structurally correct: the TQ3 packed expert tensors register on the model, there are 0 leaked source .scale sidecars, and the compression matches the math. The remaining blocker is the fragility of vLLM's DSV4 model module when given any non-canonical checkpoint shape. Once fixes for the three filed bugs land upstream (or go in as one small batch PR), this checkpoint should serve with --quantization turboquant, with the two workarounds above applied.
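The sidecar check mentioned above can be approximated by scanning a sharded checkpoint's index for leftover scale tensors. This sketch makes two assumptions that are not confirmed here: that the checkpoint ships the usual Hugging Face model.safetensors.index.json with a weight_map, and that leaked source sidecars are tensors whose names end in ".scale". The helper name find_leaked_scale_tensors is hypothetical.

```python
import json
from pathlib import Path


def find_leaked_scale_tensors(index_path):
    """Return tensor names ending in '.scale' from a safetensors index file.

    Assumes the index is JSON with a 'weight_map' dict of
    tensor name -> shard filename.
    """
    index = json.loads(Path(index_path).read_text(encoding="utf-8"))
    weight_map = index.get("weight_map", {})
    return sorted(name for name in weight_map if name.endswith(".scale"))
```

An empty list from find_leaked_scale_tensors("./deepseek-v4-flash-tq3/model.safetensors.index.json") would correspond to the "0 leaked" result reported above, assuming the naming convention holds.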

The local workarounds we run during setup (the two shims plus the source-file patch) are in the Gate D eval harness under scripts/gpu/deepseek-v4-flash-eval/: see setup.sh (source-file patch) and eval.py (config shim).

Quality

Pending:

Wikitext-2 perplexity
GSM8K-200
fixed-prompt side-by-side judging
