DeepSeek-V4-Flash-TQ3-native

Native TQ3 conversion of deepseek-ai/DeepSeek-V4-Flash using turboquant-vllm.

This is a Gate C conversion artifact: structure and checkpoint packaging are validated, while quality and serving benchmarks are tracked separately.

Compression

  • Source checkpoint: 159.62 GB of FP4 expert + FP8 dense/attention weights
  • TQ3 safetensors output: 131.28 GB
  • Compression vs FP4+FP8 source: 1.2159x
  • Approximate compression vs BF16-equivalent: 2.4318x

The FP4+FP8 source ratio is the apples-to-apples number for comparisons with other downstream conversions from the same DeepSeek release.
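The ratios above can be reproduced from the two sizes. This is a minimal sketch, not converter output. It assumes the approximate BF16 figure is simply twice the rounded FP4+FP8 ratio; that is inferred from the listed numbers, not stated by the converter.

```python
# Sizes in GB, as listed above.
source_gb = 159.62  # FP4 expert + FP8 dense/attention weights
tq3_gb = 131.28     # TQ3 safetensors output

# Apples-to-apples ratio against the FP4+FP8 source.
ratio_vs_source = source_gb / tq3_gb

# Assumption: the BF16-equivalent figure is twice the rounded source ratio.
ratio_vs_bf16 = 2 * round(ratio_vs_source, 4)

print(f"vs FP4+FP8 source:  {ratio_vs_source:.4f}x")
print(f"vs BF16-equivalent: {ratio_vs_bf16:.4f}x")
```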

Conversion

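from turboquant_vllm.checkpoint import save_tq3_checkpoint

save_tq3_checkpoint(
    "deepseek-ai/DeepSeek-V4-Flash",  # source checkpoint (Hugging Face repo ID)
    "./deepseek-v4-flash-tq3",        # output directory for the TQ3 checkpoint
    bits=3,                           # 3-bit quantization (the "3" in TQ3)
    group_size=128,
    trust_remote_code=True,
)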

The output removes the source quantization_config from config.json and uses tq_config.json as the TurboQuant marker.
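A quick way to sanity-check that layout on a local copy is sketched below. The helper name `looks_like_tq3_checkpoint` is hypothetical and not part of turboquant-vllm. It only checks the two facts stated above: that tq_config.json is present and that config.json has no quantization_config key.

```python
import json
from pathlib import Path


def looks_like_tq3_checkpoint(checkpoint_dir):
    """Return True if the directory matches the layout described above."""
    root = Path(checkpoint_dir)
    config_path = root / "config.json"
    marker_path = root / "tq_config.json"  # TurboQuant marker file
    if not config_path.is_file() or not marker_path.is_file():
        return False
    config = json.loads(config_path.read_text(encoding="utf-8"))
    # The converter strips the source quantization_config from config.json.
    return "quantization_config" not in config
```

For example, `looks_like_tq3_checkpoint("./deepseek-v4-flash-tq3")` would be expected to return True for the output of the conversion call above.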

Serving status on vLLM 0.21 (May 2026)

This checkpoint cannot currently be served end-to-end on stock vLLM 0.21 + transformers 4.57. Loading a non-canonical (non-FP8-native) DSV4 checkpoint hits five distinct failures in vllm/model_executor/models/deepseek_v4.py. Three are upstream bugs, filed with proposed fixes:

  • vllm-project/vllm#42741: config.compress_ratios access fails when transformers ≥ 4.57 normalizes that legacy field into layer_types + compress_rates.
  • vllm-project/vllm#42769: UnboundLocalError: name_mapped in expert weight loader when no entry in expert_mapping matches the incoming tensor name.
  • vllm-project/vllm#42777: WeightsMapper suffix rule head.weight → lm_head.weight is non-idempotent; canonical lm_head.weight becomes lm_lm_head.weight.
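The last two failure modes are easy to reproduce in isolation. The toy functions below are illustrative stand-ins, not vLLM's actual WeightsMapper or expert loader, and the idempotent variant is just one possible guard, not necessarily the fix proposed upstream.

```python
def apply_suffix_rule(name, old_suffix="head.weight", new_suffix="lm_head.weight"):
    """Naive suffix rewrite: replace old_suffix at the end of name."""
    if name.endswith(old_suffix):
        return name[: -len(old_suffix)] + new_suffix
    return name


def apply_suffix_rule_idempotent(name, old_suffix="head.weight", new_suffix="lm_head.weight"):
    """Same rule, but leave names that already carry the target suffix alone."""
    if name.endswith(new_suffix):
        return name
    return apply_suffix_rule(name, old_suffix, new_suffix)


def map_expert_name(name, expert_mapping):
    """Toy loader loop: name_mapped is only bound if some entry matches."""
    for param_name, weight_name in expert_mapping:
        if weight_name in name:
            name_mapped = name.replace(weight_name, param_name)
            break
    return name_mapped  # raises UnboundLocalError if the loop never matched
```

Because "lm_head.weight" itself ends in "head.weight", the naive rule rewrites an already-canonical name a second time, while the guarded variant leaves it unchanged. Calling `map_expert_name` with a tensor name that matches no mapping entry raises UnboundLocalError, mirroring the second bug.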

The other two are not upstream bugs:

  • kv_cache_dtype defaults to auto, but DSV4 only supports fp8. Pass kv_cache_dtype="fp8" to the LLM(...) constructor.
  • config.quantization_config["scale_fmt"] is read unconditionally at attention-init even though self.scale_fmt is never used downstream. Workaround: monkey-patch DeepseekV4Config.__init__ to inject quantization_config = {"scale_fmt": "ue8m0"} if missing.
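The config shim can be sketched as a small wrapper around a config class's __init__. The helper name `inject_scale_fmt` is hypothetical, and the import path of DeepseekV4Config is deliberately left to the reader because it depends on the installed vLLM/transformers versions.

```python
import functools


def inject_scale_fmt(config_cls, scale_fmt="ue8m0"):
    """Patch config_cls.__init__ to add quantization_config when it is missing."""
    original_init = config_cls.__init__

    @functools.wraps(original_init)
    def patched_init(self, *args, **kwargs):
        original_init(self, *args, **kwargs)
        # Only inject when the checkpoint's config carries no quantization_config.
        if getattr(self, "quantization_config", None) is None:
            self.quantization_config = {"scale_fmt": scale_fmt}

    config_cls.__init__ = patched_init
    return config_cls
```

In use, one would call `inject_scale_fmt(DeepseekV4Config)` before constructing the engine, then pass kv_cache_dtype="fp8" to LLM(...) as described above.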

Status: the converter (save_tq3_checkpoint) is validated end-to-end, and the published checkpoint is structurally correct: TQ3 packed expert tensors register on the model, no source .scale sidecars leak into the output, and the compression ratio matches the expected math. The only blocker is that vLLM's DSV4 model module is fragile against any non-canonical checkpoint shape. Once the three filed bugs are fixed upstream (or land as one small batch PR), this checkpoint should serve cleanly with --quantization turboquant.
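The "no leaked .scale sidecars" check can be approximated from the shard index alone. This sketch assumes the checkpoint ships the standard sharded-safetensors index file (model.safetensors.index.json) and that TurboQuant's own tensors do not use names ending in ".scale". Both are assumptions, and the helper name is hypothetical.

```python
import json
from pathlib import Path


def find_leaked_scale_keys(checkpoint_dir, index_name="model.safetensors.index.json"):
    """List tensor names in the shard index that end in '.scale'."""
    index_path = Path(checkpoint_dir) / index_name
    index = json.loads(index_path.read_text(encoding="utf-8"))
    # weight_map maps each tensor name to the shard file that stores it.
    return sorted(name for name in index["weight_map"] if name.endswith(".scale"))
```

An empty list from `find_leaked_scale_keys("./deepseek-v4-flash-tq3")` would be consistent with the structural check described above.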

Local workaround scripts for the two shims and the source-file patch we run during setup are in the Gate D eval harness under scripts/gpu/deepseek-v4-flash-eval/: see setup.sh (source-file patch) and eval.py (config shim).

Quality

Pending:

  • Wikitext-2 perplexity
  • GSM8K-200
  • fixed-prompt side-by-side judging