DeepSeek-V4-Flash-TQ3-native
Native TQ3 conversion of deepseek-ai/DeepSeek-V4-Flash using turboquant-vllm.
This is a Gate C conversion artifact: structure and checkpoint packaging are validated, while quality and serving benchmarks are tracked separately.
Compression
- Source checkpoint: 159.62 GB of FP4 expert + FP8 dense/attention weights
- TQ3 safetensors output: 131.28 GB
- Compression vs FP4+FP8 source: 1.2159x
- Approximate compression vs BF16-equivalent: 2.4318x
The FP4+FP8 source ratio is the apples-to-apples number for comparisons with other downstream conversions from the same DeepSeek release.
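The ratios above follow directly from the two reported sizes. Below is a minimal sketch of that arithmetic. It assumes the BF16-equivalent figure treats the checkpoint as twice the FP4+FP8 source size; this is inferred from the two quoted ratios and is not stated explicitly in this card. Because the sizes are rounded to two decimals, the last digit may differ slightly from the figures quoted above.

```python
# Sizes reported in this card, in GB.
SOURCE_GB = 159.62   # FP4 expert + FP8 dense/attention weights
TQ3_GB = 131.28      # TQ3 safetensors output

# Assumption (not stated in the card): BF16-equivalent size is 2x the source size.
BF16_EQUIV_GB = 2 * SOURCE_GB

ratio_vs_source = SOURCE_GB / TQ3_GB
ratio_vs_bf16 = BF16_EQUIV_GB / TQ3_GB

print(f"vs FP4+FP8 source:  {ratio_vs_source:.4f}x")
print(f"vs BF16-equivalent: {ratio_vs_bf16:.4f}x")
```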
Conversion
from turboquant_vllm.checkpoint import save_tq3_checkpoint

save_tq3_checkpoint(
    "deepseek-ai/DeepSeek-V4-Flash",
    "./deepseek-v4-flash-tq3",
    bits=3,
    group_size=128,
    trust_remote_code=True,
)
The output removes the source quantization_config from config.json and uses
tq_config.json as the TurboQuant marker.
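A quick way to confirm both properties on a local copy is to inspect the output directory. The helper below is a hypothetical sketch, not part of turboquant-vllm. It only checks that config.json has no quantization_config key and that tq_config.json exists; it does not validate the contents of tq_config.json.

```python
import json
from pathlib import Path


def check_tq3_markers(checkpoint_dir):
    """Return a list of problems; an empty list means both checks passed."""
    root = Path(checkpoint_dir)
    problems = []

    config_path = root / "config.json"
    if not config_path.is_file():
        problems.append("config.json is missing")
    else:
        config = json.loads(config_path.read_text())
        # The source quantization_config should have been removed.
        if "quantization_config" in config:
            problems.append("config.json still contains quantization_config")

    # tq_config.json is the TurboQuant marker file.
    if not (root / "tq_config.json").is_file():
        problems.append("tq_config.json is missing")

    return problems
```

For example, check_tq3_markers("./deepseek-v4-flash-tq3") should return an empty list for a directory produced by the conversion above.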
Serving status on vLLM 0.21 (May 2026)
This checkpoint cannot currently be served end-to-end on stock vLLM 0.21 + transformers 4.57. Five distinct walls in vllm/model_executor/models/deepseek_v4.py surface when loading a non-canonical (non-FP8-native) DSV4 checkpoint. Three are real upstream bugs filed with proposed fixes:
- vllm-project/vllm#42741: config.compress_ratios access fails when transformers ≥ 4.57 normalizes that legacy field into layer_types + compress_rates.
- vllm-project/vllm#42769: UnboundLocalError: name_mapped in expert weight loader when no entry in expert_mapping matches the incoming tensor name.
- vllm-project/vllm#42777: WeightsMapper suffix rule head.weight → lm_head.weight is non-idempotent; canonical lm_head.weight becomes lm_lm_head.weight.
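The third bug is easy to reproduce in isolation. The sketch below is not vLLM's WeightsMapper; it is a standalone illustration of why a plain suffix rewrite is non-idempotent when the target suffix itself ends with the source suffix. The guarded variant is one possible way to avoid the double prefix and is not necessarily the fix proposed in the issue.

```python
def apply_suffix_rule(name, old_suffix="head.weight", new_suffix="lm_head.weight"):
    """Naive rewrite: replace old_suffix at the end of name with new_suffix."""
    if name.endswith(old_suffix):
        return name[: -len(old_suffix)] + new_suffix
    return name


def apply_suffix_rule_guarded(name, old_suffix="head.weight", new_suffix="lm_head.weight"):
    """Same rewrite, but leave names that already end with the target suffix."""
    if name.endswith(new_suffix):
        return name
    return apply_suffix_rule(name, old_suffix, new_suffix)


print(apply_suffix_rule("head.weight"))             # lm_head.weight
print(apply_suffix_rule("lm_head.weight"))          # lm_lm_head.weight
print(apply_suffix_rule_guarded("lm_head.weight"))  # lm_head.weight
```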
The other two are not upstream bugs:
- kv_cache_dtype defaults to auto, but DSV4 only supports fp8. Pass kv_cache_dtype="fp8" to the LLM(...) constructor.
- config.quantization_config["scale_fmt"] is read unconditionally at attention-init even though self.scale_fmt is never used downstream. Workaround: monkey-patch DeepseekV4Config.__init__ to inject quantization_config = {"scale_fmt": "ue8m0"} if missing.
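The config shim can be sketched as a small __init__ wrapper. The helper name inject_default_scale_fmt is hypothetical, and StandInConfig is a placeholder: this card does not give the import path of DeepseekV4Config, so the example patches a stand-in class instead. The shim in eval.py may differ in detail.

```python
import functools


def inject_default_scale_fmt(config_cls, scale_fmt="ue8m0"):
    """Wrap config_cls.__init__ so quantization_config is set when missing."""
    original_init = config_cls.__init__

    @functools.wraps(original_init)
    def patched_init(self, *args, **kwargs):
        original_init(self, *args, **kwargs)
        # Only inject when the checkpoint's config carries no quantization_config.
        if getattr(self, "quantization_config", None) is None:
            self.quantization_config = {"scale_fmt": scale_fmt}

    config_cls.__init__ = patched_init
    return config_cls


class StandInConfig:
    """Placeholder for DeepseekV4Config; stores keyword arguments as attributes."""

    def __init__(self, **kwargs):
        for key, value in kwargs.items():
            setattr(self, key, value)


inject_default_scale_fmt(StandInConfig)
print(StandInConfig().quantization_config)
```

In real use the same wrapper would be applied to the actual DeepseekV4Config class before the model is constructed. The first workaround needs no patching, only the kv_cache_dtype="fp8" keyword argument described in the list above.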
Status: the converter (save_tq3_checkpoint) is fully validated end-to-end, and the published checkpoint is structurally correct: the TQ3 packed expert tensors register on the model, no source .scale sidecars leaked into the output, and the compression matches the math. The blocker is purely the fragility of vLLM's DSV4 model module against any non-canonical checkpoint shape. Once fixes for the three filed bugs land upstream (or get a small batch PR), this checkpoint should serve cleanly with --quantization turboquant.
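The sidecar check can be approximated from the checkpoint's tensor index. This is a hedged sketch under two assumptions not confirmed by this card: that the output is a sharded safetensors checkpoint with the standard Hugging Face model.safetensors.index.json file, and that leaked source sidecars are exactly the tensor names ending in ".scale". The function name find_scale_sidecars is hypothetical.

```python
import json
from pathlib import Path


def find_scale_sidecars(checkpoint_dir, index_name="model.safetensors.index.json"):
    """Return tensor names in the index's weight_map that end with '.scale'."""
    index_path = Path(checkpoint_dir) / index_name
    index = json.loads(index_path.read_text())
    return sorted(name for name in index["weight_map"] if name.endswith(".scale"))
```

An empty list corresponds to the "0 leaked sidecars" result reported above.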
Local workaround scripts for the two shims and the source-file patch we run during setup are in the Gate D eval harness under scripts/gpu/deepseek-v4-flash-eval/: see setup.sh (source-file patch) and eval.py (config shim).
Quality
Pending:
- Wikitext-2 perplexity
- GSM8K-200
- fixed-prompt side-by-side judging