DeepSeek V4 Pro · GGUF

GGUF quantizations of deepseek-ai/DeepSeek-V4-Pro for use with the V4-aware llama.cpp fork at cchuter/llama.cpp @ feat/v4-port-cuda.

📦 Required: V4-aware llama.cpp fork. These quants don't load on upstream ggml-org/llama.cpp — V4 architecture support (compressor decode, hyperconnection, lightning indexer, FP8 KV simulation, NextN heads) lives only in the fork:

git clone -b feat/v4-port-cuda https://github.com/cchuter/llama.cpp

Full build + run instructions in Loading below.

🖥️ Supported backends: Apple Silicon (Metal), NVIDIA CUDA (Ada/Blackwell), and CPU. All 5 V4 custom ops (ggml_dsv4_rope_tail, ggml_dsv4_hc_split_sinkhorn, ggml_dsv4_hc_weighted_sum, ggml_dsv4_hc_expand, ggml_dsv4_fp8_kv_quantize) have Metal kernels AND CUDA kernels in this fork (validated 19/19 on RTX 5090, CUDA 12.8, SM_120 native). The CUDA FP8 path is gated behind __CUDA_ARCH__ >= 890; older NVIDIA hardware (Volta/Turing/Ampere) uses a software-emulated FP8 path that builds cleanly under -DCMAKE_CUDA_ARCHITECTURES=70 but hasn't been runtime-validated yet. CUDA testers wanted — file issues at the fork if you hit problems. V4 Pro's size also means most quants need multi-GPU or CPU+GPU partial offload; see size note below. ROCm / Vulkan / Metal-on-AMD have no V4 kernels and will fail at the first dsv4 op.

📐 V4 Pro is much larger than V4 Flash (61 layers × 384 routed experts; ~1.5 TiB BF16-experts-Q8 staging GGUF vs ~282 GiB Q8 for Flash). Even Q2_K-XL of Pro at 535 GiB exceeds 512 GiB unified RAM on a single Mac Studio — inference works but pages heavily. Practical fit on a single Studio is the smaller K-quants.

Available quants

Quant Size BPW Shards Decode (M3 Ultra) gate-tools Notes
Q8_0 ~1.46 TiB 8.50 30 build-validated only not run Reference. Exceeds 512 GiB unified RAM by ~3× — needs a host with 1.5 TiB RAM or heavy swap.
Q4_K_M-XL ~828 GiB 4.85 21 build-validated only not run K-quant body, V4-specific tensors pinned at Q8_0. Recommended if you have ~1 TiB RAM; otherwise pages from disk.
Q2_K-XL ~498 GiB 2.90 13 ~0.27 t/s prompt eval, ~0.18 t/s generation (CPU mmap, -ngl 0) ✓ pass XL-pinned K-quant. Tested: loads, runs, returns valid tool_calls for the V4 fork's tests/v4-port/tool-call-fixture.json ("What is the weather in Paris?" → get_weather({"city":"Paris"})). Fits CPU mmap path on 512 GiB Studio without OOM; recommended single-Studio variant.
imatrix/dsml.jinja ~5 KiB DSML chat template — pass via --chat-template-file for any quant whose shard 1 lacks the baked template. (All three quants here have it injected.)

-XL suffix means non-expert tensors (output_tensor, token_embd, attention projections, attention compressors, hyper-connection mixers, lightning indexer, NextN heads) are pinned at Q8_0; only the routed and shared experts use the named quant body. Same recipe as the V4 Flash fork's -XL variants.

Why no IQ-class quants in this release. V4 Pro's compressed-attention decode path generates a graph too large to fit Metal's recommendedMaxWorkingSetSize on M3 Ultra (487 GiB) when the model is also on Metal — both -ngl 999 and -ngl 25 partial-offload OOM during the first command buffer. CPU-only llama-imatrix runs at ~0.79 t/s prompt eval, and a single 4096-token chunk would take ~85 minutes; 1000 chunks is ~25 days. --cpu-moe (experts on CPU, rest on Metal) hangs at the load-tensors stage. Without a working imatrix, IQ1_*/IQ2_* quants cannot be built (the converter requires it for output_hc_fn.weight). On a host with ≥1.5 TiB unified RAM (or split-machine inference), the IQ-class ladder should be reachable; this release is the K-quant slice that builds end-to-end on a single 512 GiB Studio.

Loading

# Clone the V4-aware fork
git clone -b feat/v4-port-cuda https://github.com/cchuter/llama.cpp
cd llama.cpp

# Build for Apple Silicon (Metal)
cmake -B build -DGGML_METAL=ON -DGGML_METAL_EMBED_LIBRARY=ON && cmake --build build -j

# OR build for NVIDIA CUDA. V4 Pro almost always needs multi-GPU (Q2_K-XL is 498 GiB).
# Pick your GPU's compute capability:
#   sm_70 V100 | sm_75 T4 | sm_80 A100 | sm_86 RTX 3090/3080
#   sm_89 RTX 4090/6000 Ada/L40 | sm_90 H100/H200 | sm_120 RTX 5090/5080
# FP8 native path needs SM_89+ AND CUDA toolkit >= 11.8; older arches use the
# software-emulated FP8 path automatically. SM_120 native additionally needs
# toolkit >= 12.8 (older toolkits fall back to PTX JIT).
#
# Multi-GPU: pass the SCHED flag to BOTH compiler groups so the macro
# propagates to .cu translation units. CXX-only is silently no-op on the CUDA
# side. V4's dense per-layer inputs exceed the upstream scheduler default of
# 30 at multi-device split boundaries. Cost: ~200 MB extra scheduler memory.
# cmake -B build -DGGML_CUDA=ON -DCMAKE_BUILD_TYPE=Release \
#   -DCMAKE_CUDA_ARCHITECTURES="<your-sm>" \
#   -DCMAKE_CXX_FLAGS=-DGGML_SCHED_MAX_SPLIT_INPUTS=128 \
#   -DCMAKE_CUDA_FLAGS=-DGGML_SCHED_MAX_SPLIT_INPUTS=128 \
#   && cmake --build build -j

# Download a quant that fits your RAM/disk budget
hf download teamblobfish/DeepSeek-V4-Pro-GGUF \
  --include "Q2_K-XL/*" \
  --local-dir ~/models/DeepSeek-V4-Pro-GGUF

# Run server (point at first shard; auto-loads the rest)
./build/bin/llama-server \
  --model ~/models/DeepSeek-V4-Pro-GGUF/Q2_K-XL/DeepSeek-V4-Pro-Q2_K-XL-00001-of-00013.gguf \
  --jinja \
  --reasoning off \
  --ctx-size 65536 \
  --n-gpu-layers 0 \
  --no-repack \
  --temp 1.0 --top-p 1.0 --top-k 0 --min-p 0.0

⚙️ -ngl choice on M3 Ultra (512 GiB): -ngl 0 (CPU mmap) is the only configuration that loads V4 Pro Q2_K-XL/Q4_K_M-XL/Q8_0 cleanly without Metal OOM. Partial Metal offload (-ngl 1..N) hits kIOGPUCommandBufferCallbackErrorOutOfMemory during graph compute — V4's compressor decode path allocates intermediate buffers Metal can't satisfy when most weights are also Metal-resident. Full Metal (-ngl 999) only fits if the quant is below ~480 GiB total. Hosts with multiple GPUs / split-tensor offload across machines should work as expected.

⚙️ -cmoe (CPU MoE) on CUDA hosts — stick with -ub 128. -cmoe overrides MoE weights to CPU but doesn't directly control where the op runs. CUDA's op_offload defaults to true, and the CUDA backend offloads host-weight ops to GPU when batch_size ≥ 32 (see ggml/src/ggml-cuda/ggml-cuda.cu). Compute buffers are sized peak-liveness for n_ubatch-token graphs, so doubling -ub roughly doubles the GPU compute buffer. Reported by @fairydreaming running V4 Pro Q4_K_M on an RTX PRO 6000 Max-Q (96 GB, ~35 GB post-load headroom): -ub 128 fits; -ub 512 OOMs at load. Two options:

  • Recommended: keep -ub 128 for -cmoe runs on V4 Pro — best perf in this configuration.
  • Or pass --op-offload false to keep MoE compute truly on CPU regardless of ubatch — smaller GPU compute buffer, but slower if your CPU memory bandwidth is the bottleneck.

Sampling values match the model card recommendation (temperature=1.0, top_p=1.0); --reasoning off is the cleanest baseline for agent workloads.

Multi-GPU CUDA (work in progress)

⚠️ Status: WIP. Multi-GPU CUDA via --split-mode layer (default) is the recommended config for V4 Pro on any host with multiple NVIDIA GPUs — Pro is too large for any single consumer/workstation card, so multi-GPU is the realistic deployment. Layer-split is validated working on the V4 Flash sibling repo at 19 t/s on 2× RTX 6000 Ada, and an external tester is running V4 Pro across 8× A100 with our merged fix. Tensor-parallel (--split-mode row) is implemented but currently slower than layer split for V4 decode — not recommended yet. Expect quirks; please file issues at the fork.

Recommended config for fastest t/s on multi-GPU CUDA:

# Combined VRAM >= quant size (e.g. 8x A100 80GB = 640 GiB easily holds Q4_K_M-XL @ 828 GiB? — no; see -cmoe variant below)
./build/bin/llama-server \
  --model ~/models/DeepSeek-V4-Pro-GGUF/<quant>/<first-shard>.gguf \
  --jinja --reasoning off \
  --ctx-size 8192 \
  --n-gpu-layers 999 \
  --split-mode layer \
  --flash-attn on \
  --no-repack \
  --temp 1.0 --top-p 1.0 --top-k 0 --min-p 0.0

If the quant doesn't fit your combined VRAM (Q2_K-XL @ 498 GiB needs ≥520 GiB headroom; Q4_K_M-XL @ 828 GiB needs ≥870 GiB), add -cmoe -ub 128 per the next callout — experts move to CPU and you trade decode speed for fit.

Quirks worth knowing

  • --cache-type-k|v q8_0 is silently overridden to f16 on V4. Inherited V4 Flash quirk — V4's K is FP8-quantized at write time, breaking q8_0's per-block stationarity assumption.
  • --no-repack is required for V4 quants in CPU mode on hosts smaller than ~600 GiB RAM. Inherited V4 Flash quirk.
  • graph_max_nodes was bumped in this fork from 524288 → 2097152 to fit V4 Pro's wider compressor decode path. Older V4 builds will GGML_ASSERT on dsv4_build_compressor_decode_projected → ggml_set_rows when loading any Pro quant.
  • convert_hf_to_gguf.py --use-temp-file is required for V4 Pro. Without it, the in-memory tensor buffer exceeds 512 GiB RAM and the converter is killed by Jetsam on macOS.
  • Validation gates: tests/v4-port/run-all-gates.sh in the fork. Per-quant gate-tools runs were skipped on this release because every load is ~10 min on Pro at 512 GiB RAM; users with more RAM should re-run gates locally.

Provenance

  • Source: deepseek-ai/DeepSeek-V4-Pro HF safetensors (FP8 e4m3 weights, FP4 routed experts).
  • bf16-experts-Q8 staging GGUF (not published): built via convert_hf_to_gguf.py --outtype bf16 --deepseek4-expert-outtypes "w1=q8_0,w2=q8_0,w3=q8_0" --use-temp-file --deepseek4-expert-workers 16. Used as the source for Q2_K-XL and Q4_K_M-XL.
  • Q8_0: built via llama-quantize from the bf16-experts-Q8 staging GGUF (the runbook's safetensors → Q8_0 path was avoided for disk reasons; Q8 from BF16 has the same quant-hop count as Q8 from safetensors). No imatrix used (Q8 doesn't benefit).
  • Q4_K_M-XL / Q2_K-XL: produced via llama-quantize with the V4 fork's V4-tensor pin recipe (output_hc=q8_0, attn_compressor_*=q8_0, attn_q_a/b, attn_kv, attn_output_a/b, hc_attn=q8_0, hc_ffn=q8_0, indexer=q8_0, nextn=q8_0). No imatrix — all three K-quants here build cleanly without it (only IQ-class quants strictly require it).
  • Chat template: baked into shard 1 of every quant via gguf-py/gguf/scripts/gguf_new_metadata.py --chat-template "$(cat dsml.jinja)" after split.

License

MIT, matching the upstream DeepSeek V4 Pro license.

Downloads last month
1,859
GGUF
Model size
1.6T params
Architecture
deepseek4
Hardware compatibility
Log In to add your hardware

2-bit

4-bit

8-bit

Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for teamblobfish/DeepSeek-V4-Pro-GGUF

Quantized
(9)
this model