DeepSeek V4 Pro · GGUF
GGUF quantizations of deepseek-ai/DeepSeek-V4-Pro for use with the V4-aware llama.cpp fork at cchuter/llama.cpp @ feat/v4-port-cuda.
📦 Required: V4-aware llama.cpp fork. These quants don't load on upstream ggml-org/llama.cpp. V4 architecture support (compressor decode, hyperconnection, lightning indexer, FP8 KV simulation, NextN heads) lives only in the fork:
git clone -b feat/v4-port-cuda https://github.com/cchuter/llama.cpp
Full build + run instructions are in Loading below.
🖥️ Supported backends: Apple Silicon (Metal), NVIDIA CUDA (Ada/Blackwell), and CPU. All 5 V4 custom ops (ggml_dsv4_rope_tail, ggml_dsv4_hc_split_sinkhorn, ggml_dsv4_hc_weighted_sum, ggml_dsv4_hc_expand, ggml_dsv4_fp8_kv_quantize) have both Metal and CUDA kernels in this fork (validated 19/19 on RTX 5090, CUDA 12.8, SM_120 native). The CUDA FP8 path is gated behind __CUDA_ARCH__ >= 890; older NVIDIA hardware (Volta/Turing/Ampere) uses a software-emulated FP8 path that builds cleanly under -DCMAKE_CUDA_ARCHITECTURES=70 but hasn't been runtime-validated yet. CUDA testers wanted: file issues at the fork if you hit problems. V4 Pro's size also means most quants need multi-GPU or CPU+GPU partial offload; see the size note below. ROCm, Vulkan, and Metal-on-AMD have no V4 kernels and will fail at the first dsv4 op.
📐 V4 Pro is much larger than V4 Flash (61 layers × 384 routed experts; ~1.5 TiB BF16-experts-Q8 staging GGUF vs ~282 GiB Q8 for Flash). Even Q2_K-XL of Pro, at ~498 GiB (~535 GB), nearly fills the 512 GiB of unified RAM on a single Mac Studio: inference works but pages heavily. Practical fit on a single Studio is the smaller K-quants.
Available quants
| Quant | Size | BPW | Shards | Decode (M3 Ultra) | gate-tools | Notes |
|---|---|---|---|---|---|---|
| Q8_0 | ~1.46 TiB | 8.50 | 30 | build-validated only | not run | Reference. Exceeds 512 GiB unified RAM by ~3× — needs a host with 1.5 TiB RAM or heavy swap. |
| Q4_K_M-XL | ~828 GiB | 4.85 | 21 | build-validated only | not run | K-quant body, V4-specific tensors pinned at Q8_0. Recommended if you have ~1 TiB RAM; otherwise pages from disk. |
| Q2_K-XL | ~498 GiB | 2.90 | 13 | ~0.27 t/s prompt eval, ~0.18 t/s generation (CPU mmap, -ngl 0) | ✓ pass | XL-pinned K-quant. Tested: loads, runs, returns valid tool_calls for the V4 fork's tests/v4-port/tool-call-fixture.json ("What is the weather in Paris?" → get_weather({"city":"Paris"})). Fits CPU mmap path on 512 GiB Studio without OOM; recommended single-Studio variant. |
| imatrix/dsml.jinja | ~5 KiB | n/a | n/a | n/a | n/a | DSML chat template: pass via --chat-template-file for any quant whose shard 1 lacks the baked template. (All three quants here have it injected.) |
-XL suffix means non-expert tensors (output_tensor, token_embd, attention projections, attention compressors, hyper-connection mixers, lightning indexer, NextN heads) are pinned at Q8_0; only the routed and shared experts use the named quant body. Same recipe as the V4 Flash fork's -XL variants.
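The sizes in the table are binary units (GiB/TiB), while disk tools and download pages often report decimal GB. The following is a minimal sketch for converting between the two and checking whether a quant's weights alone fit in a given amount of RAM. The function names are illustrative, and the check ignores KV cache, compute buffers, and OS overhead, so a "fits" result is only a lower bound on what the host needs.

```python
GIB = 1024**3  # bytes per GiB (binary)
GB = 1000**3   # bytes per GB (decimal)

# Approximate sizes taken from the quant table, in GiB.
QUANT_SIZES_GIB = {
    "Q8_0": 1.46 * 1024,  # ~1.46 TiB
    "Q4_K_M-XL": 828,
    "Q2_K-XL": 498,
}


def gib_to_gb(gib: float) -> float:
    """Convert binary gibibytes to decimal gigabytes."""
    return gib * GIB / GB


def fits_in_ram(size_gib: float, ram_gib: float) -> bool:
    """True if the weights alone are no larger than the RAM budget."""
    return size_gib <= ram_gib


for name, size_gib in QUANT_SIZES_GIB.items():
    print(
        f"{name}: {size_gib:.0f} GiB = {gib_to_gb(size_gib):.0f} GB, "
        f"weights fit in 512 GiB: {fits_in_ram(size_gib, 512)}"
    )
```

Under these assumptions only Q2_K-XL passes the 512 GiB check, and with roughly 14 GiB to spare, which is consistent with the heavy paging noted above.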
Why no IQ-class quants in this release. V4 Pro's compressed-attention decode path generates a graph too large to fit Metal's recommendedMaxWorkingSetSize on M3 Ultra (487 GiB) when the model is also on Metal: both -ngl 999 and -ngl 25 partial offload OOM during the first command buffer. CPU-only llama-imatrix runs at ~0.79 t/s prompt eval, so a single 4096-token chunk would take ~85 minutes and 1000 chunks would take ~60 days. --cpu-moe (experts on CPU, rest on Metal) hangs at the load-tensors stage. Without a working imatrix, IQ1_* / IQ2_* quants cannot be built (the converter requires it for output_hc_fn.weight). On a host with ≥1.5 TiB unified RAM (or split-machine inference), the IQ-class ladder should be reachable; this release is the K-quant slice that builds end-to-end on a single 512 GiB Studio.
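The imatrix time estimate can be worked through from the quoted throughput. This is a small sketch of that arithmetic; the function name is illustrative, and it assumes the prompt-eval rate stays constant across chunks.

```python
def imatrix_eta(tokens_per_chunk: int, chunks: int, tokens_per_second: float):
    """Return (minutes per chunk, total days) at a constant prompt-eval rate."""
    seconds_per_chunk = tokens_per_chunk / tokens_per_second
    total_days = seconds_per_chunk * chunks / 86_400  # seconds per day
    return seconds_per_chunk / 60, total_days


minutes, days = imatrix_eta(tokens_per_chunk=4096, chunks=1000, tokens_per_second=0.79)
print(f"{minutes:.0f} min per chunk, {days:.0f} days for 1000 chunks")
```

At 0.79 t/s a 4096-token chunk takes about 86 minutes, so 1000 chunks come to about two months of continuous CPU time.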
Loading
# Clone the V4-aware fork
git clone -b feat/v4-port-cuda https://github.com/cchuter/llama.cpp
cd llama.cpp
# Build for Apple Silicon (Metal)
cmake -B build -DGGML_METAL=ON -DGGML_METAL_EMBED_LIBRARY=ON && cmake --build build -j
# OR build for NVIDIA CUDA. V4 Pro almost always needs multi-GPU (Q2_K-XL is 498 GiB).
# Pick your GPU's compute capability:
# sm_70 V100 | sm_75 T4 | sm_80 A100 | sm_86 RTX 3090/3080
# sm_89 RTX 4090/6000 Ada/L40 | sm_90 H100/H200 | sm_120 RTX 5090/5080
# FP8 native path needs SM_89+ AND CUDA toolkit >= 11.8; older arches use the
# software-emulated FP8 path automatically. SM_120 native additionally needs
# toolkit >= 12.8 (older toolkits fall back to PTX JIT).
#
# Multi-GPU: pass the SCHED flag to BOTH compiler groups so the macro
# propagates to .cu translation units. CXX-only is silently no-op on the CUDA
# side. V4's dense per-layer inputs exceed the upstream scheduler default of
# 30 at multi-device split boundaries. Cost: ~200 MB extra scheduler memory.
# cmake -B build -DGGML_CUDA=ON -DCMAKE_BUILD_TYPE=Release \
# -DCMAKE_CUDA_ARCHITECTURES="<your-sm>" \
# -DCMAKE_CXX_FLAGS=-DGGML_SCHED_MAX_SPLIT_INPUTS=128 \
# -DCMAKE_CUDA_FLAGS=-DGGML_SCHED_MAX_SPLIT_INPUTS=128 \
# && cmake --build build -j
# Download a quant that fits your RAM/disk budget
hf download teamblobfish/DeepSeek-V4-Pro-GGUF \
--include "Q2_K-XL/*" \
--local-dir ~/models/DeepSeek-V4-Pro-GGUF
# Run server (point at first shard; auto-loads the rest)
./build/bin/llama-server \
--model ~/models/DeepSeek-V4-Pro-GGUF/Q2_K-XL/DeepSeek-V4-Pro-Q2_K-XL-00001-of-00013.gguf \
--jinja \
--reasoning off \
--ctx-size 65536 \
--n-gpu-layers 0 \
--no-repack \
--temp 1.0 --top-p 1.0 --top-k 0 --min-p 0.0
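Because the server is pointed at the first shard and loads the rest from the same directory, a partial download only surfaces after a long load. The sketch below checks that every shard is present before launching. It assumes the naming pattern seen in this repo (`<name>-00001-of-00013.gguf`); the helper names are illustrative and not part of the fork.

```python
import re
from pathlib import Path

# Matches e.g. DeepSeek-V4-Pro-Q2_K-XL-00001-of-00013.gguf
SHARD_RE = re.compile(r"^(?P<stem>.+)-(?P<index>\d{5})-of-(?P<total>\d{5})\.gguf$")


def expected_shard_names(first_shard_name: str) -> list[str]:
    """List every shard file name implied by the first shard's name."""
    match = SHARD_RE.match(first_shard_name)
    if match is None:
        raise ValueError(f"not a sharded GGUF file name: {first_shard_name}")
    total = int(match["total"])
    return [
        f"{match['stem']}-{i:05d}-of-{total:05d}.gguf" for i in range(1, total + 1)
    ]


def missing_shards(first_shard: Path) -> list[str]:
    """Return the names of shards that are not present next to the first shard."""
    return [
        name
        for name in expected_shard_names(first_shard.name)
        if not (first_shard.parent / name).is_file()
    ]


first = (
    Path.home()
    / "models/DeepSeek-V4-Pro-GGUF/Q2_K-XL/DeepSeek-V4-Pro-Q2_K-XL-00001-of-00013.gguf"
)
print("missing shards:", missing_shards(first) or "none")
```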
⚙️ -ngl choice on M3 Ultra (512 GiB): -ngl 0 (CPU mmap) is the only configuration that loads V4 Pro Q2_K-XL/Q4_K_M-XL/Q8_0 cleanly without Metal OOM. Partial Metal offload (-ngl 1..N) hits kIOGPUCommandBufferCallbackErrorOutOfMemory during graph compute: V4's compressor decode path allocates intermediate buffers Metal can't satisfy when most weights are also Metal-resident. Full Metal (-ngl 999) only fits if the quant is below ~480 GiB total. Hosts with multiple GPUs or split-tensor offload across machines should work as expected.
⚙️ -cmoe (CPU MoE) on CUDA hosts: stick with -ub 128. -cmoe overrides MoE weights to CPU but doesn't directly control where the op runs. CUDA's op_offload defaults to true, and the CUDA backend offloads host-weight ops to GPU when batch_size ≥ 32 (see ggml/src/ggml-cuda/ggml-cuda.cu). Compute buffers are sized by peak liveness for n_ubatch-token graphs, so doubling -ub roughly doubles the GPU compute buffer. Reported by @fairydreaming running V4 Pro Q4_K_M on an RTX PRO 6000 Max-Q (96 GB, ~35 GB post-load headroom): -ub 128 fits; -ub 512 OOMs at load. Two options:
- Recommended: keep -ub 128 for -cmoe runs on V4 Pro; best perf in this configuration.
- Or pass --op-offload false to keep MoE compute truly on CPU regardless of ubatch: smaller GPU compute buffer, but slower if your CPU memory bandwidth is the bottleneck.
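The two rules in this callout (the batch-size threshold for offload, and compute buffers growing roughly linearly with -ub) can be sketched as a small model. This illustrates the behavior described above and is not the fork's code. The 12 GB baseline is a HYPOTHETICAL placeholder: substitute the compute buffer size your own load log reports at -ub 128.

```python
OFFLOAD_MIN_BATCH = 32  # threshold quoted in the callout above


def host_weight_op_runs_on_gpu(batch_size: int, op_offload: bool = True) -> bool:
    """Model of the described rule: host-weight ops move to GPU for large batches."""
    return op_offload and batch_size >= OFFLOAD_MIN_BATCH


def estimate_compute_buffer_gb(measured_gb: float, measured_ub: int, target_ub: int) -> float:
    """Linear extrapolation: doubling -ub roughly doubles the compute buffer."""
    return measured_gb * target_ub / measured_ub


HEADROOM_GB = 35.0          # post-load headroom from the report above
BASELINE_GB_AT_UB128 = 12.0  # HYPOTHETICAL; read the real value from your load log

for ub in (128, 256, 512):
    need = estimate_compute_buffer_gb(BASELINE_GB_AT_UB128, 128, ub)
    print(f"-ub {ub}: ~{need:.0f} GB compute buffer, fits: {need <= HEADROOM_GB}")
```

With that placeholder baseline, -ub 128 fits in 35 GB and -ub 512 does not, which matches the shape of the reported result. The estimate is only a first guess for choosing a value to try.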
Sampling values match the model card recommendation (temperature=1.0, top_p=1.0); --reasoning off is the cleanest baseline for agent workloads.
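Once the server is up, it can be queried over HTTP. The sketch below uses only the Python standard library and assumes the fork keeps upstream llama-server's OpenAI-compatible /v1/chat/completions endpoint and its default port 8080; adjust base_url if you passed --host or --port. The function names are illustrative.

```python
import json
import urllib.request


def build_chat_request(
    prompt: str, base_url: str = "http://localhost:8080/v1"
) -> urllib.request.Request:
    """Build a chat-completions POST using the sampling values recommended above."""
    payload = {
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 1.0,
        "top_p": 1.0,
    }
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(prompt: str) -> str:
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(build_chat_request(prompt)) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]
```

With the server running, `print(ask("What is the capital of France?"))` sends one request. At the CPU-only decode rates in the quant table, expect a long wait, so keep test prompts short.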
Multi-GPU CUDA (work in progress)
⚠️ Status: WIP. Multi-GPU CUDA via --split-mode layer (default) is the recommended config for V4 Pro on any host with multiple NVIDIA GPUs. Pro is too large for any single consumer/workstation card, so multi-GPU is the realistic deployment. Layer-split is validated working on the V4 Flash sibling repo at 19 t/s on 2× RTX 6000 Ada, and an external tester is running V4 Pro across 8× A100 with our merged fix. Tensor-parallel (--split-mode row) is implemented but currently slower than layer split for V4 decode, so it is not recommended yet. Expect quirks; please file issues at the fork.
Recommended config for fastest t/s on multi-GPU CUDA:
# Needs combined VRAM >= quant size plus headroom. Example: 8x A100 80GB = 640 GiB holds Q2_K-XL (498 GiB) but not Q4_K_M-XL (828 GiB); for that case see the -cmoe variant below.
./build/bin/llama-server \
--model ~/models/DeepSeek-V4-Pro-GGUF/<quant>/<first-shard>.gguf \
--jinja --reasoning off \
--ctx-size 8192 \
--n-gpu-layers 999 \
--split-mode layer \
--flash-attn on \
--no-repack \
--temp 1.0 --top-p 1.0 --top-k 0 --min-p 0.0
If the quant doesn't fit your combined VRAM (Q2_K-XL @ 498 GiB needs ≥520 GiB in total; Q4_K_M-XL @ 828 GiB needs ≥870 GiB), add -cmoe -ub 128 per the -cmoe callout in Loading: experts move to CPU and you trade decode speed for fit.
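The fit rule above can be written as a small helper that decides whether to append the -cmoe flags. This is a sketch with illustrative names; the thresholds are the ones quoted in the paragraph, and per-GPU VRAM is entered in GiB as reported by nvidia-smi.

```python
# Minimum combined VRAM (GiB) quoted above for full GPU residency.
REQUIRED_VRAM_GIB = {"Q2_K-XL": 520, "Q4_K_M-XL": 870}


def extra_flags(quant: str, n_gpus: int, vram_gib_per_gpu: float) -> list[str]:
    """Return the extra llama-server flags needed for this quant on this host."""
    combined = n_gpus * vram_gib_per_gpu
    if combined >= REQUIRED_VRAM_GIB[quant]:
        return []  # everything fits on the GPUs
    return ["-cmoe", "-ub", "128"]  # move experts to CPU, keep ubatch small


for quant in REQUIRED_VRAM_GIB:
    flags = extra_flags(quant, n_gpus=8, vram_gib_per_gpu=80)
    print(f"{quant} on 8x 80 GiB: {' '.join(flags) or '(no extra flags)'}")
```

On 8× 80 GiB this yields no extra flags for Q2_K-XL and `-cmoe -ub 128` for Q4_K_M-XL.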
Quirks worth knowing
- --cache-type-k|v q8_0 is silently overridden to f16 on V4. Inherited V4 Flash quirk: V4's K is FP8-quantized at write time, breaking q8_0's per-block stationarity assumption.
- --no-repack is required for V4 quants in CPU mode on hosts smaller than ~600 GiB RAM. Inherited V4 Flash quirk.
- graph_max_nodes was bumped in this fork from 524288 → 2097152 to fit V4 Pro's wider compressor decode path. Older V4 builds will GGML_ASSERT on dsv4_build_compressor_decode_projected → ggml_set_rows when loading any Pro quant.
- convert_hf_to_gguf.py --use-temp-file is required for V4 Pro. Without it, the in-memory tensor buffer exceeds 512 GiB RAM and the converter is killed by Jetsam on macOS.
- Validation gates: tests/v4-port/run-all-gates.sh in the fork. In this release gate-tools was run only for Q2_K-XL; the other quants were skipped because every load takes ~10 min on Pro at 512 GiB RAM. Users with more RAM should re-run the gates locally.
Provenance
- Source: deepseek-ai/DeepSeek-V4-Pro HF safetensors (FP8 e4m3 weights, FP4 routed experts).
- bf16-experts-Q8 staging GGUF (not published): built via convert_hf_to_gguf.py --outtype bf16 --deepseek4-expert-outtypes "w1=q8_0,w2=q8_0,w3=q8_0" --use-temp-file --deepseek4-expert-workers 16. Used as the source for Q2_K-XL and Q4_K_M-XL.
- Q8_0: built via llama-quantize from the bf16-experts-Q8 staging GGUF (the runbook's safetensors → Q8_0 path was avoided for disk reasons; Q8 from BF16 has the same quant-hop count as Q8 from safetensors). No imatrix used (Q8 doesn't benefit).
- Q4_K_M-XL / Q2_K-XL: produced via llama-quantize with the V4 fork's V4-tensor pin recipe (output_hc=q8_0, attn_compressor_*=q8_0, attn_q_a/b, attn_kv, attn_output_a/b, hc_attn=q8_0, hc_ffn=q8_0, indexer=q8_0, nextn=q8_0). No imatrix: all three quants here build cleanly without it (only IQ-class quants strictly require it).
- Chat template: baked into shard 1 of every quant via gguf-py/gguf/scripts/gguf_new_metadata.py --chat-template "$(cat dsml.jinja)" after split.
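The pin recipe amounts to a per-tensor rule: names matching a pinned pattern stay at q8_0 and everything else takes the quant body. The sketch below illustrates that rule only. It is not how llama-quantize implements it, the substring matching is an assumption, and the example tensor names are hypothetical stand-ins for the fork's real GGUF names.

```python
# Name fragments from the pin recipe above; matched as substrings (an assumption).
PINNED_Q8_0 = (
    "output_hc",
    "attn_compressor_",
    "attn_q_a",
    "attn_q_b",
    "attn_kv",
    "attn_output_a",
    "attn_output_b",
    "hc_attn",
    "hc_ffn",
    "indexer",
    "nextn",
)


def quant_type_for(tensor_name: str, body: str) -> str:
    """Return q8_0 for pinned tensors, otherwise the named quant body."""
    if any(fragment in tensor_name for fragment in PINNED_Q8_0):
        return "q8_0"
    return body


# Hypothetical tensor names, for illustration only.
for name in ("blk.3.attn_q_a.weight", "blk.3.hc_ffn.weight", "blk.3.ffn_gate_exps.weight"):
    print(f"{name}: {quant_type_for(name, body='q2_k')}")
```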
License
MIT, matching the upstream DeepSeek V4 Pro license.