DeepSeek V4 Pro · GGUF

GGUF quantizations of deepseek-ai/DeepSeek-V4-Pro for use with the V4-aware llama.cpp fork at cchuter/llama.cpp @ feat/v4-port-cuda.

📦 Required: V4-aware llama.cpp fork. These quants don't load on upstream ggml-org/llama.cpp — V4 architecture support (compressor decode, hyperconnection, lightning indexer, FP8 KV simulation, NextN heads) lives only in the fork:
git clone -b feat/v4-port-cuda https://github.com/cchuter/llama.cpp
Full build and run instructions are in Loading below.

🖥️ Supported backends: Apple Silicon (Metal), NVIDIA CUDA (Ada/Blackwell), and CPU. All 5 V4 custom ops (ggml_dsv4_rope_tail, ggml_dsv4_hc_split_sinkhorn, ggml_dsv4_hc_weighted_sum, ggml_dsv4_hc_expand, ggml_dsv4_fp8_kv_quantize) have Metal kernels AND CUDA kernels in this fork (validated 19/19 on RTX 5090, CUDA 12.8, SM_120 native). The CUDA FP8 path is gated behind __CUDA_ARCH__ >= 890; older NVIDIA hardware (Volta/Turing/Ampere) uses a software-emulated FP8 path that builds cleanly under -DCMAKE_CUDA_ARCHITECTURES=70 but hasn't been runtime-validated yet. CUDA testers wanted — file issues at the fork if you hit problems. V4 Pro's size also means most quants need multi-GPU or CPU+GPU partial offload; see size note below. ROCm / Vulkan / Metal-on-AMD have no V4 kernels and will fail at the first dsv4 op.

📐 V4 Pro is much larger than V4 Flash (61 layers × 384 routed experts; ~1.5 TiB BF16-experts-Q8 staging GGUF vs ~282 GiB Q8 for Flash). Even Q2_K-XL of Pro, at ~498 GiB (about 535 GB), nearly fills the 512 GiB unified RAM of a single Mac Studio: inference works but pages heavily. The practical fit on a single Studio is the smallest K-quant, Q2_K-XL.

Available quants

| Quant | Size | BPW | Shards | Decode (M3 Ultra) | gate-tools | Notes |
|---|---|---|---|---|---|---|
| Q8_0 | ~1.46 TiB | 8.50 | 30 | build-validated only | not run | Reference. Exceeds 512 GiB unified RAM by ~3×; needs a host with 1.5 TiB RAM or heavy swap. |
| Q4_K_M-XL | ~828 GiB | 4.85 | 21 | build-validated only | not run | K-quant body, V4-specific tensors pinned at Q8_0. Recommended if you have ~1 TiB RAM; otherwise pages from disk. |
| Q2_K-XL | ~498 GiB | 2.90 | 13 | ~0.27 t/s prompt eval, ~0.18 t/s generation (CPU mmap, -ngl 0) | ✓ pass | XL-pinned K-quant. Tested: loads, runs, returns valid `tool_calls` for the V4 fork's `tests/v4-port/tool-call-fixture.json` ("What is the weather in Paris?" → `get_weather({"city":"Paris"})`). Fits CPU mmap path on 512 GiB Studio without OOM; recommended single-Studio variant. |
| `imatrix/dsml.jinja` | ~5 KiB | n/a | n/a | n/a | n/a | DSML chat template; pass via `--chat-template-file` for any quant whose shard 1 lacks the baked template. (All three quants here have it injected.) |

The -XL suffix means that non-expert tensors (output_tensor, token_embd, attention projections, attention compressors, hyper-connection mixers, lightning indexer, NextN heads) are pinned at Q8_0; only the routed and shared experts use the named quant body. This is the same recipe as the V4 Flash fork's -XL variants.
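
The sizes in the table are in binary units (GiB/TiB), while RAM and disk are often advertised in decimal GB. The following is a minimal sketch, using only the sizes quoted in the table, that converts between the two and lists which quants stay under a given memory budget. The function names and the example budget are illustrative, not part of the fork, and the check ignores KV cache and compute buffers, so a "fits" result is necessary rather than sufficient.

```python
GIB = 2**30   # bytes per GiB (binary)
GB = 10**9    # bytes per GB (decimal)

# Approximate on-disk sizes from the table above, in GiB.
QUANT_SIZES_GIB = {
    "Q8_0": 1.46 * 1024,   # ~1.46 TiB
    "Q4_K_M-XL": 828.0,
    "Q2_K-XL": 498.0,
}


def gib_to_gb(gib: float) -> float:
    """Convert a size in GiB to decimal GB."""
    return gib * GIB / GB


def quants_within(budget_gib: float) -> list[str]:
    """Return the quants whose file size alone is at or below the budget."""
    return [name for name, size in QUANT_SIZES_GIB.items() if size <= budget_gib]


if __name__ == "__main__":
    for name, size in QUANT_SIZES_GIB.items():
        print(f"{name}: {size:.0f} GiB = {gib_to_gb(size):.0f} GB")
    print("File size within 512 GiB:", quants_within(512))
```

Weights that only just fit still compete with the OS, KV cache, and compute buffers for the same memory, which is why a quant close to the budget can load and still page.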

Why no IQ-class quants in this release. V4 Pro's compressed-attention decode path generates a graph too large to fit Metal's recommendedMaxWorkingSetSize on M3 Ultra (487 GiB) when the model is also on Metal: both -ngl 999 and -ngl 25 partial offload OOM during the first command buffer. CPU-only llama-imatrix runs at ~0.79 t/s prompt eval, so a single 4096-token chunk would take ~85 minutes and 1000 chunks would take roughly 60 days. --cpu-moe (experts on CPU, rest on Metal) hangs at the load-tensors stage. Without a working imatrix, IQ1_*/IQ2_* quants cannot be built (the converter requires it for output_hc_fn.weight). On a host with ≥1.5 TiB unified RAM (or split-machine inference), the IQ-class ladder should be reachable; this release is the K-quant slice that builds end-to-end on a single 512 GiB Studio.

Loading

# Clone the V4-aware fork
git clone -b feat/v4-port-cuda https://github.com/cchuter/llama.cpp
cd llama.cpp

# Build for Apple Silicon (Metal)
cmake -B build -DGGML_METAL=ON -DGGML_METAL_EMBED_LIBRARY=ON && cmake --build build -j

# OR build for NVIDIA CUDA. V4 Pro almost always needs multi-GPU (Q2_K-XL is 498 GiB).
# Pick your GPU's compute capability:
#   sm_70 V100 | sm_75 T4 | sm_80 A100 | sm_86 RTX 3090/3080
#   sm_89 RTX 4090/6000 Ada/L40 | sm_90 H100/H200 | sm_120 RTX 5090/5080
# FP8 native path needs SM_89+ AND CUDA toolkit >= 11.8; older arches use the
# software-emulated FP8 path automatically. SM_120 native additionally needs
# toolkit >= 12.8 (older toolkits fall back to PTX JIT).
#
# Multi-GPU: pass the SCHED flag to BOTH compiler groups so the macro
# propagates to .cu translation units. CXX-only is silently no-op on the CUDA
# side. V4's dense per-layer inputs exceed the upstream scheduler default of
# 30 at multi-device split boundaries. Cost: ~200 MB extra scheduler memory.
# cmake -B build -DGGML_CUDA=ON -DCMAKE_BUILD_TYPE=Release \
#   -DCMAKE_CUDA_ARCHITECTURES="<your-sm>" \
#   -DCMAKE_CXX_FLAGS=-DGGML_SCHED_MAX_SPLIT_INPUTS=128 \
#   -DCMAKE_CUDA_FLAGS=-DGGML_SCHED_MAX_SPLIT_INPUTS=128 \
#   && cmake --build build -j

# Download a quant that fits your RAM/disk budget
hf download teamblobfish/DeepSeek-V4-Pro-GGUF \
  --include "Q2_K-XL/*" \
  --local-dir ~/models/DeepSeek-V4-Pro-GGUF

# Run server (point at first shard; auto-loads the rest)
./build/bin/llama-server \
  --model ~/models/DeepSeek-V4-Pro-GGUF/Q2_K-XL/DeepSeek-V4-Pro-Q2_K-XL-00001-of-00013.gguf \
  --jinja \
  --reasoning off \
  --ctx-size 65536 \
  --n-gpu-layers 0 \
  --no-repack \
  --temp 1.0 --top-p 1.0 --top-k 0 --min-p 0.0
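
Because the server is pointed at the first shard and loads the rest from the same directory, a partial download only shows up at load time. A small pre-flight check can catch that earlier. This is a sketch, not part of the fork: it assumes every shard follows the `<prefix>-NNNNN-of-NNNNN.gguf` pattern seen in the Q2_K-XL filename above, and the directory is the `--local-dir` used in the download step.

```python
from pathlib import Path


def shard_names(prefix: str, total: int) -> list[str]:
    """Expected shard filenames, e.g. <prefix>-00001-of-00013.gguf."""
    return [f"{prefix}-{i:05d}-of-{total:05d}.gguf" for i in range(1, total + 1)]


def missing_shards(directory: Path, prefix: str, total: int) -> list[str]:
    """Return the expected shard names that are not present in the directory."""
    return [name for name in shard_names(prefix, total)
            if not (directory / name).is_file()]


if __name__ == "__main__":
    # Path and shard count match the Q2_K-XL example above (13 shards).
    quant_dir = Path.home() / "models" / "DeepSeek-V4-Pro-GGUF" / "Q2_K-XL"
    missing = missing_shards(quant_dir, "DeepSeek-V4-Pro-Q2_K-XL", 13)
    print("all shards present" if not missing else f"missing: {missing}")
```

The check only confirms that the files exist; it does not verify sizes or checksums.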

⚙️ -ngl choice on M3 Ultra (512 GiB): -ngl 0 (CPU mmap) is the only configuration that loads V4 Pro Q2_K-XL/Q4_K_M-XL/Q8_0 cleanly without Metal OOM. Partial Metal offload (-ngl 1..N) hits kIOGPUCommandBufferCallbackErrorOutOfMemory during graph compute — V4's compressor decode path allocates intermediate buffers Metal can't satisfy when most weights are also Metal-resident. Full Metal (-ngl 999) only fits if the quant is below ~480 GiB total. Hosts with multiple GPUs / split-tensor offload across machines should work as expected.

⚙️ -cmoe (CPU MoE) on CUDA hosts — stick with -ub 128. -cmoe overrides MoE weights to CPU but doesn't directly control where the op runs. CUDA's op_offload defaults to true, and the CUDA backend offloads host-weight ops to GPU when batch_size ≥ 32 (see ggml/src/ggml-cuda/ggml-cuda.cu). Compute buffers are sized peak-liveness for n_ubatch-token graphs, so doubling -ub roughly doubles the GPU compute buffer. Reported by @fairydreaming running V4 Pro Q4_K_M on an RTX PRO 6000 Max-Q (96 GB, ~35 GB post-load headroom): -ub 128 fits; -ub 512 OOMs at load. Two options:

Recommended: keep -ub 128 for -cmoe runs on V4 Pro — best perf in this configuration.

Or pass --op-offload false to keep MoE compute truly on CPU regardless of ubatch — smaller GPU compute buffer, but slower if your CPU memory bandwidth is the bottleneck.
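
The two rules in this callout (host-weight ops are offloaded at batch size ≥ 32, and the compute buffer grows roughly linearly with -ub) can be turned into a rough estimate. The sketch below is a simplification under those stated assumptions. The 10 GiB reading is a hypothetical placeholder, not a measurement from the report, and the ~35 GB headroom figure is treated loosely as GiB. Only -ub 128 (fits) and -ub 512 (OOM) were actually reported.

```python
OFFLOAD_MIN_BATCH = 32  # batch-size threshold quoted in the callout above


def host_weight_op_on_gpu(n_ubatch: int, op_offload: bool = True) -> bool:
    """True if an op whose weights live on the CPU would still run on the GPU."""
    return op_offload and n_ubatch >= OFFLOAD_MIN_BATCH


def scaled_buffer_gib(measured_gib: float, measured_ub: int, target_ub: int) -> float:
    """Linear estimate of the GPU compute buffer at a different -ub value."""
    return measured_gib * target_ub / measured_ub


if __name__ == "__main__":
    measured_gib = 10.0   # HYPOTHETICAL buffer size at -ub 128; use your own log value
    headroom_gib = 35.0   # approximate post-load headroom from the report above
    for ub in (16, 128, 512):
        estimate = scaled_buffer_gib(measured_gib, 128, ub)
        print(f"-ub {ub}: MoE on GPU={host_weight_op_on_gpu(ub)}, "
              f"buffer ~{estimate:.1f} GiB, fits={estimate <= headroom_gib}")
```

With `op_offload=False` the first function always returns False, which corresponds to the `--op-offload false` option: MoE compute stays on the CPU whatever the ubatch size.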

Sampling values match the model card recommendation (temperature=1.0, top_p=1.0); --reasoning off is the cleanest baseline for agent workloads.

Multi-GPU CUDA (work in progress)

⚠️ Status: WIP. Multi-GPU CUDA via --split-mode layer (default) is the recommended config for V4 Pro on any host with multiple NVIDIA GPUs — Pro is too large for any single consumer/workstation card, so multi-GPU is the realistic deployment. Layer-split is validated working on the V4 Flash sibling repo at 19 t/s on 2× RTX 6000 Ada, and an external tester is running V4 Pro across 8× A100 with our merged fix. Tensor-parallel (--split-mode row) is implemented but currently slower than layer split for V4 decode — not recommended yet. Expect quirks; please file issues at the fork.

Recommended config for fastest t/s on multi-GPU CUDA:

# Requires combined VRAM >= quant size. Example: 8x A100 80GB = 640 GiB does NOT hold Q4_K_M-XL @ 828 GiB; use the -cmoe variant below in that case.
./build/bin/llama-server \
  --model ~/models/DeepSeek-V4-Pro-GGUF/<quant>/<first-shard>.gguf \
  --jinja --reasoning off \
  --ctx-size 8192 \
  --n-gpu-layers 999 \
  --split-mode layer \
  --flash-attn on \
  --no-repack \
  --temp 1.0 --top-p 1.0 --top-k 0 --min-p 0.0

If the quant doesn't fit your combined VRAM (Q2_K-XL @ 498 GiB needs ≥520 GiB headroom; Q4_K_M-XL @ 828 GiB needs ≥870 GiB), add -cmoe -ub 128 per the -cmoe callout above: experts move to CPU and you trade decode speed for fit.
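
The rule above can be written as a small helper that assembles the llama-server arguments and appends `-cmoe -ub 128` only when combined VRAM is below the quoted threshold. This is an illustrative sketch: the function and constant names are made up, the thresholds are the two figures from the paragraph above, and the remaining flags are copied from the recommended config.

```python
import shlex

# Minimum combined VRAM (GiB) quoted above for keeping the whole quant on GPU.
REQUIRED_VRAM_GIB = {"Q2_K-XL": 520, "Q4_K_M-XL": 870}

# Flags from the recommended multi-GPU config above.
BASE_ARGS = [
    "--jinja", "--reasoning", "off",
    "--ctx-size", "8192",
    "--n-gpu-layers", "999",
    "--split-mode", "layer",
    "--flash-attn", "on",
    "--no-repack",
    "--temp", "1.0", "--top-p", "1.0", "--top-k", "0", "--min-p", "0.0",
]


def server_args(model_path: str, quant: str, combined_vram_gib: float) -> list[str]:
    """Build llama-server arguments, moving experts to CPU if VRAM is short."""
    args = ["--model", model_path, *BASE_ARGS]
    if combined_vram_gib < REQUIRED_VRAM_GIB[quant]:
        args += ["-cmoe", "-ub", "128"]
    return args


if __name__ == "__main__":
    # Example: 8 GPUs x 80 GiB = 640 GiB combined, Q4_K_M-XL does not fit.
    args = server_args("<first-shard>.gguf", "Q4_K_M-XL", 8 * 80)
    print("./build/bin/llama-server " + shlex.join(args))
```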

Quirks worth knowing

--cache-type-k|v q8_0 is silently overridden to f16 on V4. Inherited V4 Flash quirk — V4's K is FP8-quantized at write time, breaking q8_0's per-block stationarity assumption.
--no-repack is required for V4 quants in CPU mode on hosts smaller than ~600 GiB RAM. Inherited V4 Flash quirk.
graph_max_nodes was bumped in this fork from 524288 → 2097152 to fit V4 Pro's wider compressor decode path. Older V4 builds will GGML_ASSERT on dsv4_build_compressor_decode_projected → ggml_set_rows when loading any Pro quant.
convert_hf_to_gguf.py --use-temp-file is required for V4 Pro. Without it, the in-memory tensor buffer exceeds 512 GiB RAM and the converter is killed by Jetsam on macOS.
Validation gates: tests/v4-port/run-all-gates.sh in the fork. Per-quant gate-tools runs were skipped on this release because every load is ~10 min on Pro at 512 GiB RAM; users with more RAM should re-run gates locally.
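
For a lighter manual check than the full gate script, one tool-call request can be sent to a running llama-server and the returned `tool_calls` inspected. The sketch below is not the fork's gate-tools test. It assumes the server was started with `--jinja` on the default port 8080 and speaks the OpenAI-style chat completions format. The `get_weather` schema is a guess based on the fixture example quoted in the quant table, not a copy of `tool-call-fixture.json`.

```python
import json
import urllib.request

# ASSUMPTION: llama-server is listening on its default port 8080.
URL = "http://localhost:8080/v1/chat/completions"


def build_request(prompt: str) -> dict:
    """OpenAI-style chat request advertising one illustrative tool."""
    return {
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 1.0,
        "top_p": 1.0,
        "tools": [{
            "type": "function",
            "function": {
                "name": "get_weather",  # illustrative schema, see lead-in
                "description": "Get the current weather for a city.",
                "parameters": {
                    "type": "object",
                    "properties": {"city": {"type": "string"}},
                    "required": ["city"],
                },
            },
        }],
    }


def extract_tool_calls(response: dict) -> list[tuple[str, dict]]:
    """Return (function name, parsed arguments) for each tool call in the reply."""
    message = response["choices"][0]["message"]
    calls = message.get("tool_calls") or []
    return [(c["function"]["name"], json.loads(c["function"]["arguments"]))
            for c in calls]


def run(prompt: str) -> list[tuple[str, dict]]:
    """Send the request to the local server and return the parsed tool calls."""
    request = urllib.request.Request(
        URL,
        data=json.dumps(build_request(prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as resp:
        return extract_tool_calls(json.load(resp))


if __name__ == "__main__":
    print(run("What is the weather in Paris?"))
```

An empty list means the model answered in plain text instead of calling the tool. At the decode speeds listed for CPU mmap, expect a single request to take a long time.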

Provenance

Source: deepseek-ai/DeepSeek-V4-Pro HF safetensors (FP8 e4m3 weights, FP4 routed experts).
bf16-experts-Q8 staging GGUF (not published): built via convert_hf_to_gguf.py --outtype bf16 --deepseek4-expert-outtypes "w1=q8_0,w2=q8_0,w3=q8_0" --use-temp-file --deepseek4-expert-workers 16. Used as the source for Q2_K-XL and Q4_K_M-XL.
Q8_0: built via llama-quantize from the bf16-experts-Q8 staging GGUF (the runbook's safetensors → Q8_0 path was avoided for disk reasons; Q8 from BF16 has the same quant-hop count as Q8 from safetensors). No imatrix used (Q8 doesn't benefit).
Q4_K_M-XL / Q2_K-XL: produced via llama-quantize with the V4 fork's V4-tensor pin recipe (output_hc=q8_0, attn_compressor_*=q8_0, attn_q_a/b, attn_kv, attn_output_a/b, hc_attn=q8_0, hc_ffn=q8_0, indexer=q8_0, nextn=q8_0). No imatrix: all three quants here build cleanly without it (only IQ-class quants strictly require it).
Chat template: baked into shard 1 of every quant via gguf-py/gguf/scripts/gguf_new_metadata.py --chat-template "$(cat dsml.jinja)" after split.

License

MIT, matching the upstream DeepSeek V4 Pro license.

Model details

Format: GGUF
Model size: 1.6T params
Architecture: deepseek4
Available bit widths: 2-bit, 4-bit, 8-bit
Base model: deepseek-ai/DeepSeek-V4-Pro