How to use from the llama-cpp-python library
# !pip install llama-cpp-python

from llama_cpp import Llama

llm = Llama.from_pretrained(
	repo_id="FINAL-Bench/Darwin-28B-Coder-GGUF",
	filename="Darwin-28B-Coder-Q4_K_M.gguf",  # or any other file listed under Files
)
llm.create_chat_completion(
	messages = [
		{
			"role": "user",
			"content": "What is the capital of France?"
		}
	]
)

Darwin-28B-Coder: GGUF (MTP-enabled)

GGUF builds of FINAL-Bench/Darwin-28B-Coder with the native Multi-Token Prediction (MTP) head preserved, for self-speculative decoding in llama.cpp.

Requested in the base model discussion.

Files

| File | Quant | Size | Notes |
|---|---|---|---|
| Darwin-28B-Coder-Q4_K_M.gguf | Q4_K_M | 16.8 GB | recommended for most GPUs |
| Darwin-28B-Coder-Q8_0.gguf | Q8_0 | 29.0 GB | near-lossless |
| Darwin-28B-Coder-F16.gguf | F16 | 54.7 GB | full precision |
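As a rough aid for choosing among the files above, the sketch below picks the largest file that fits a given memory budget. The file sizes come from the table; the `pick_file` helper and the 2 GB headroom for KV cache and runtime buffers are illustrative assumptions, not measured requirements.

```python
# File sizes in GB, taken from the Files table.
FILES = {
    "Darwin-28B-Coder-Q4_K_M.gguf": 16.8,
    "Darwin-28B-Coder-Q8_0.gguf": 29.0,
    "Darwin-28B-Coder-F16.gguf": 54.7,
}


def pick_file(budget_gb, headroom_gb=2.0):
    """Return the largest file that fits in budget_gb, or None if none fits.

    headroom_gb is a hypothetical allowance for KV cache and buffers;
    real overhead depends on context length and backend.
    """
    fitting = [
        (size, name)
        for name, size in FILES.items()
        if size + headroom_gb <= budget_gb
    ]
    if not fitting:
        return None
    # max() on (size, name) tuples selects the largest file that fits.
    return max(fitting)[1]
```

For example, `pick_file(24)` selects the Q4_K_M file, while a budget below 18.8 GB returns `None`, in which case partial offload (a lower `-ngl` value) would be needed.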

All files include the MTP layer, as verified in the metadata: general.architecture = qwen35, qwen35.nextn_predict_layers = 1, tensors blk.64.nextn.*.
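To repeat this check on a downloaded file, a minimal sketch follows. It assumes the `gguf` Python package (from llama.cpp's gguf-py) is installed and that its `GGUFReader` exposes `fields` and `tensors` as in current releases; the helper names `has_mtp_head` and `read_gguf` are hypothetical.

```python
def has_mtp_head(metadata, tensor_names):
    """True if the metadata declares nextn layers and nextn tensors exist."""
    arch = metadata.get("general.architecture")
    n_nextn = int(metadata.get(f"{arch}.nextn_predict_layers", 0))
    has_tensors = any(".nextn." in name for name in tensor_names)
    return n_nextn > 0 and has_tensors


def read_gguf(path):
    """Read the two relevant metadata keys and all tensor names from a GGUF file."""
    from gguf import GGUFReader  # assumption: installed via `pip install gguf`

    reader = GGUFReader(path)
    # String fields store their bytes in the part indexed by field.data[0].
    arch_field = reader.fields["general.architecture"]
    arch = bytes(arch_field.parts[arch_field.data[0]]).decode("utf-8")
    metadata = {"general.architecture": arch}

    key = f"{arch}.nextn_predict_layers"
    nextn_field = reader.fields.get(key)
    if nextn_field is not None:
        # Scalar fields hold a one-element array in the indexed part.
        metadata[key] = int(nextn_field.parts[nextn_field.data[0]][0])

    tensor_names = [tensor.name for tensor in reader.tensors]
    return metadata, tensor_names
```

Calling `has_mtp_head(*read_gguf("Darwin-28B-Coder-Q4_K_M.gguf"))` should return `True` for the files in this repository if the metadata is as described above.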

Multi-Token Prediction (MTP)

This model ships with a trained MTP head (1 prediction layer). With a recent llama.cpp build that includes MTP support (merged in PR #22673), the nextn layer is used for self-speculative decoding, typically giving ~1.5–2× faster generation with identical output (the main model verifies every drafted token, so quality is unchanged).
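The draft-and-verify idea can be illustrated with a toy sketch. This is not llama.cpp's implementation: `main_next` and `draft_next` are hypothetical stand-ins for the main model and the MTP head, tokens are plain integers, and decoding is greedy. It shows why the output matches ordinary decoding while needing fewer main-model passes.

```python
def greedy_decode(next_token, prompt, n_new):
    """Reference decoding: one main-model call per generated token."""
    out = list(prompt)
    for _ in range(n_new):
        out.append(next_token(out))
    return out[len(prompt):]


def speculative_decode(main_next, draft_next, prompt, n_new, k=1):
    """Draft k tokens cheaply, then let the main model verify them.

    Returns (generated_tokens, main_passes). Each loop iteration stands for
    one batched main-model pass in a real implementation.
    """
    out = list(prompt)
    target_len = len(prompt) + n_new
    main_passes = 0
    while len(out) < target_len:
        # 1) Draft k tokens with the cheap head.
        draft = []
        for _ in range(k):
            draft.append(draft_next(out + draft))
        # 2) Verify: keep drafted tokens only while they match the main model.
        main_passes += 1
        accepted = []
        for tok in draft:
            expected = main_next(out + accepted)
            if tok != expected:
                accepted.append(expected)  # replace the first mismatch
                break
            accepted.append(tok)
        else:
            # All drafts accepted: the same pass also yields one extra token.
            accepted.append(main_next(out + accepted))
        out.extend(accepted)
    return out[len(prompt):target_len], main_passes


def main_next(ctx):
    """Toy deterministic 'main model'."""
    return (sum(ctx) * 31 + len(ctx)) % 50


def draft_next(ctx):
    """Toy 'draft head': agrees with the main model except at every third position."""
    tok = main_next(ctx)
    return (tok + 1) % 50 if len(ctx) % 3 == 0 else tok
```

Because every emitted token is either confirmed or produced by `main_next`, `speculative_decode(main_next, draft_next, [1, 2], 20)` returns the same tokens as `greedy_decode(main_next, [1, 2], 20)`, using fewer than 20 main-model passes. The actual speedup in practice depends on how often drafts are accepted.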

A standard (non-MTP) GGUF does not contain the prediction head, so you need these MTP-enabled files to benefit from the speedup.

Usage

# 1) Build a recent llama.cpp (MTP support is in mainline since PR #22673)
git clone https://github.com/ggml-org/llama.cpp && cd llama.cpp
cmake -B build -DGGML_CUDA=ON && cmake --build build -j --config Release

# 2) Run: the nextn (MTP) layer enables self-speculative decoding
./build/bin/llama-cli \
  -m Darwin-28B-Coder-Q4_K_M.gguf \
  -ngl 99 -c 8192 \
  -p "Write a quicksort in Python."

For the exact MTP/speculative flags and the latest behaviour, see the llama.cpp MTP documentation / PR #22673. Works with llama-cli and llama-server.
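When the model is served with llama-server, it can be queried over the server's OpenAI-compatible chat endpoint. The sketch below assumes a server started separately (for example with `llama-server -m Darwin-28B-Coder-Q4_K_M.gguf`) and listening on the default 127.0.0.1:8080; the helper names and the `max_tokens` value are illustrative, and no MTP-specific flags are shown.

```python
import json
import urllib.request

# Assumption: llama-server is running locally on its default host and port.
SERVER_URL = "http://127.0.0.1:8080/v1/chat/completions"


def build_chat_request(prompt, max_tokens=256):
    """Build a POST request for the OpenAI-compatible chat endpoint."""
    payload = {
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        SERVER_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(prompt):
    """Send the prompt to the running server and return the reply text."""
    with urllib.request.urlopen(build_chat_request(prompt)) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

With the server running, `print(ask("Write a quicksort in Python."))` would print the model's reply.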

Model spec (public)

Architecture: qwen35 (hybrid attention)
Layers: 64 + 1 MTP
Hidden size: 5120
Attention heads: 24 (KV 4)
Context length: 262,144
Vocab: 248,320
Precision (source): bfloat16

License & attribution

License and usage follow the base model FINAL-Bench/Darwin-28B-Coder. These are GGUF conversions only; refer to the base model card for model details, intended use, and limitations.

GGUF conversion + quantization by the FINAL-Bench team using llama.cpp/convert_hf_to_gguf.py.

GGUF model size: 27B params. Architecture: qwen35.