llama.cpp TurboQuant + TriAttention — Windows CUDA 13 Binaries

Pre-built Windows x64 Release binaries for the atomicmilkshake/llama-cpp-turboquant fork.

This builds adds TurboQuant (custom quantization) and TriAttention (GPU-accelerated KV cache pruning based on arXiv 2604.04921) on top of llama.cpp.

Download

llama-turboquant-triattention-win-cu13-x64.zip (~179 MB)

Requirements

Windows 10/11 x64
NVIDIA GPU (Turing+, GTX 1600 / RTX 2000 series or newer)
CUDA 13.x runtime — install from developer.nvidia.com/cuda-downloads (the cublasLt64_13.dll is NOT included in the zip due to its 432 MB size)

Usage

llama-server.exe -m YourModel.gguf -c 32768 -ngl 99 --port 8080 ^ --triattention-stats model.triattention ^ --triattention-budget 4096 ^ --triattention-window 256 ^ --triattention-log

TriAttention Performance

Tested on Qwen3-8B Q4_K_M, RTX 3080, -c 512, udget=256:

Mode	Prune time	Generation
No pruning	—	17.5 tok/s
CPU scoring	~5900 ms/event	17.5 tok/s
GPU scoring	~4-9 ms/event	75.0 tok/s

~1000x speedup on pruning events; 4.3x overall throughput improvement.

TriAttention Flags

Flag	Description	Default
--triattention-stats	Calibration file (required to enable)	—
--triattention-budget	Max KV tokens to retain	512
--triattention-window	Recent-token protection window	64
--triattention-trigger	slack\|interval\|ill	slack
--triattention-log	Log each prune event	off
--triattention-no-protect-prefill	Allow evicting prompt tokens	off

Source

github.com/atomicmilkshake/llama-cpp-turboquant — branch eature/triattention

Downloads last month: -; Downloads are not tracked for this model. How to track

Inference Providers NEW

This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Paper for atomicmilkshake/llama-cpp-turboquant-binaries

TriAttention: Efficient Long Reasoning with Trigonometric KV Compression

Paper • 2604.04921 • Published 4 days ago • 87