llama.cpp TurboQuant + TriAttention: Windows CUDA 13 Binaries

Pre-built Windows x64 Release binaries for the atomicmilkshake/llama-cpp-turboquant fork.

This build adds TurboQuant (custom quantization) and TriAttention (GPU-accelerated KV cache pruning based on arXiv 2604.04921) on top of llama.cpp.

Download

llama-turboquant-triattention-win-cu13-x64.zip (~179 MB)

Requirements

  • Windows 10/11 x64
  • NVIDIA GPU (Turing+, GTX 1600 / RTX 2000 series or newer)
  • CUDA 13.x runtime: install from developer.nvidia.com/cuda-downloads (cublasLt64_13.dll is NOT included in the zip due to its 432 MB size)
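
Because the DLL is not bundled, it can help to confirm it is reachable before launching the server. The following is a minimal sketch, not part of the release: the helper name find_on_path is hypothetical, and it only inspects PATH directories, whereas Windows also searches the executable's own folder and system directories.

```python
import os
from pathlib import Path
from typing import Optional


def find_on_path(filename: str, path_env: Optional[str] = None) -> Optional[Path]:
    """Return the full path of the first PATH entry that contains `filename`, else None."""
    if path_env is None:
        path_env = os.environ.get("PATH", "")
    for entry in path_env.split(os.pathsep):
        if not entry:
            continue  # skip empty entries such as a trailing separator
        candidate = Path(entry.strip('"')) / filename
        if candidate.is_file():
            return candidate
    return None


if __name__ == "__main__":
    dll = find_on_path("cublasLt64_13.dll")
    if dll is None:
        print("cublasLt64_13.dll not found on PATH; install the CUDA 13.x runtime")
    else:
        print(f"found {dll}")
```

If the DLL is reported missing even though CUDA 13.x is installed, the CUDA bin directory is probably not on PATH.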

Usage

llama-server.exe -m YourModel.gguf -c 32768 -ngl 99 --port 8080 ^
  --triattention-stats model.triattention ^
  --triattention-budget 4096 ^
  --triattention-window 256 ^
  --triattention-log
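
When the server is started from a script rather than a console, the same command can be assembled as an argument list. This is a sketch under stated assumptions: the function build_server_args and its parameter names are illustrative only, and it uses just the flags shown in the command above.

```python
from typing import List


def build_server_args(model: str, stats: str, ctx: int = 32768,
                      budget: int = 4096, window: int = 256,
                      port: int = 8080, log: bool = True) -> List[str]:
    """Assemble the llama-server.exe command line from the usage example."""
    args = [
        "llama-server.exe",
        "-m", model,
        "-c", str(ctx),
        "-ngl", "99",
        "--port", str(port),
        "--triattention-stats", stats,      # calibration file; enables TriAttention
        "--triattention-budget", str(budget),
        "--triattention-window", str(window),
    ]
    if log:
        args.append("--triattention-log")   # log each prune event
    return args


if __name__ == "__main__":
    cmd = build_server_args("YourModel.gguf", "model.triattention")
    print(" ".join(cmd))
    # To launch for real (Windows, binaries in the current directory):
    #   import subprocess; subprocess.run(cmd, check=True)
```

Passing the arguments as a list avoids the ^ line continuations and any quoting issues with model paths that contain spaces.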

TriAttention Performance

Tested on Qwen3-8B Q4_K_M, RTX 3080, -c 512, budget=256:

Mode         | Prune time      | Generation
No pruning   | n/a             | 17.5 tok/s
CPU scoring  | ~5900 ms/event  | 17.5 tok/s
GPU scoring  | ~4-9 ms/event   | 75.0 tok/s

GPU scoring cuts prune time from ~5900 ms to ~4-9 ms per event (roughly a 1000x speedup) and raises generation throughput from 17.5 to 75.0 tok/s (about 4.3x).
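
The headline ratios follow directly from the measurements listed above. Spelled out as arithmetic (the variable names are illustrative only):

```python
# Figures taken from the performance table above.
cpu_ms = 5900.0                      # CPU scoring, ms per prune event
gpu_ms_low, gpu_ms_high = 4.0, 9.0   # GPU scoring range, ms per prune event
base_tps, gpu_tps = 17.5, 75.0       # generation speed, tok/s

prune_speedup_min = cpu_ms / gpu_ms_high   # slowest GPU prune event
prune_speedup_max = cpu_ms / gpu_ms_low    # fastest GPU prune event
throughput_gain = gpu_tps / base_tps

print(f"prune speedup: {prune_speedup_min:.0f}x to {prune_speedup_max:.0f}x")
print(f"throughput gain: {throughput_gain:.2f}x")
# prints:
#   prune speedup: 656x to 1475x
#   throughput gain: 4.29x
```

The "~1000x" figure is therefore an order-of-magnitude summary of a range that depends on how long each GPU prune event takes.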

TriAttention Flags

Flag                              | Description                            | Default
--triattention-stats              | Calibration file (required to enable)  | (none)
--triattention-budget             | Max KV tokens to retain                | 512
--triattention-window             | Recent-token protection window         | 64
--triattention-trigger            | Trigger mode: slack|interval|fill      | slack
--triattention-log                | Log each prune event                   | off
--triattention-no-protect-prefill | Allow evicting prompt tokens           | off
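
To make the interaction of budget, window, and prefill protection concrete, here is a toy model of a single prune event. It is a sketch built on assumptions, not the fork's implementation: the function select_kept is hypothetical, the per-token scores are made up (the real scores come from GPU scoring with the calibration file), and whether protected tokens count against the budget is a guess.

```python
from typing import List, Sequence


def select_kept(scores: Sequence[float], budget: int, window: int,
                n_prompt: int = 0, protect_prefill: bool = True) -> List[int]:
    """Pick which KV positions survive one prune event (toy model)."""
    n = len(scores)
    if n <= budget:
        return list(range(n))              # under budget: nothing to evict

    # The most recent `window` tokens are always kept.
    protected = set(range(max(0, n - window), n))
    # By default prompt (prefill) tokens are kept too; the
    # --triattention-no-protect-prefill flag corresponds to protect_prefill=False.
    if protect_prefill:
        protected |= set(range(min(n_prompt, n)))

    # Assumption: protected tokens use up budget slots; the rest go to the
    # highest-scoring unprotected tokens. If the protected set alone exceeds
    # the budget, this toy keeps all of it anyway.
    free_slots = max(0, budget - len(protected))
    candidates = [i for i in range(n) if i not in protected]
    candidates.sort(key=lambda i: scores[i], reverse=True)
    return sorted(protected | set(candidates[:free_slots]))


if __name__ == "__main__":
    scores = [0.1, 0.9, 0.2, 0.8, 0.3, 0.4]
    print(select_kept(scores, budget=4, window=2, n_prompt=1))
    print(select_kept(scores, budget=4, window=2, n_prompt=1, protect_prefill=False))
    # prints:
    #   [0, 1, 4, 5]
    #   [1, 3, 4, 5]
```

In the second call the prompt token at position 0 loses its protection and is evicted in favor of the higher-scoring token at position 3.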

Source

github.com/atomicmilkshake/llama-cpp-turboquant, branch feature/triattention
