TriAttention: Efficient Long Reasoning with Trigonometric KV Compression
Pre-built Windows x64 Release binaries for the atomicmilkshake/llama-cpp-turboquant fork.
This build adds TurboQuant (custom quantization) and TriAttention (GPU-accelerated KV-cache pruning based on arXiv 2604.04921) on top of llama.cpp.
llama-turboquant-triattention-win-cu13-x64.zip (~179 MB)
```
llama-server.exe -m YourModel.gguf -c 32768 -ngl 99 --port 8080 ^
  --triattention-stats model.triattention ^
  --triattention-budget 4096 ^
  --triattention-window 256 ^
  --triattention-log
```
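Once the server is up, it exposes llama.cpp's standard OpenAI-compatible HTTP API, so a quick smoke test (assuming the default `--port 8080` from the command above) looks like:

```shell
# Send a minimal chat request to the local llama-server instance.
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "Say hello in one word."}]}'
```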
Tested on Qwen3-8B Q4_K_M on an RTX 3080 with `-c 512` and `--triattention-budget 256`:
| Mode | Prune time | Generation |
|---|---|---|
| No pruning | n/a | 17.5 tok/s |
| CPU scoring | ~5900 ms/event | 17.5 tok/s |
| GPU scoring | ~4-9 ms/event | 75.0 tok/s |
~1000x speedup on pruning events; 4.3x overall throughput improvement.
| Flag | Description | Default |
|---|---|---|
| `--triattention-stats` | Calibration file (required to enable) | n/a |
| `--triattention-budget` | Max KV tokens to retain | 512 |
| `--triattention-window` | Recent-token protection window | 64 |
| `--triattention-trigger` | Trigger policy: `slack` \| `interval` \| `ill` | `slack` |
| `--triattention-log` | Log each prune event | off |
| `--triattention-no-protect-prefill` | Allow evicting prompt tokens | off |
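The budget and window flags imply an eviction policy along these lines: the most recent `window` tokens are never evicted, and older tokens are kept by importance score until the `budget` is met. This is a hypothetical sketch of that policy, not the fork's actual code; the scoring itself comes from the calibration file and is not modeled here.

```python
def prune_kv(scores, budget, window):
    """Return sorted indices of KV-cache tokens to keep.

    scores: per-token importance scores (higher = more important),
            an assumed stand-in for TriAttention's calibrated scoring.
    budget: max tokens to retain (cf. --triattention-budget).
    window: recent tokens protected from eviction (cf. --triattention-window).
    """
    n = len(scores)
    protected = set(range(max(0, n - window), n))  # recent-token window
    if n <= budget:
        return sorted(range(n))                    # under budget: keep everything
    # Rank the evictable (older) tokens by importance, descending.
    candidates = [i for i in range(n) if i not in protected]
    candidates.sort(key=lambda i: scores[i], reverse=True)
    keep_extra = max(0, budget - len(protected))
    kept = protected | set(candidates[:keep_extra])
    return sorted(kept)

# Example: 10 cached tokens, budget 6, protect the last 3.
print(prune_kv([0.9, 0.1, 0.8, 0.2, 0.7, 0.3, 0.6, 0.4, 0.5, 0.05], 6, 3))
# -> [0, 2, 4, 7, 8, 9]
```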
github.com/atomicmilkshake/llama-cpp-turboquant, branch `feature/triattention`