arxiv:2606.04511

SparDA: Sparse Decoupled Attention for Efficient Long-Context LLM Inference

Published on Jun 3

Submitted by Yaosheng Fu on Jun 11

NVIDIA


Authors:

Yaosheng Fu ,

Abstract

SparDA is a decoupled sparse attention architecture that improves long-context LLM inference by reducing KV cache bottlenecks and attention complexity through a Forecast projection for lookahead selection.

Generated by Qwen/Qwen2.5-Coder-32B-Instruct

Sparse attention reduces compute and memory bandwidth for long-context LLM inference. However, two key challenges remain: (1) KV cache capacity still grows with sequence length, and offloading to CPU memory introduces a PCIe transfer bottleneck; (2) the sparse selection step itself retains O(T^2) complexity and can dominate attention cost at long contexts. We propose SparDA, a decoupled sparse attention architecture that introduces a fourth per-layer projection, the Forecast, alongside Query, Key, and Value. The Forecast predicts the KV blocks needed by the next layer, enabling lookahead selection that overlaps CPU-to-GPU prefetch with current-layer execution. Because Forecast is decoupled from the attention query, our GQA implementation uses one Forecast head per GQA group, reducing selection overhead versus the original multi-head selector. SparDA adds <0.5% parameters and trains only the Forecast projections by matching the original selector's attention distribution. On two sparse-pretrained 8B models, SparDA matches or slightly improves accuracy and delivers up to 1.25× prefill speedup and 1.7× decode speedup over the sparse-attention offload baseline. By enabling larger feasible batch sizes on a single GPU, SparDA further reaches up to 5.3× higher decode throughput than the non-offload sparse baseline. Our source code is available at https://github.com/NVlabs/SparDA.
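To make the lookahead selection concrete, here is a minimal pure-Python sketch of the idea described in the abstract. It is not taken from the SparDA code base. The names `W_forecast`, `next_layer_block_keys`, `select_blocks`, and the use of one summary key per KV block are all assumptions made for illustration, and the KL-divergence loss is only one plausible way to "match the original selector's attention distribution"; the abstract does not state the exact objective. A single forecast vector here stands in for the one Forecast head of a GQA group.

```python
import math

def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def block_scores(forecast, block_keys):
    """Dot-product score of the forecast vector against each block summary key."""
    return [sum(f * k for f, k in zip(forecast, key)) for key in block_keys]

def select_blocks(hidden, W_forecast, next_layer_block_keys, top_k):
    """Choose top_k KV blocks of the NEXT layer from the CURRENT layer's hidden state.

    Raw scores are ranked directly; softmax is monotonic, so it would
    produce the same ranking and is skipped in this sketch.
    """
    forecast = matvec(W_forecast, hidden)  # hypothetical Forecast projection
    scores = block_scores(forecast, next_layer_block_keys)
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:top_k]), scores

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Toy example: 2-dimensional hidden state, 4 KV blocks in the next layer.
hidden = [1.0, 0.0]
W_forecast = [[1.0, 0.0], [0.0, 1.0]]  # identity, purely for illustration
next_layer_block_keys = [[0.1, 0.9], [0.8, 0.2], [0.5, 0.5], [-0.3, 1.0]]

selected, student_scores = select_blocks(hidden, W_forecast, next_layer_block_keys, top_k=2)

# Assumed training signal: the original selector's scores act as the teacher.
teacher_scores = [0.0, 1.0, 0.6, -0.5]  # made-up numbers
loss = kl_divergence(softmax(teacher_scores), softmax(student_scores))
```

In this toy setup `selected` holds the indices of the two highest-scoring blocks, and `loss` is the quantity a training loop would minimize with respect to `W_forecast` only, leaving the base model frozen.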


Community

fuys06 (paper author, paper submitter), about 5 hours ago

SparDA adds a fourth per-layer projection, the Forecast, that predicts which KV blocks the next layer will select. Running selection one layer ahead lets it both shrink the indexer (one head per GQA group, no softmax) and overlap CPU-to-GPU KV prefetch with compute. It's a lightweight add-on trained on top of an existing sparse backbone without retraining the base model. While we have only evaluated SparDA on 8B models, we hope it can shed light on inference optimization for larger frontier models with sparse attention capability, including DeepSeek-V4, GLM-5.1, and MiniMax M3.
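The overlap of prefetch and compute mentioned above can be sketched with a single background worker. This is an illustrative scheduling pattern only, not the authors' implementation: `fetch_blocks` is a stand-in for a CPU-to-GPU copy, `forecast_fn` and `compute_fn` are hypothetical callbacks, and the sketch assumes the first layer's block list is supplied by the caller.

```python
from concurrent.futures import ThreadPoolExecutor

def fetch_blocks(cpu_cache, layer, block_ids):
    """Stand-in for a CPU-to-GPU copy: return the requested blocks of one layer."""
    return {b: cpu_cache[layer][b] for b in block_ids}

def run_layers(cpu_cache, num_layers, forecast_fn, compute_fn, first_blocks):
    """Run layers in order, prefetching layer i+1's blocks while layer i computes."""
    outputs = []
    with ThreadPoolExecutor(max_workers=1) as pool:
        pending = pool.submit(fetch_blocks, cpu_cache, 0, first_blocks)
        for layer in range(num_layers):
            blocks = pending.result()  # wait for this layer's KV blocks
            if layer + 1 < num_layers:
                next_ids = forecast_fn(layer)  # lookahead selection for layer + 1
                pending = pool.submit(fetch_blocks, cpu_cache, layer + 1, next_ids)
            # Attention for the current layer runs while the prefetch is in flight.
            outputs.append(compute_fn(layer, blocks))
    return outputs

# Toy example: 3 layers, 4 KV blocks per layer held in "CPU memory".
cpu_cache = [{b: f"L{l}B{b}" for b in range(4)} for l in range(3)]
used = run_layers(
    cpu_cache,
    num_layers=3,
    forecast_fn=lambda layer: [layer + 1, layer + 2],  # made-up selection rule
    compute_fn=lambda layer, blocks: sorted(blocks),   # record which blocks were used
    first_blocks=[0, 1],
)
```

The point of the pattern is ordering: the request for the next layer's blocks is issued before the current layer's attention starts, so the transfer latency can be hidden behind compute instead of stalling each layer.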
