arxiv:2604.22583

Adaptive Head Budgeting for Efficient Multi-Head Attention

Published on Jun 4

Abstract

BudgetFormer dynamically allocates attention heads in Transformers based on input complexity, reducing computational overhead while maintaining or improving performance on text classification tasks.

Multi-head attention enables Transformers to capture diverse representations, but all attention heads are typically activated for every input, regardless of task complexity. For coarse-grained tasks such as text classification, where relevant information is often global, this fixed allocation can introduce unnecessary computation. We propose BudgetFormer, a Transformer architecture that dynamically allocates attention heads on a per-input basis. The model learns both a head budget and a relevance distribution to select the most informative heads. To support effective head selection, we introduce a training strategy that balances exploration and exploitation. Experiments on text classification tasks show that BudgetFormer reduces FLOPs and memory usage while matching or surpassing the performance of standard multi-head attention. These results highlight adaptive head allocation as an effective approach to improving Transformer efficiency and performance.
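The abstract does not spell out how the head budget and relevance distribution are computed, so the following is only a minimal sketch of per-input head selection under stated assumptions. It assumes a pooled feature vector per input, a linear budget predictor squashed by a sigmoid, a softmax relevance distribution over heads, and a simple probabilistic switch between sampling (exploration) and top-k selection (exploitation). The names select_heads, pooled, budget_weights, relevance_weights, and explore_prob are all hypothetical and do not come from the paper.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def select_heads(pooled, budget_weights, relevance_weights,
                 explore_prob=0.0, rng=None):
    """Return (budget, mask) for a single input.

    All parameters are illustrative assumptions, not the paper's API:
    pooled            -- feature vector summarising the input
    budget_weights    -- vector mapping pooled features to a budget logit
    relevance_weights -- one weight vector per head (relevance logits)
    explore_prob      -- chance of sampling heads instead of taking top-k
    """
    if rng is None:
        rng = random.Random(0)
    num_heads = len(relevance_weights)

    # Head budget: squash a logit into (0, 1), then scale to 1..num_heads.
    budget_logit = sum(w * x for w, x in zip(budget_weights, pooled))
    fraction = 1.0 / (1.0 + math.exp(-budget_logit))
    budget = max(1, min(num_heads, math.ceil(fraction * num_heads)))

    # Relevance distribution over heads.
    logits = [sum(w * x for w, x in zip(row, pooled))
              for row in relevance_weights]
    relevance = softmax(logits)

    if rng.random() < explore_prob:
        # Exploration: sample `budget` distinct heads, weighted by relevance.
        remaining = list(range(num_heads))
        chosen = []
        for _ in range(budget):
            weights = [relevance[h] for h in remaining]
            pick = rng.choices(remaining, weights=weights, k=1)[0]
            chosen.append(pick)
            remaining.remove(pick)
    else:
        # Exploitation: keep the `budget` most relevant heads.
        ranked = sorted(range(num_heads),
                        key=lambda h: relevance[h], reverse=True)
        chosen = ranked[:budget]

    # Binary mask: 1 means the head is computed, 0 means it is skipped.
    mask = [1 if h in chosen else 0 for h in range(num_heads)]
    return budget, mask
```

In a real model the mask would gate which heads are computed at all, which is where any FLOP and memory savings would come from; here it is only returned for inspection. With four heads and a zero budget logit, this sketch keeps the two heads with the highest relevance.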
