How Does Reasoning Flow? Tracing Attention-Induced Information Flow for Targeted RL in LLMs
Abstract
FlowTracer is an RL framework that uses attention-induced graphs to trace reasoning flows and assign token-level credit based on global information propagation structures.
Token-level credit assignment remains a key obstacle for reinforcement learning (RL) in large language models (LLMs): RL recipes typically treat all tokens equally, failing to distinguish decisive reasoning steps from routine formatting or fluent filler. Recent attempts leverage model-internal signals to assign finer-grained credit, but these are often point-wise heuristics that ignore the global structure of information propagation. We propose FlowTracer, an RL framework that traces answer-targeted reasoning flow on an attention-induced directed acyclic graph, in which nodes correspond to tokens and edge capacities come from aggregated attention weights, and derives token credit from this global structure. The edge capacities are reweighted to retain only the influence that can reach the answer region, while local flow conservation is enforced so that intermediate tokens neither lose nor gain effective mass due to path length or irrelevant branches. On this graph, FlowTracer extracts an information-flow backbone connecting the question to the answer and scores tokens by flow throughput, revealing high-impact hubs and aggregation checkpoints that mediate long-range dependencies. The derived importances are used to shape token-level rewards, so that learning signals focus on the tokens that route information toward (or away from) correct answers. The authors report consistent performance gains across a range of reasoning tasks.
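To make the pipeline described in the abstract more concrete, here is a minimal NumPy sketch of one way such a flow computation could look. It is not the authors' implementation. The abstract does not specify how attention is aggregated, how capacities are reweighted, or how rewards are shaped, so everything below is an assumption for illustration: the function names `flow_throughput` and `shape_advantages`, the backward reachability pass, the row normalization used for flow conservation, the uniform injection of mass at question tokens, and the mixing weight `alpha`.

```python
import numpy as np

def flow_throughput(attn, question_idx, answer_idx):
    """Score tokens by answer-directed flow (illustrative sketch only).

    attn[i, j] is assumed to be the aggregated (e.g. head/layer-averaged)
    attention of token i on an earlier token j, so information moves j -> i.
    """
    T = attn.shape[0]
    cap = np.tril(attn, k=-1).T            # cap[j, i]: capacity of edge j -> i (j < i)
    is_answer = np.zeros(T, dtype=bool)
    is_answer[answer_idx] = True

    # Backward pass: reach[j] = influence of token j that can arrive at the answer.
    reach = np.where(is_answer, 1.0, 0.0)
    for j in range(T - 1, -1, -1):
        if not is_answer[j]:
            reach[j] = cap[j] @ reach

    # Reweight each edge by downstream reachability, then normalize every row
    # so a token forwards exactly the mass it receives (local conservation).
    weighted = cap * reach[None, :]
    weighted[is_answer] = 0.0              # answer tokens absorb flow
    row_sum = weighted.sum(axis=1, keepdims=True)
    trans = np.divide(weighted, row_sum,
                      out=np.zeros_like(weighted), where=row_sum > 0)

    # Forward pass: inject unit mass at the question and push it along the DAG.
    flow = np.zeros(T)
    flow[question_idx] = 1.0 / len(question_idx)
    for j in range(T):
        flow[j + 1:] += flow[j] * trans[j, j + 1:]
    return flow

def shape_advantages(advantage, flow, response_idx, alpha=0.5):
    """Hypothetical shaping rule: blend a uniform weight with relative throughput."""
    w = flow[response_idx]
    w = w / (w.mean() + 1e-8)              # mean weight ~1 keeps the overall scale
    return advantage * ((1.0 - alpha) + alpha * w)

# Toy example: random causal attention over 8 tokens.
rng = np.random.default_rng(0)
T = 8
mask = np.tril(np.ones((T, T), dtype=bool))
attn = np.where(mask, np.exp(rng.normal(size=(T, T))), 0.0)
attn /= attn.sum(axis=1, keepdims=True)    # row-wise softmax over visible tokens

flow = flow_throughput(attn, question_idx=[0, 1], answer_idx=[6, 7])
shaped = shape_advantages(1.0, flow, response_idx=[2, 3, 4, 5])
print(np.round(flow, 3))
print(np.round(shaped, 3))
```

In this sketch, a token that no later token attends to gets zero reachability and therefore zero throughput, while all mass injected at the question arrives in the answer region. That is one simple reading of "retain only the influence that can reach the answer region" combined with local flow conservation; the paper's actual construction may differ.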
Community
What if we stopped rewarding every token equally and instead followed the actual "reasoning bloodstream" inside an LLM? FlowTracer tackles one of RL’s messiest blind spots—token-level credit assignment—by turning attention patterns into a directed information-flow graph, tracing how evidence travels from the question to the final answer, and identifying the high-impact reasoning hubs that truly matter. Instead of spraying reward over fluent filler and decisive steps alike, it shapes RL signals around the tokens that route information toward correctness, offering a sharper way to train reasoning models by asking not just what they answered, but how the answer flowed into existence.
Get this paper in your agent:
hf papers read 2606.10646
Don't have the latest CLI?
curl -LsSf https://hf.co/cli/install.sh | bash