Abstract
Process Reward Agents provide domain-grounded, online step-wise rewards for frozen policies in knowledge-intensive reasoning, enabling improved search-based decoding and generalizing across different model sizes without retraining.
Reasoning in knowledge-intensive domains remains challenging as intermediate steps are often not locally verifiable: unlike math or code, evaluating step correctness may require synthesizing clues across large external knowledge sources. As a result, subtle errors can propagate through reasoning traces, potentially never to be detected. Prior work has proposed process reward models (PRMs), including retrieval-augmented variants, but these methods operate post hoc, scoring completed trajectories, which prevents their integration into dynamic inference procedures. Here, we introduce Process Reward Agents (PRA), a test-time method for providing domain-grounded, online, step-wise rewards to a frozen policy. In contrast to prior retrieval-augmented PRMs, PRA enables search-based decoding to rank and prune candidate trajectories at every generation step. Experiments on multiple medical reasoning benchmarks demonstrate that PRA consistently outperforms strong baselines, achieving 80.8% accuracy on MedQA with Qwen3-4B, a new state of the art at the 4B scale. Importantly, PRA generalizes to unseen frozen policy models ranging from 0.5B to 8B parameters, improving their accuracy by up to 25.7% without any policy model updates. More broadly, PRA suggests a paradigm in which frozen reasoners are decoupled from domain-specific reward modules, allowing the deployment of new backbones in complex domains without retraining.
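The abstract does not include an implementation, but the decoding procedure it describes (ranking and pruning candidate trajectories at every generation step using online, step-wise rewards) maps naturally onto reward-guided beam search. Below is a minimal Python sketch under assumed interfaces: `policy.propose_steps`, `policy.is_final`, and `reward_agent.score_step` are hypothetical names, not the authors' API.

```python
# Sketch of reward-guided step-wise beam search with a frozen policy.
# Assumed (hypothetical) interfaces:
#   policy.propose_steps(question, history, n) -> list[str]  candidate next steps
#   policy.is_final(step) -> bool                            step ends the trace
#   reward_agent.score_step(question, history, step) -> float domain-grounded reward

from dataclasses import dataclass, field


@dataclass
class Trajectory:
    steps: list[str] = field(default_factory=list)
    score: float = 0.0  # cumulative process reward


def pra_beam_search(policy, reward_agent, question: str,
                    beam_width: int = 4, branch: int = 4,
                    max_steps: int = 8) -> Trajectory:
    """Rank and prune candidate trajectories at every generation step."""
    beams = [Trajectory()]
    for _ in range(max_steps):
        candidates = []
        for traj in beams:
            # The frozen policy proposes several candidate next steps.
            for step in policy.propose_steps(question, traj.steps, n=branch):
                # Online, step-wise reward from the domain-grounded agent,
                # scored before the trajectory is complete.
                r = reward_agent.score_step(question, traj.steps, step)
                candidates.append(
                    Trajectory(traj.steps + [step], traj.score + r))
        # Prune: keep only the top-scoring partial trajectories.
        beams = sorted(candidates, key=lambda t: t.score,
                       reverse=True)[:beam_width]
        if all(policy.is_final(t.steps[-1]) for t in beams):
            break
    return max(beams, key=lambda t: t.score)
```

Because the reward is computed per step rather than post hoc, weak intermediate steps are pruned before their errors can propagate through the rest of the trace, which is the contrast the abstract draws with prior retrieval-augmented PRMs.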
Community
Process Reward Agents (PRA) is a new framework for disentangling reasoning (i.e., a frozen LRM) from domain knowledge (e.g., medical guidelines), the latter handled by a process reward agent that can search and reward, steering the frozen reasoning policy to improve both overall and step-level correctness. A sketch of such an agent follows below.
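To make the "search and reward" role concrete, here is a minimal sketch of the reward side: an agent that grounds its step score in retrieved domain knowledge. The `retriever.search` and `judge.evaluate` interfaces, and the `[0, 1]` score range, are assumptions for illustration, not the paper's actual components.

```python
# Sketch of a domain-grounded process reward agent.
# Assumed (hypothetical) interfaces:
#   retriever.search(query, k) -> list[str]  evidence from an external source
#                                            (e.g. medical guidelines)
#   judge.evaluate(...) -> float             scalar correctness score in [0, 1]

class ProcessRewardAgent:
    def __init__(self, retriever, judge):
        self.retriever = retriever  # searches the external knowledge source
        self.judge = judge          # evaluator conditioned on retrieved evidence

    def score_step(self, question: str, history: list[str],
                   step: str) -> float:
        # Retrieve evidence relevant to the candidate step itself,
        # not just the original question.
        evidence = self.retriever.search(f"{question}\n{step}", k=5)
        # Judge the step against the evidence; the policy stays frozen,
        # so swapping in a new backbone requires no retraining here.
        return self.judge.evaluate(question, history, step, evidence)
```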
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- SubSearch: Intermediate Rewards for Unsupervised Guided Reasoning in Complex Retrieval (2026)
- PRAISE: Prefix-Based Rollout Reuse in Agentic Search Training (2026)
- Evaluate-as-Action: Self-Evaluated Process Rewards for Retrieval-Augmented Agents (2026)
- Search-P1: Path-Centric Reward Shaping for Stable and Efficient Agentic RAG Training (2026)
- One-Token Verification for Reasoning Correctness Estimation (2026)
- Recycling Failures: Salvaging Exploration in RLVR via Fine-Grained Off-Policy Guidance (2026)
- Improving reasoning at inference time via uncertainty minimisation (2026)