Think Again or Think Longer? Selective Verification for Budget-Aware Reasoning
Abstract
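Selective verification decides at test time whether to verify a frozen solver's answer. On math benchmarks it slightly exceeds always-on verification in accuracy while using far fewer verification tokens and causing fewer harmful answer flips, although a longer initial solve matches its accuracy at lower token cost.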
Test-time reasoning is increasingly used as a serving-time control knob, but extra reasoning is not uniformly valuable: it can repair failed attempts, waste compute on already-correct answers, or introduce harmful answer changes. We study this as a deployment allocation problem rather than a new-verifier problem. We introduce SEVRA, Selective Verification for Reasoning Allocation, a serving-layer controller that decides whether to preserve a frozen solver's initial answer or invoke active verification. Using a frozen Qwen3-4B solver, we log intervention outcomes and train recoverability-aware gates from serving-visible attempt state. On MATH500, selective verification reaches 76.3% accuracy, compared with 75.5% for always verifying, while reducing post-generation tokens by 26.8% and harmful flips from 2.2% to 1.0%. However, an 8,192-token initial solve reaches 76.0% accuracy with 28% fewer total model tokens, showing that selective recovery is useful but not the best tested cost frontier. In frozen transfer to GSM8K, the selective policy verifies only 3.0% of examples, improves accuracy from 93.4% to 94.5%, and reduces verification tokens by 91.2% relative to always verifying; again, a longer initial solve matches its accuracy with fewer realized tokens. On CommonsenseQA, always-on verification hurts, while Self-Consistency@5 improves accuracy at about five times the realized token cost. The resulting deployment rule is: tune the initial budget first, then use selective recovery when explicit checks, bounded retries, auditability, or regression-risk control matter.
Community
Test-time reasoning is often treated as a simple knob: give the model more tokens, ask it to verify, or let it try again. But extra reasoning is not always helpful. It can fix failed attempts, waste compute on answers that were already correct, or even flip a correct answer into a wrong one.
We study this as a deployment allocation problem: when should a reasoning system accept its first answer, and when should it spend extra compute on verification?
We introduce SEVRA — Selective Verification for Reasoning Allocation — a serving-layer controller that decides whether to preserve a frozen solver’s initial answer or invoke active verification. Using a frozen Qwen3-4B solver, we log post-generation intervention outcomes and train recoverability-aware gates from serving-visible signals such as completion status, token use, finalizer use, and attempt state.
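To make the gating idea concrete, here is a minimal Python sketch of a controller that reads serving-visible signals and decides whether to keep the first answer or invoke verification. Everything in it is an assumption for illustration: the `AttemptState` fields, the hand-set weights in `recoverability_score`, and the `threshold` value are hypothetical. The paper trains its gates from logged intervention outcomes; this sketch substitutes a fixed scoring rule only to show where such a gate would sit.

```python
from dataclasses import dataclass


@dataclass
class AttemptState:
    """Hypothetical container for signals visible at serving time."""
    completed: bool       # did the solver finish within its budget?
    tokens_used: int      # tokens spent on the initial solve
    token_budget: int     # budget granted to the initial solve
    used_finalizer: bool  # was a finalizer pass needed to extract an answer?


def recoverability_score(state: AttemptState) -> float:
    """Illustrative hand-set score in [0, 1]; NOT the paper's trained gate."""
    score = 0.0
    if not state.completed:
        score += 0.5  # an unfinished attempt is a likely candidate for recovery
    if state.used_finalizer:
        score += 0.3  # needing a finalizer hints that the attempt went off track
    budget = max(1, state.token_budget)  # guard against a zero budget
    score += 0.2 * min(1.0, state.tokens_used / budget)
    return score


def should_verify(state: AttemptState, threshold: float = 0.5) -> bool:
    """Return True to invoke verification, False to preserve the first answer."""
    return recoverability_score(state) >= threshold


clean = AttemptState(completed=True, tokens_used=1000,
                     token_budget=4096, used_finalizer=False)
truncated = AttemptState(completed=False, tokens_used=4096,
                         token_budget=4096, used_finalizer=True)
print(should_verify(clean), should_verify(truncated))
```

In a real deployment the fixed weights would be replaced by a model fitted to logged outcomes, so that the gate fires mainly where verification has historically repaired an answer rather than flipped a correct one.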
Across benchmarks, SEVRA shows that selective verification can improve reliability and reduce unnecessary verification. On MATH500, it reaches 76.3% accuracy compared with 75.5% for always verifying, while reducing post-generation tokens by 26.8% and harmful flips from 2.2% to 1.0%. On GSM8K, it verifies only 3.0% of examples, improves accuracy from 93.4% to 94.5%, and reduces verification tokens by 91.2% relative to always verifying.
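The quantities reported above (accuracy, fraction verified, harmful flips, verification tokens) can all be computed from a log of per-example outcomes. The sketch below shows one way to do that in Python. The record fields, the `summarize` helper, and the four toy records are hypothetical and are not the paper's data; the harmful-flip rate is assumed here to be measured over all examples, which the source does not specify.

```python
def summarize(records, policy):
    """Score a verification policy on logged outcomes.

    records: list of dicts with hypothetical keys
        initial_correct  - was the first answer correct?
        verified_correct - would the answer be correct after verification?
        verify_tokens    - tokens that verification would cost
    policy: function record -> bool, True means "verify this example"
    """
    n = len(records)
    correct = verified = harmful = tokens = 0
    for r in records:
        if policy(r):
            verified += 1
            tokens += r["verify_tokens"]
            final_correct = r["verified_correct"]
            # Harmful flip: a correct first answer turned into a wrong one.
            if r["initial_correct"] and not r["verified_correct"]:
                harmful += 1
        else:
            final_correct = r["initial_correct"]
        correct += int(final_correct)
    return {
        "accuracy": correct / n,
        "verify_rate": verified / n,
        "harmful_flip_rate": harmful / n,  # assumed denominator: all examples
        "verify_tokens": tokens,
    }


# Toy log for illustration only; these are not results from the paper.
toy_log = [
    {"initial_correct": True,  "verified_correct": True,  "verify_tokens": 100, "flagged": False},
    {"initial_correct": True,  "verified_correct": False, "verify_tokens": 120, "flagged": False},
    {"initial_correct": False, "verified_correct": True,  "verify_tokens": 200, "flagged": True},
    {"initial_correct": False, "verified_correct": False, "verify_tokens": 180, "flagged": True},
]

always = summarize(toy_log, lambda r: True)
selective = summarize(toy_log, lambda r: r["flagged"])
print(always)
print(selective)
```

On this toy log the selective policy skips the example that verification would have broken, which is the mechanism behind a lower harmful-flip rate and lower verification token use.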
However, the story is not simply “verify more.” A longer initial solve matches selective verification on the math benchmarks with fewer realized tokens, and on CommonsenseQA, always-on verification actually hurts. This suggests that verification is useful as a recovery and auditability mechanism, but it is not always the best compute allocation.
The practical takeaway is:
Tune the initial reasoning budget first. Then use selective verification when explicit checks, bounded retries, auditability, or regression-risk control matter.