arxiv:2606.19808

Think Again or Think Longer? Selective Verification for Budget-Aware Reasoning

Published on Jun 18

Submitted by Sajib Acharjee Dip on Jun 18

Virginia Polytechnic Institute and State University



Abstract

Selective verification approaches optimize test-time reasoning by dynamically deciding when to verify answers, achieving better accuracy and efficiency compared to always-verifying or self-consistency methods.

Generated by Qwen/Qwen2.5-Coder-32B-Instruct

Test-time reasoning is increasingly used as a serving-time control knob, but extra reasoning is not uniformly valuable: it can repair failed attempts, waste compute on already-correct answers, or introduce harmful answer changes. We study this as a deployment allocation problem rather than a new-verifier problem. We introduce SEVRA, Selective Verification for Reasoning Allocation, a serving-layer controller that decides whether to preserve a frozen solver's initial answer or invoke active verification. Using a frozen Qwen3-4B solver, we log intervention outcomes and train recoverability-aware gates from serving-visible attempt state. On MATH500, selective verification reaches 76.3% accuracy, compared with 75.5% for always verifying, while reducing post-generation tokens by 26.8% and harmful flips from 2.2% to 1.0%. However, an 8,192-token initial solve reaches 76.0% accuracy with 28% fewer total model tokens, showing that selective recovery is useful but not the best tested cost frontier. In frozen transfer to GSM8K, the selective policy verifies only 3.0% of examples, improves accuracy from 93.4% to 94.5%, and reduces verification tokens by 91.2% relative to always verifying; again, a longer initial solve matches its accuracy with fewer realized tokens. On CommonsenseQA, always-on verification hurts, while Self-Consistency@5 improves accuracy at about five times the realized token cost. The resulting deployment rule is: tune the initial budget first, then use selective recovery when explicit checks, bounded retries, auditability, or regression-risk control matter.


Community

Sajib-006 (paper submitter), about 10 hours ago, edited about 6 hours ago

Test-time reasoning is often treated as a simple knob: give the model more tokens, ask it to verify, or let it try again. But extra reasoning is not always helpful. It can fix failed attempts, waste compute on answers that were already correct, or even flip a correct answer into a wrong one.

We study this as a deployment allocation problem: when should a reasoning system accept its first answer, and when should it spend extra compute on verification?

We introduce SEVRA — Selective Verification for Reasoning Allocation — a serving-layer controller that decides whether to preserve a frozen solver’s initial answer or invoke active verification. Using a frozen Qwen3-4B solver, we log post-generation intervention outcomes and train recoverability-aware gates from serving-visible signals such as completion status, token use, finalizer use, and attempt state.
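To make the controller idea concrete, here is a minimal sketch of a gate that reads serving-visible signals and decides whether to keep the initial answer or invoke verification. It is not the paper's trained gate: the field names, weights, and threshold below are hypothetical, and the hand-set score simply stands in for a learned recoverability model.

```python
from dataclasses import dataclass


@dataclass
class AttemptState:
    """Serving-visible signals for one solver attempt (field names are hypothetical)."""
    completed: bool       # did the solver finish with a final answer?
    tokens_used: int      # tokens spent on the initial solve
    token_budget: int     # budget allotted to the initial solve
    used_finalizer: bool  # was a finalizer needed to extract an answer?


def recoverability_score(state: AttemptState) -> float:
    """Hand-set stand-in for a trained gate; higher means verification looks more worthwhile."""
    score = 0.0
    if not state.completed:
        score += 0.5
    if state.used_finalizer:
        score += 0.3
    # Attempts that ran close to the budget are treated as more suspect (assumed heuristic).
    budget = max(1, state.token_budget)
    score += 0.2 * min(1.0, state.tokens_used / budget)
    return score


def should_verify(state: AttemptState, threshold: float = 0.5) -> bool:
    """Return True to invoke verification, False to preserve the initial answer."""
    return recoverability_score(state) >= threshold
```

In this sketch a clean, early-finishing attempt is passed through untouched, while a truncated attempt that exhausted its budget is routed to verification. A real gate would be fit on logged intervention outcomes rather than hand-tuned.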

Across benchmarks, SEVRA shows that selective verification can improve reliability and reduce unnecessary verification. On MATH500, it reaches 76.3% accuracy compared with 75.5% for always verifying, while reducing post-generation tokens by 26.8% and harmful flips from 2.2% to 1.0%. On GSM8K, it verifies only 3.0% of examples, improves accuracy from 93.4% to 94.5%, and reduces verification tokens by 91.2% relative to always verifying.
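The quantities reported above can be computed from a per-example log of intervention outcomes. The sketch below shows one plausible way to do so; the record layout is hypothetical, and it assumes that a harmful flip means an initially correct answer that ends up wrong, measured over all examples.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Outcome:
    """One logged example (hypothetical record layout)."""
    initial_correct: bool  # was the solver's first answer correct?
    final_correct: bool    # was the answer correct after the policy ran?
    verified: bool         # did the policy invoke verification?
    verify_tokens: int     # tokens spent on verification (0 if skipped)


def summarize(outcomes: List[Outcome]) -> Dict[str, float]:
    """Aggregate accuracy, harmful-flip rate, verification rate, and token spend."""
    n = len(outcomes)
    return {
        "accuracy": sum(o.final_correct for o in outcomes) / n,
        # Assumed definition: correct before, wrong after, over all examples.
        "harmful_flip_rate": sum(
            o.initial_correct and not o.final_correct for o in outcomes
        ) / n,
        "verify_rate": sum(o.verified for o in outcomes) / n,
        "verify_tokens": float(sum(o.verify_tokens for o in outcomes)),
    }


def token_reduction(policy_tokens: float, baseline_tokens: float) -> float:
    """Fractional saving of a policy relative to a baseline such as always verifying."""
    return 1.0 - policy_tokens / baseline_tokens
```

Running `summarize` once on the selective policy's log and once on the always-verify log, then passing the two `verify_tokens` totals to `token_reduction`, gives a relative saving of the kind quoted above.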

However, the story is not simply “verify more.” A longer initial solve matches selective verification on the math benchmarks with fewer realized tokens, and on CommonsenseQA, always-on verification actually hurts. This suggests that verification is useful as a recovery and auditability mechanism, but it is not always the best compute allocation.
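One way to check whether a policy is "the best compute allocation" is to ask whether it sits on the accuracy-versus-tokens frontier. The sketch below filters out dominated policies; the `Policy` type is hypothetical, and any numbers passed to it are up to the reader, since this is an illustration rather than the paper's evaluation code.

```python
from typing import List, NamedTuple


class Policy(NamedTuple):
    name: str
    accuracy: float  # fraction of examples answered correctly
    tokens: float    # mean realized tokens per example


def cost_frontier(policies: List[Policy]) -> List[Policy]:
    """Keep policies that no other policy beats on both accuracy and token cost."""
    frontier = []
    for p in policies:
        dominated = any(
            q.accuracy >= p.accuracy
            and q.tokens <= p.tokens
            and (q.accuracy > p.accuracy or q.tokens < p.tokens)
            for q in policies
        )
        if not dominated:
            frontier.append(p)
    # Cheapest first, so the list reads as a cost curve.
    return sorted(frontier, key=lambda p: p.tokens)
```

A verification policy that a longer initial solve matches in accuracy at lower token cost would be dropped by this filter, which is the situation described for the math benchmarks. It can still be the right choice when auditability or bounded retries matter more than raw cost.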

The practical takeaway is:

Tune the initial reasoning budget first. Then use selective verification when explicit checks, bounded retries, auditability, or regression-risk control matter.
