arxiv:2606.28661

When More Sampling Hurts: The Modal Ceiling and Correlation Ceiling of Test-Time Scaling

Published on Jun 27

Submitted by Yong Yi Bay on Jul 2


Authors: Yong Yi Bay

Abstract

Sampling-based reasoning systems face a trade-off between coverage and selection, where additional samples beyond a few dozen provide diminishing returns and can degrade performance.

Generated by Qwen/Qwen2.5-Coder-32B-Instruct

People overthink; language models over-sample, and the extra effort can talk both into a worse answer. Reasoning systems answer a hard question by sampling it many times (test-time scaling), and the more they draw, the more often a correct answer turns up somewhere, so coverage, the fraction of problems with at least one correct try, climbs and appears to be progress. But a deployed system must return one answer, and choosing it, not knowing which try is right, is selection; selection is capped, and past a point extra samples only make the model surer of a confident mistake, even as every draw adds cost. The gap between climbing coverage and stalled selection, the identifiability gap, is the answer a model can produce but not pick. So the real question is not whether to sample but how far, and the answer is: not far. For picking an answer, the vote has already settled within a few dozen draws, the modal ceiling; for scoring a benchmark, sooner still, the correlation ceiling. Beyond that, extra draws cost compute and add nothing, and can even make the answer worse. This paper turns the cutoff into a single number, the effective number of samples, that any sampling run already reveals. The bottleneck is recognizing a right answer, not generating one.
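The gap between climbing coverage and stalled selection can be illustrated with a toy simulation. This is a minimal sketch, not the paper's method: it assumes independent draws from a fixed answer distribution for a single problem, and the distribution `answer_probs`, the helper names, and the tie-breaking rule are all hypothetical choices made for illustration. The correct answer "A" is deliberately not the most likely answer, so majority voting converges on the wrong mode "B" as the number of draws grows.

```python
import random
from collections import Counter


def coverage(p_correct, k):
    """Probability that at least one of k independent draws is correct."""
    return 1.0 - (1.0 - p_correct) ** k


def majority_vote_accuracy(answer_probs, correct, k, trials=2000, seed=0):
    """Monte Carlo estimate of how often the plurality answer among k draws is correct.

    Ties are broken by Counter.most_common, i.e. in favour of the answer seen first.
    """
    rng = random.Random(seed)
    answers = list(answer_probs)
    weights = [answer_probs[a] for a in answers]
    hits = 0
    for _ in range(trials):
        draws = rng.choices(answers, weights=weights, k=k)
        winner = Counter(draws).most_common(1)[0][0]
        hits += winner == correct
    return hits / trials


# Hypothetical single problem: "A" is correct, but the mode "B" is wrong.
answer_probs = {"A": 0.30, "B": 0.45, "C": 0.25}

for k in (1, 4, 16, 64):
    cov = coverage(answer_probs["A"], k)
    acc = majority_vote_accuracy(answer_probs, "A", k)
    print(f"k={k:>2}  coverage={cov:.3f}  majority-vote accuracy={acc:.3f}")
```

In this toy setting coverage approaches 1 while majority-vote accuracy falls below its single-draw value, which is the anti-scaling behaviour the abstract describes for problems whose mode is wrong.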


Community

bay-yearick-lab (paper author, paper submitter)

When does the 10,001st sample stop helping? And can more sampling ever hurt? Reframing single-model sampling as cluster sampling answers both: effective draws saturate at a hard correlation ceiling 1/ρ (about 2 on released logs), and selection is capped by a modal ceiling π_mode that anti-scales where the mode is wrong. Coverage climbing while majority voting plateaus is an identifiability gap, not a compute limit. Measured on public logs, fully reproducible: https://github.com/bay-yearick-lab/sampling-ceilings
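The saturation at 1/ρ can be sketched numerically. The snippet below assumes the standard design-effect formula from cluster sampling, n_eff = n / (1 + (n - 1)ρ), which tends to 1/ρ as n grows; the paper's own definition of effective samples may differ, and the function name `effective_samples` and the value ρ = 0.5 (chosen only so that the ceiling is 2, in line with the "about 2" quoted above) are assumptions for illustration.

```python
def effective_samples(n, rho):
    """Effective number of independent draws among n equicorrelated samples.

    Uses the textbook design-effect formula n / (1 + (n - 1) * rho),
    where rho is the intra-cluster correlation between samples.
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    if not 0.0 <= rho <= 1.0:
        raise ValueError("rho must lie in [0, 1]")
    return n / (1.0 + (n - 1) * rho)


rho = 0.5  # assumed value; gives a ceiling of 1 / rho = 2
for n in (1, 2, 8, 64, 10_000):
    print(f"n={n:>6}  n_eff={effective_samples(n, rho):.4f}  ceiling={1 / rho:.1f}")
```

Under this formula the 10,001st sample adds almost nothing: the effective count is already within a fraction of a percent of the ceiling after a few dozen draws.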


