TEMPO: Scaling Test-time Training for Large Reasoning Models
Abstract
TEMPO is a test-time training framework that alternates policy refinement with critic recalibration to sustain performance improvements in language models without diversity collapse.
Test-time training (TTT) adapts model parameters on unlabeled test instances at inference time, continuously extending capabilities beyond the reach of offline training. Despite initial gains, existing TTT methods for large reasoning models (LRMs) plateau quickly and fail to benefit from additional test-time compute. Without external calibration, the self-generated reward signal increasingly drifts as the policy model evolves, leading to both performance plateaus and diversity collapse. We propose TEMPO, a TTT framework that interleaves policy refinement on unlabeled questions with periodic critic recalibration on a labeled dataset. By formalizing this alternating procedure through the Expectation-Maximization (EM) algorithm, we reveal that prior methods can be interpreted as incomplete variants that omit the crucial recalibration step. Reintroducing this step tightens the evidence lower bound (ELBO) and enables sustained improvement. Across diverse model families (Qwen3 and OLMO3) and reasoning tasks, TEMPO improves OLMO3-7B on AIME 2024 from 33.0% to 51.1% and Qwen3-14B from 42.3% to 65.8%, while maintaining high diversity.
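To make the EM framing concrete, the standard decomposition below writes the log-evidence as an ELBO plus a KL gap; the notation and the mapping of TEMPO's two phases onto the E- and M-steps are our reading of the abstract, not the paper's exact formalization.

```latex
% Standard EM decomposition: x is the observed answer, z a latent
% reasoning trace, \theta the policy parameters, and q a variational
% distribution over traces (here: the critic's implicit weighting).
\log p_\theta(x)
  = \underbrace{\mathbb{E}_{q(z)}\!\left[\log \frac{p_\theta(x, z)}{q(z)}\right]}_{\mathrm{ELBO}(q,\,\theta)}
  + \mathrm{KL}\!\left(q(z) \,\middle\|\, p_\theta(z \mid x)\right)
```

The E-step sets q to the current posterior, driving the KL term to zero and tightening the bound; the M-step maximizes the ELBO over θ. On this reading, critic recalibration plays the E-step role: skipping it lets the KL gap grow as θ moves, which is the reward drift the abstract describes.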
Community
Models stop learning once training ends. Test-time training (TTT) tries to change that by letting models keep improving on new problems at inference. But current approaches plateau fast: the self-generated reward signal drifts and the model collapses into repeating a single reasoning pattern. TEMPO fixes this with a simple EM-style loop: periodically recalibrate the reward critic on a small labeled set, then refine the policy on unlabeled test questions. This keeps the training signal honest as the model evolves. Results: OLMO3-7B goes from 33% to 51% on AIME 2024 and Qwen3-14B from 42% to 66%, still climbing at 350 steps where baselines flatline. Diversity (pass@k) stays high instead of collapsing, and the approach also generalizes to non-math reasoning tasks.
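A minimal sketch of the alternating loop described above. Every name here (refine_policy, recalibrate_critic, RECAL_EVERY) is a hypothetical placeholder; the paper's actual API and recalibration schedule are not given on this page.

```python
from itertools import cycle

RECAL_EVERY = 50  # hypothetical recalibration period, in steps

def refine_policy(policy, critic, batch):
    """Placeholder M-step: update the policy on unlabeled test questions
    using rollouts scored by the critic's self-generated reward."""
    ...

def recalibrate_critic(critic, labeled_set):
    """Placeholder E-step: re-fit the critic on a small labeled set so
    its reward signal stays aligned with the evolving policy."""
    ...

def tempo_loop(policy, critic, unlabeled_questions, labeled_set, steps=350):
    """Alternate policy refinement with periodic critic recalibration."""
    stream = cycle(unlabeled_questions)
    for step in range(1, steps + 1):
        refine_policy(policy, critic, next(stream))
        if step % RECAL_EVERY == 0:
            recalibrate_critic(critic, labeled_set)
    return policy
```

The design point is simply that recalibration is periodic rather than per-step: the policy trains on the unlabeled stream most of the time, and the labeled set is only touched every RECAL_EVERY steps to re-anchor the reward.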