TEMPO: Scaling Test-time Training for Large Reasoning Models
Abstract
TEMPO is a test-time training framework that alternates policy refinement with critic recalibration to sustain performance improvements in language models without diversity collapse.
Test-time training (TTT) adapts model parameters on unlabeled test instances at inference time, continuously extending capabilities beyond the reach of offline training. Despite initial gains, existing TTT methods for large reasoning models (LRMs) plateau quickly and fail to benefit from additional test-time compute. Without external calibration, the self-generated reward signal increasingly drifts as the policy model evolves, leading to both performance plateaus and diversity collapse. We propose TEMPO, a TTT framework that interleaves policy refinement on unlabeled questions with periodic critic recalibration on a labeled dataset. By formalizing this alternating procedure through the Expectation-Maximization (EM) algorithm, we reveal that prior methods can be interpreted as incomplete variants that omit the crucial recalibration step. Reintroducing this step tightens the evidence lower bound (ELBO) and enables sustained improvement. Across diverse model families (Qwen3 and OLMO3) and reasoning tasks, TEMPO improves OLMO3-7B on AIME 2024 from 33.0% to 51.1% and Qwen3-14B from 42.3% to 65.8%, while maintaining high diversity.
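To make the EM framing concrete, the standard decomposition below writes the log-evidence as an ELBO plus a KL gap; the notation and the mapping of TEMPO's two phases onto the E- and M-steps are our reading of the abstract, not the paper's exact formalization.

```latex
% Standard EM decomposition: x is the observed answer, z a latent
% reasoning trace, \theta the policy parameters, and q a variational
% distribution over traces (here: the critic's implicit weighting).
\log p_\theta(x)
  = \underbrace{\mathbb{E}_{q(z)}\!\left[\log \frac{p_\theta(x, z)}{q(z)}\right]}_{\mathrm{ELBO}(q,\,\theta)}
  + \mathrm{KL}\!\left(q(z) \,\middle\|\, p_\theta(z \mid x)\right)
```

The E-step sets q to the current posterior, driving the KL term to zero and tightening the bound; the M-step maximizes the ELBO over θ. On this reading, critic recalibration plays the E-step role: skipping it lets the KL gap grow as θ moves, which is the reward drift the abstract describes.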
Community
Models stop learning once training ends. Test-time training (TTT) tries to change that by letting models keep improving on new problems at inference. But current approaches plateau fast: the self-generated reward signal drifts and the model collapses into repeating a single reasoning pattern. TEMPO fixes this with a simple EM-style loop: periodically recalibrate the reward critic on a small labeled set, then refine the policy on unlabeled test questions. This keeps the training signal honest as the model evolves. Results: OLMO3-7B goes from 33% to 51% on AIME 2024 and Qwen3-14B from 42% to 66%, still climbing at 350 steps where baselines flatline. Diversity (pass@k) stays high instead of collapsing, and the approach also generalizes to non-math reasoning tasks.
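A minimal sketch of the alternating loop described above. Every name here (refine_policy, recalibrate_critic, RECAL_EVERY) is a hypothetical placeholder; the paper's actual API and recalibration schedule are not given on this page.

```python
from itertools import cycle

RECAL_EVERY = 50  # hypothetical recalibration period, in steps

def refine_policy(policy, critic, batch):
    """Placeholder M-step: update the policy on unlabeled test questions
    using rollouts scored by the critic's self-generated reward."""
    ...

def recalibrate_critic(critic, labeled_set):
    """Placeholder E-step: re-fit the critic on a small labeled set so
    its reward signal stays aligned with the evolving policy."""
    ...

def tempo_loop(policy, critic, unlabeled_questions, labeled_set, steps=350):
    """Alternate policy refinement with periodic critic recalibration."""
    stream = cycle(unlabeled_questions)
    for step in range(1, steps + 1):
        refine_policy(policy, critic, next(stream))
        if step % RECAL_EVERY == 0:
            recalibrate_critic(critic, labeled_set)
    return policy
```

The design point is simply that recalibration is periodic rather than per-step: the policy trains on the unlabeled stream most of the time, and the labeled set is only touched every RECAL_EVERY steps to re-anchor the reward.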