Apr 21, 2026 | arXiv:2604.19295

TEMPO: Scaling Test-time Training for Large Reasoning Models

Qingyang Zhang, Xinke Kong, Haitao Wu, Qinghua Hu, Minghao Wu, Baosong Yang, Yu Cheng, Yun Luo, Ganqu Cui, Changqing Zhang

AI Summary

The paper introduces TEMPO, a test-time training (TTT) framework for Large Reasoning Models (LRMs) that addresses two failure modes of existing TTT methods: performance plateaus and diversity collapse. TEMPO interleaves policy refinement on unlabeled questions with periodic critic recalibration on a labeled dataset, and formalizes this alternating procedure as an Expectation-Maximization (EM) algorithm. Experiments with Qwen3 and OLMO3 models on reasoning tasks show sustained improvements while preserving diversity: on AIME 2024, accuracy rises by over 18 percentage points for both reported models (OLMO3-7B from 33.0% to 51.1%, Qwen3-14B from 42.3% to 65.8%).

Key Contribution

TEMPO makes test-time training scale for large reasoning models: interleaving policy refinement with periodic critic recalibration yields sustained gains, raising AIME 2024 accuracy by over 18 percentage points (OLMO3-7B from 33.0% to 51.1%, Qwen3-14B from 42.3% to 65.8%).

Abstract

Test-time training (TTT) adapts model parameters on unlabeled test instances during inference time, which continuously extends capabilities beyond the reach of offline training. Despite initial gains, existing TTT methods for LRMs plateau quickly and do not benefit from additional test-time compute. Without external calibration, the self-generated reward signal increasingly drifts as the policy model evolves, leading to both performance plateaus and diversity collapse. We propose TEMPO, a TTT framework that interleaves policy refinement on unlabeled questions with periodic critic recalibration on a labeled dataset. By formalizing this alternating procedure through the Expectation-Maximization (EM) algorithm, we reveal that prior methods can be interpreted as incomplete variants that omit the crucial recalibration step. Reintroducing this step tightens the evidence lower bound (ELBO) and enables sustained improvement. Across diverse model families (Qwen3 and OLMO3) and reasoning tasks, TEMPO improves OLMO3-7B on AIME 2024 from 33.0% to 51.1% and Qwen3-14B from 42.3% to 65.8%, while maintaining high diversity.
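
The alternating schedule described in the abstract can be sketched as a simple loop. This is only an illustrative sketch, not the authors' implementation: the function name alternating_ttt, the callables refine_policy and recalibrate_critic, and the recalibrate_every parameter are all hypothetical, and the abstract does not specify how often recalibration occurs or how either update is computed.

```python
from typing import Any, Callable, List, Sequence, Tuple


def alternating_ttt(
    policy: Any,
    critic: Any,
    unlabeled: Sequence[str],
    labeled: Sequence[Tuple[str, str]],
    refine_policy: Callable[[Any, Any, Sequence[str]], Any],
    recalibrate_critic: Callable[[Any, Any, Sequence[Tuple[str, str]]], Any],
    num_steps: int,
    recalibrate_every: int,
) -> Tuple[Any, Any, List[str]]:
    """Interleave policy refinement with periodic critic recalibration.

    All names here are hypothetical; the update rules are supplied by the
    caller because the abstract does not define them.
    """
    schedule: List[str] = []
    for step in range(1, num_steps + 1):
        # Policy refinement: uses only unlabeled questions, scored by the critic.
        policy = refine_policy(policy, critic, unlabeled)
        schedule.append("refine")
        # Periodic recalibration: re-anchor the critic on labeled data so its
        # reward signal does not drift as the policy changes.
        if step % recalibrate_every == 0:
            critic = recalibrate_critic(critic, policy, labeled)
            schedule.append("recalibrate")
    return policy, critic, schedule


# Toy stand-ins (assumptions): policy and critic are counters of updates.
def toy_refine(policy: int, critic: int, questions: Sequence[str]) -> int:
    return policy + 1


def toy_recalibrate(critic: int, policy: int, pairs: Sequence[Tuple[str, str]]) -> int:
    return critic + 1


final_policy, final_critic, order = alternating_ttt(
    policy=0,
    critic=0,
    unlabeled=["q1", "q2"],
    labeled=[("q3", "a3")],
    refine_policy=toy_refine,
    recalibrate_critic=toy_recalibrate,
    num_steps=4,
    recalibrate_every=2,
)
print(order)
```

Setting recalibrate_every larger than num_steps in this sketch skips recalibration entirely, which loosely mirrors the abstract's description of prior methods as variants that omit the recalibration step.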

Inference & Quantization | Reasoning & Chain-of-Thought | Training Efficiency & Optimization

Citation Metrics

Citations: 0

Influential citations: 0

References: 27

Year: 2026

Venue: N/A
