Apr 1, 2026 | arXiv:2604.01411

Test-Time Scaling Makes Overtraining Compute-Optimal

Nicholas Roberts, Sung-Dae Cho, Zhiqi Gao, Tzu-Heng Huang, Albert Wu, Gabriel Orlanski, Avi Trost, Kelly A. Buchanan, Aws Albarghouthi, Frederic Sala

AI Summary

This paper introduces Train-to-Test ($T^2$) scaling laws, which jointly optimize model size, training tokens, and the number of inference samples under a fixed compute budget, addressing the limitations of pretraining scaling laws that don't account for test-time scaling. $T^2$ uses pass@$k$ modeling to capture the impact of test-time scaling and optimize pretraining decisions accordingly. The key finding is that accounting for inference costs shifts optimal pretraining decisions into the overtraining regime, which is validated through experiments showing improved performance of heavily overtrained models.

Key Contribution

Optimal LLM pretraining actually requires *overtraining* when you account for inference costs, overturning conventional scaling wisdom.

Abstract

Modern LLMs scale at test time, e.g., via repeated sampling, where inference cost grows with model size and the number of samples. This creates a trade-off that pretraining scaling laws, such as Chinchilla, do not address. We present Train-to-Test ($T^2$) scaling laws that jointly optimize model size, training tokens, and number of inference samples under fixed end-to-end budgets. $T^2$ modernizes pretraining scaling laws with the pass@$k$ modeling used for test-time scaling, then jointly optimizes pretraining and test-time decisions. Forecasts from $T^2$ are robust across distinct modeling approaches: measuring the joint scaling effect on task loss, and modeling the impact on task accuracy. Across eight downstream tasks, we find that when accounting for inference cost, optimal pretraining decisions shift radically into the overtraining regime, well outside the range of standard pretraining scaling suites. We validate our results by pretraining heavily overtrained models in the optimal region that $T^2$ scaling forecasts, confirming their substantially stronger performance compared to pretraining scaling alone. Finally, as frontier LLMs are post-trained, we show that our findings survive the post-training stage, making $T^2$ scaling meaningful in modern deployments.
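To make the trade-off in the abstract concrete, here is a minimal Python sketch. It is not the paper's $T^2$ functional form, which this page does not give. It assumes the widely used rules of thumb of roughly $6ND$ FLOPs for training and $2N$ FLOPs per generated token at inference, and it uses the standard unbiased pass@$k$ estimator from the code-generation literature. All function and parameter names (`max_samples`, `tokens_per_sample`, `n_queries`, and so on) are hypothetical.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate given n samples, of which c are correct."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k incorrect samples: every size-k subset has a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def train_flops(n_params: int, n_tokens: int) -> int:
    """Assumed approximation: about 6 FLOPs per parameter per training token."""
    return 6 * n_params * n_tokens


def max_samples(budget: int, n_params: int, n_tokens: int,
                tokens_per_sample: int, n_queries: int) -> int:
    """Largest k affordable once training is paid for out of one shared budget.

    Assumed approximation: about 2 FLOPs per parameter per generated token.
    """
    remaining = budget - train_flops(n_params, n_tokens)
    if remaining <= 0:
        return 0
    flops_per_k = 2 * n_params * tokens_per_sample * n_queries
    return remaining // flops_per_k


# Same budget and training tokens, two model sizes (illustrative numbers only).
budget = 10**21
k_large = max_samples(budget, 10**9, 10**11, tokens_per_sample=1000, n_queries=10**6)
k_small = max_samples(budget, 5 * 10**8, 10**11, tokens_per_sample=1000, n_queries=10**6)
```

Under these assumptions, halving the model size frees up both training and per-sample inference compute, so the smaller model can afford more samples per query. Whether that larger $k$ compensates for the smaller model's lower per-sample accuracy is the question a joint scaling law has to answer.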

Inference & Quantization; Scaling Laws & Emergent Abilities; Training Efficiency & Optimization

Citation Metrics

Citations: 0

Influential citations: 0

References: 35

Year: 2026

Venue: N/A
