School of Mechanical Engineering | Mar 5, 2026 | arXiv:2603.05369

Progressive Residual Warmup for Language Model Pretraining

Tianhao Chen, Xin Xu, Lu Yin, Hao Chen, Yang Wang, Shizhe Diao, Can Yang

AI Summary

The paper introduces Progressive Residual Warmup (ProRes), a novel pretraining technique for Transformers that gradually warms up residual connections layer-by-layer, starting with earlier layers. This approach mitigates pretraining instability by allowing shallower layers to stabilize before deeper layers contribute to the learning process. Experiments across various model scales and configurations demonstrate that ProRes stabilizes pretraining, accelerates convergence, and improves generalization and downstream performance.

Key Contribution

By warming up residual connections layer by layer, with earlier layers first, ProRes makes language model pretraining faster and more stable.

Abstract

Transformer architectures serve as the backbone for most modern Large Language Models; their pretraining stability and convergence speed are therefore of central concern. Motivated by the logical dependency of sequentially stacked layers, we propose Progressive Residual Warmup (ProRes) for language model pretraining. ProRes implements an "early layer learns first" philosophy by multiplying each layer's residual by a scalar that gradually warms up from 0 to 1, with deeper layers taking more warmup steps. In this way, deeper layers wait for earlier layers to settle into a more stable regime before contributing to learning. We demonstrate the effectiveness of ProRes through pretraining experiments across various model scales, as well as across normalization and initialization schemes. Comprehensive analysis shows that ProRes not only stabilizes pretraining but also introduces a unique optimization trajectory, leading to faster convergence, stronger generalization, and better downstream performance. Our code is available at https://github.com/dandingsky/ProRes.
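The mechanism described in the abstract can be illustrated with a minimal sketch. The abstract does not state the shape of the warmup schedule or how the warmup length grows with depth, so the choices here are assumptions made purely for illustration: a linear ramp, and a warmup length proportional to the (1-indexed) layer depth. The names `prores_scale`, `forward`, and `base_warmup_steps` are hypothetical and are not taken from the authors' repository, which should be consulted for the actual schedule.

```python
from typing import Callable, Sequence


def prores_scale(step: int, layer_idx: int, base_warmup_steps: int) -> float:
    """Residual scalar for one layer at a given training step.

    ASSUMPTION: linear ramp from 0 to 1, with layer `layer_idx` (0-indexed)
    warming up over (layer_idx + 1) * base_warmup_steps steps, so deeper
    layers take longer. The paper's actual schedule may differ.
    """
    if base_warmup_steps <= 0:
        raise ValueError("base_warmup_steps must be positive")
    warmup_steps = (layer_idx + 1) * base_warmup_steps
    return min(1.0, max(0.0, step / warmup_steps))


def forward(
    x: float,
    layers: Sequence[Callable[[float], float]],
    step: int,
    base_warmup_steps: int,
) -> float:
    """Toy residual stack on scalars: x <- x + alpha_l(step) * f_l(x)."""
    for layer_idx, layer in enumerate(layers):
        alpha = prores_scale(step, layer_idx, base_warmup_steps)
        x = x + alpha * layer(x)  # residual branch scaled by the warmup scalar
    return x


if __name__ == "__main__":
    toy_layers = [lambda v: 2.0 * v, lambda v: 2.0 * v]
    for step in (0, 50, 100, 200):
        scales = [prores_scale(step, i, 100) for i in range(len(toy_layers))]
        print(step, scales, forward(1.0, toy_layers, step, 100))
```

At step 0 every scalar is 0, so the stack reduces to the identity; under the assumed schedule the first layer reaches full strength before the second, which mirrors the "early layer learns first" ordering.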

Architecture Design (Transformers, SSMs, MoE) | Natural Language Processing | Scaling Laws & Emergent Abilities | Training Efficiency & Optimization

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
