The paper introduces Progressive Residual Warmup (ProRes), a novel pretraining technique for Transformers that gradually warms up residual connections layer-by-layer, starting with earlier layers. This approach mitigates pretraining instability by allowing shallower layers to stabilize before deeper layers contribute to the learning process. Experiments across various model scales and configurations demonstrate that ProRes stabilizes pretraining, accelerates convergence, and improves generalization and downstream performance.
By strategically warming up residual connections layer-by-layer, ProRes unlocks faster and more stable pretraining for language models.
Transformer architectures serve as the backbone for most modern Large Language Models, so their pretraining stability and convergence speed are of central concern. Motivated by the logical dependency of sequentially stacked layers, we propose Progressive Residual Warmup (ProRes) for language model pretraining. ProRes implements an "early layer learns first" philosophy by multiplying each layer's residual with a scalar that gradually warms up from 0 to 1, with deeper layers warming up over more steps. In this way, deeper layers wait for early layers to settle into a more stable regime before contributing to learning. We demonstrate the effectiveness of ProRes through pretraining experiments across various model scales, as well as normalization and initialization schemes. Comprehensive analysis shows that ProRes not only stabilizes pretraining but also introduces a unique optimization trajectory, leading to faster convergence, stronger generalization, and better downstream performance. Our code is available at https://github.com/dandingsky/ProRes.
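The warmup idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation (see the linked repository for that). It assumes that "residual" means the output of each layer's residual branch, that the scalar follows a linear ramp, and that the warmup length grows in proportion to the layer index. The names `prores_scale`, `forward`, and `base_warmup_steps` are hypothetical and chosen only for illustration.

```python
def prores_scale(step, layer_idx, base_warmup_steps=1000):
    """Return a residual scale in [0, 1] for one layer at a training step.

    Assumption (not from the paper): a linear ramp in which layer
    `layer_idx` (1-indexed) needs `layer_idx * base_warmup_steps` steps
    to reach 1, so deeper layers warm up more slowly.
    """
    if layer_idx < 1:
        raise ValueError("layer_idx is 1-indexed and must be >= 1")
    warmup_steps = layer_idx * base_warmup_steps
    return min(1.0, step / warmup_steps)


def forward(x, layers, step, base_warmup_steps=1000):
    """Apply a stack of residual layers with progressively warmed-up branches.

    Each element of `layers` is a callable standing in for a Transformer
    sub-block; its output is scaled before being added back to the stream.
    """
    for layer_idx, layer in enumerate(layers, start=1):
        alpha = prores_scale(step, layer_idx, base_warmup_steps)
        x = x + alpha * layer(x)
    return x


if __name__ == "__main__":
    # Toy "layers" that always contribute 1.0, to make the scaling visible.
    toy_layers = [lambda h: 1.0, lambda h: 1.0]
    for step in (0, 500, 1000, 2000):
        print(step, forward(0.0, toy_layers, step))
```

At step 0 every scale is 0, so the stack acts as the identity; as training proceeds, the first layer's branch reaches full strength before the second's does. Other schedule shapes or depth-dependent warmup lengths would fit the same description, so the linear form here is one possible choice rather than the paper's specified one.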