This paper introduces Layerwise Learning Rate (LLR), an adaptive learning rate scheme that assigns a distinct learning rate to each Transformer layer based on the heavy-tailedness of its weight correlation matrix. LLR assigns larger learning rates to layers with weaker heavy-tailedness and smaller learning rates to layers with stronger heavy-tailedness, which promotes balanced training across layers. Experiments across several LLM architectures and scales show that LLR achieves up to 1.5x training speedup and improves zero-shot accuracy over a uniform learning rate, with minimal tuning overhead.
LLMs train up to 1.5x faster and generalize better with a surprisingly simple trick: set each layer's learning rate according to the "heavy-tailedness" of its weight matrices.
Learning rate configuration is a fundamental aspect of modern deep learning. The prevailing practice of applying a uniform learning rate across all layers overlooks the structural heterogeneity of Transformers, potentially limiting their effectiveness as the backbone of Large Language Models (LLMs). In this paper, we introduce Layerwise Learning Rate (LLR), an adaptive scheme that assigns distinct learning rates to individual Transformer layers. Our method is grounded in Heavy-Tailed Self-Regularization (HT-SR) theory, which characterizes the empirical spectral density (ESD) of weight correlation matrices to quantify heavy-tailedness. Layers with weaker heavy-tailedness are assigned larger learning rates to accelerate their training, while layers with stronger heavy-tailedness receive smaller learning rates. By tailoring learning rates in this manner, LLR promotes balanced training across layers, leading to faster convergence and improved generalization. Extensive experiments across architectures (from LLaMA to GPT-nano), optimizers (AdamW and Muon), and parameter scales (60M-1B) demonstrate that LLR achieves up to 1.5x training speedup and outperforms baselines, notably raising average zero-shot accuracy from 47.09% to 49.02%. A key advantage of LLR is its low tuning overhead: it transfers nearly optimal LR settings directly from the uniform baseline. Code is available at https://github.com/hed-ucas/Layer-wise-Learning-Rate.
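The abstract describes the idea but not the exact formulas, so the following is only a minimal sketch of how a layerwise assignment of this kind could look, not the authors' implementation (see the linked repository for that). It makes several assumptions: heavy-tailedness is measured with a Hill estimate of the power-law exponent alpha of the ESD of W^T W (smaller alpha meaning a heavier tail), and alpha is mapped linearly onto a learning rate multiplier. The function names `hill_alpha` and `layerwise_lrs` and the parameters `tail_frac`, `s_min`, and `s_max` are hypothetical.

```python
import numpy as np


def hill_alpha(weight, tail_frac=0.5):
    """Hill-style estimate of the power-law exponent of the ESD of W^T W.

    Assumption: a smaller alpha indicates a heavier-tailed spectrum.
    Assumes the eigenvalues in the tail are strictly positive.
    """
    # Eigenvalues of W^T W are the squared singular values of W.
    eigs = np.sort(np.linalg.svd(weight, compute_uv=False) ** 2)
    k = max(2, int(len(eigs) * tail_frac))  # number of tail eigenvalues used
    tail = eigs[-k:]                        # the k largest eigenvalues, ascending
    return 1.0 + k / np.sum(np.log(tail / tail[0]))


def layerwise_lrs(alphas, base_lr, s_min=0.5, s_max=1.5):
    """Map per-layer alphas to learning rates (hypothetical linear mapping).

    Larger alpha (weaker heavy-tailedness) gets a larger learning rate;
    smaller alpha (stronger heavy-tailedness) gets a smaller one.
    """
    alphas = np.asarray(alphas, dtype=float)
    span = alphas.max() - alphas.min()
    if span == 0:
        # All layers look alike: fall back to the uniform learning rate.
        return np.full_like(alphas, base_lr)
    t = (alphas - alphas.min()) / span      # 0 = heaviest tail, 1 = lightest
    return base_lr * (s_min + t * (s_max - s_min))


# Usage: two toy "layers", one Gaussian and one drawn from a Student-t distribution.
rng = np.random.default_rng(0)
weights = [rng.standard_normal((64, 32)), rng.standard_t(df=2, size=(64, 32))]
alphas = [hill_alpha(w) for w in weights]
lrs = layerwise_lrs(alphas, base_lr=1e-3)
print(dict(zip(["gaussian", "student_t"], lrs)))
```

In a training loop, the resulting values would typically be written into one optimizer parameter group per layer and refreshed periodically as the weight spectra change. How often to refresh, and the range of the multiplier, are tuning choices this sketch does not settle.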