PKU | May 26, 2026 | arXiv:2605.26895

Negligible in Size, Significant in Effect: On Scale Vectors in Large Language Models

Mingze Wang, Shuchen Zhu, Yuxin Fang, Binghui Li, Kai Shen, Shu Zhong

AI Summary

This paper investigates the role of scale vectors in LLM normalization layers, finding that they significantly impact pre-training despite their small size. Through theoretical analysis, the authors demonstrate that scale vectors in Pre-Norm architectures primarily improve optimization via preconditioning rather than expressivity. Based on these insights, they propose and validate three lightweight improvements to scale vectors, which, when combined, consistently improve pre-training loss and scaling behavior across various model sizes and training configurations.

Key Contribution

Scale vectors make up only a tiny fraction of LLM parameters, yet they are critical for pre-training. This paper shows how to improve them further with simple, theoretically grounded changes.

Abstract

Normalization layers in modern large language models (LLMs) consist of a deterministic normalization operation and a learnable scale vector. While the normalization operation has been extensively studied, the scale vector remains poorly understood despite its ubiquitous use. In this work, we present a systematic study of scale vectors in LLMs from the perspectives of expressivity, optimization, and architectural structure. First, we show empirically that although scale vectors constitute only a negligible fraction of model parameters, removing them substantially degrades LLM pre-training. Our theory further shows that, in Pre-Norm architectures, scale vectors do not increase expressivity; instead, they improve optimization through a self-amplifying preconditioning effect on subsequent linear mappings. Second, we investigate the role of weight decay for scale vectors. By distinguishing Input-Norm and Output-Norm layers, we theoretically show that weight decay is beneficial for the former but harmful for the latter, due to their distinct roles in optimization and expressivity. Third, motivated by this understanding, we propose three lightweight and complementary improvements to scale vectors: branch-specific heterogeneity, improved placement around linear mappings, and magnitude-direction reparameterization. Both theory and experiments show that each improvement yields consistent gains. Finally, we combine these improvements into a unified scale-vector strategy and evaluate it through extensive LLM pre-training experiments on dense and mixture-of-experts models ranging from 0.12B to 2B parameters, across multiple optimizers and learning rate schedules, under industrial-scale token budgets. The unified strategy consistently achieves lower terminal loss than well-tuned baselines and exhibits more favorable scaling behavior, while adding negligible parameter and computational overhead.
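The abstract's claim that scale vectors add no expressivity in Pre-Norm architectures can be illustrated with a minimal NumPy sketch. It assumes an RMSNorm-style normalization followed directly by a bias-free linear map; the function `rms_norm` and the names `gamma`, `W`, and `W_folded` are illustrative choices, not taken from the paper. Under these assumptions, the scale vector can be folded into the columns of the next weight matrix without changing the output.

```python
import numpy as np

def rms_norm(x, eps=1e-6):
    """Deterministic part of the layer: rescale each row to unit root-mean-square."""
    return x / np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)

rng = np.random.default_rng(0)
d_in, d_out = 8, 4                        # hypothetical layer sizes
x = rng.normal(size=(3, d_in))            # a small batch of hidden states
gamma = rng.uniform(0.5, 2.0, size=d_in)  # learnable scale vector (assumed values)
W = rng.normal(size=(d_out, d_in))        # weights of the subsequent linear map

# Parameterization 1: normalize, apply the scale vector, then the linear map.
y_scaled = (rms_norm(x) * gamma) @ W.T

# Parameterization 2: fold the scale vector into the weights.
# Broadcasting multiplies column j of W by gamma[j], i.e. W @ diag(gamma).
W_folded = W * gamma
y_folded = rms_norm(x) @ W_folded.T

# Both parameterizations compute the same function of x.
max_abs_diff = float(np.max(np.abs(y_scaled - y_folded)))
print(f"max |difference| = {max_abs_diff:.2e}")
```

The two parameterizations represent the same set of functions, which matches the expressivity statement. Gradient-based training treats them differently, however, and that is where the abstract locates the benefit: a preconditioning effect on the subsequent linear mapping. The sketch does not model that training dynamic.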

Architecture Design (Transformers, SSMs, MoE) | Scaling Laws & Emergent Abilities | Training Efficiency & Optimization

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
