This position paper examines why the exponents in neural scaling laws appear to be fixed, attributing them to three generic mechanisms: a one-third time scaling from the strong nonlinearity of Softmax, an inverse width scaling from representational superposition, and an inverse depth scaling from ensemble averaging of Transformer layers. The authors argue that these exponents are robust across a wide range of data structures and architectural details, whereas the coefficients are expected to be sensitive to both and directly determine practical quantities such as the optimal model shape and the compute-optimal frontier. They conclude that understanding the coefficients is the key to near-term performance improvements in large language models.
If the exponents of neural scaling laws are fixed, the paper argues, near-term performance gains in large language models depend on understanding the coefficients.
Neural scaling laws describe how pre-training loss decays as power laws with training time, model size, and compute. This position paper argues that the exponents of these power laws are fixed by generic mechanisms: a one-third time scaling due to the strong nonlinearity of Softmax, an inverse width scaling due to representational superposition, and an inverse depth scaling due to ensemble averaging of Transformer layers. These mechanisms are robust to a wide range of data structures and architectural details, placing current large language models in a universality class with fixed exponents. The coefficients, however, are expected to be sensitive to data and architecture details, and directly determine practical quantities such as the optimal model shape and the compute-optimal frontier. We therefore argue that understanding the coefficients is the key to near-term performance improvements, and that a closer examination of the current universality class may reveal pathways to better universality classes.
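To make the distinction between exponents and coefficients concrete, the following is a minimal sketch rather than the paper's own model. It assumes the three mechanisms contribute additively to the loss, and that the parameter count scales as width squared times depth with constant factors dropped. The function names and the coefficients `a_t`, `a_w`, `a_d` are hypothetical, introduced only for illustration. Under these assumptions the optimal width always grows as the one-third power of the parameter budget, while its prefactor depends on the ratio `a_w / a_d`.

```python
def loss(t, width, depth, a_t, a_w, a_d, l_inf=0.0):
    """Illustrative loss with the exponents named in the abstract.

    Assumption: the three terms simply add. The abstract states the
    exponents (t^(-1/3), 1/width, 1/depth) but not how they combine.
    """
    return l_inf + a_t * t ** (-1.0 / 3.0) + a_w / width + a_d / depth


def optimal_shape(params, a_w, a_d):
    """Width and depth minimizing a_w/width + a_d/depth at fixed size.

    Assumption: params = width**2 * depth (constant factors dropped).
    Substituting depth = params / width**2 and setting the derivative
    with respect to width to zero gives
        width**3 = a_w * params / (2 * a_d).
    """
    width = (a_w * params / (2.0 * a_d)) ** (1.0 / 3.0)
    depth = params / width ** 2
    return width, depth


if __name__ == "__main__":
    # Same exponents, different coefficients: the optimal shape shifts.
    for a_w, a_d in [(2.0, 1.0), (16.0, 1.0)]:
        w, d = optimal_shape(1e6, a_w, a_d)
        print(f"a_w={a_w}, a_d={a_d}: width={w:.1f}, depth={d:.1f}")
```

In this toy setting, changing the coefficients moves the optimal width-to-depth ratio without altering any exponent, which is the sense in which the abstract says the coefficients determine the optimal model shape.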