This paper introduces a gating mechanism that dynamically replaces Transformer MLP layers with linear surrogates, aiming to quantify and reallocate the MLP's nonlinearity budget. Experiments across six models (162M-2.8B parameters), two architectures, and three corpora show that a large share of MLP computations is near-linear: in GPT-2, 25-56% can be routed to a linear transformation at under 1% perplexity cost. The authors also show that selectively linearizing specific layers can improve performance over the baseline, suggesting that some nonlinear MLPs are actively detrimental.
In GPT-2, up to 56% of MLP computations are near-linear and can be routed to a linear surrogate at under 1% perplexity cost, and selectively replacing nonlinear MLP layers with linear ones can actually *improve* performance.
We investigate when transformer MLP nonlinearity is actually necessary. A gate with $d+1$ parameters decides when to replace the full MLP with a linear surrogate. Through systematic investigation across six models (162M-2.8B parameters), two architectures, and three corpora, we establish that the need for nonlinearity cannot be predicted from token identity: cross-corpus correlation is near zero ($r < 0.05$). The routing decision is fully contextual. Despite weak per-instance predictability, the gate exploits a heavily skewed distribution in which most MLP computations are near-linear, achieving 25-56% linear routing at <1% perplexity cost in GPT-2. In GPT-2 Large, 11 of 36 layers beat baseline with gating and no layer exceeds 3.7% all-linear cost. This success is architecture-dependent: Pythia models show higher costs, though Pythia-2.8B's full 32-layer sweep reveals one layer that narrowly beats baseline. As a proof of concept, we progressively replace middle-layer MLPs with frozen linear matrices: 5 of 24 layers linearize at zero cost. With a full training budget, 4 linearized layers yield a 10.2% perplexity improvement, and a two-phase gated approach pushes this to 17.3%, beating a vanilla fine-tuning control and confirming that the nonlinear MLPs at these layers were actively harmful.
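To make the routing idea concrete, here is a minimal NumPy sketch of a per-token gate that chooses between a full MLP and a linear surrogate. The abstract only states that the gate has $d+1$ parameters; the specific form used here (a weight vector of size $d$ plus a scalar bias, passed through a sigmoid and thresholded at 0.5) is an assumption, as are the least-squares fit of the surrogate, the GELU activation, the toy dimensions, and every function name. It should be read as an illustration of the concept, not as the authors' implementation.

```python
import numpy as np

def gelu(x):
    # tanh approximation of GELU (activation choice is an assumption)
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def mlp(x, W1, b1, W2, b2):
    # Standard two-layer transformer MLP: d -> hidden -> d
    return gelu(x @ W1 + b1) @ W2 + b2

def fit_linear_surrogate(x, y):
    # Assumption: fit y ~= x @ A + c by ordinary least squares.
    # The source does not say how the surrogate is obtained.
    x_aug = np.hstack([x, np.ones((x.shape[0], 1))])
    sol, *_ = np.linalg.lstsq(x_aug, y, rcond=None)
    return sol[:-1], sol[-1]  # A has shape (d, d), c has shape (d,)

def gated_mlp(x, gate_w, gate_b, mlp_params, lin_params, threshold=0.5):
    # Hypothetical gate with d + 1 parameters: gate_w (d values) and gate_b (1 value).
    score = 1.0 / (1.0 + np.exp(-(x @ gate_w + gate_b)))  # one score per token
    use_mlp = score > threshold
    A, c = lin_params
    out = x @ A + c  # cheap linear path for every token
    if use_mlp.any():
        # Overwrite only the tokens the gate routes to the full MLP
        out[use_mlp] = mlp(x[use_mlp], *mlp_params)
    return out, use_mlp

# Toy setup: random weights stand in for a trained model
rng = np.random.default_rng(0)
d, hidden, n = 8, 32, 256
mlp_params = (rng.normal(size=(d, hidden)) * 0.1, np.zeros(hidden),
              rng.normal(size=(hidden, d)) * 0.1, np.zeros(d))
x = rng.normal(size=(n, d))
lin_params = fit_linear_surrogate(x, mlp(x, *mlp_params))

gate_w, gate_b = rng.normal(size=d), 0.0  # untrained gate, for illustration only
out, use_mlp = gated_mlp(x, gate_w, gate_b, mlp_params, lin_params)
print(f"gate parameters: {gate_w.size + 1}, "
      f"linear routing fraction: {1 - use_mlp.mean():.2f}")
```

In this sketch the gate sees the same hidden state as the MLP, so the decision depends on context rather than on token identity, which matches the abstract's observation that routing is fully contextual. A trained gate would presumably be optimized to trade perplexity against the fraction of tokens sent down the linear path; that training procedure is not described in the text above and is not modeled here.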