Meta AI | NYU | Mar 5, 2026 | arXiv:2603.05498

The Spike, the Sparse and the Sink: Anatomy of Massive Activations and Attention Sinks

Shangwen Sun, Alfredo Canziani, Yann LeCun, Jiachen Zhu

AI Summary

This paper investigates massive activation outliers and attention sinks in Transformer language models, revealing their distinct roles and the architectural factors driving their co-occurrence. Through controlled experiments, the authors demonstrate that massive activations create persistent, near-constant hidden representations acting as implicit model parameters, while attention sinks bias individual heads toward short-range dependencies. They identify the pre-norm configuration as the key architectural component enabling the co-occurrence of these phenomena, showing that removing it decouples them.

Key Contribution

The pre-norm configuration in Transformers is what links massive activation outliers to attention sinks. Once the two are decoupled, their distinct functions become visible: massive activations act globally as implicit parameters, while attention sinks modulate attention locally.
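For readers unfamiliar with the term, the toy sketch below contrasts a pre-norm residual block, x + f(norm(x)), with a post-norm block, norm(x + f(x)). It is only an illustration of why an outlier can persist in an unnormalized residual stream, not the paper's ablation: the `sublayer` function is a hypothetical stand-in for attention or an MLP, and the input vector and depth are arbitrary assumptions.

```python
import math

def rms_norm(x, eps=1e-6):
    # Scale the vector so its root-mean-square is (just under) 1.
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / rms for v in x]

def sublayer(x):
    # Hypothetical stand-in for an attention or MLP sublayer (bounded output).
    return [math.tanh(v) for v in x]

def pre_norm_block(x):
    # Pre-norm: normalize only the sublayer input; the residual stream itself
    # is never rescaled, so a large entry carries over to the next layer.
    update = sublayer(rms_norm(x))
    return [a + b for a, b in zip(x, update)]

def post_norm_block(x):
    # Post-norm: normalize after the residual addition, which rescales
    # the whole stream at every layer.
    return rms_norm([a + b for a, b in zip(x, sublayer(x))])

def run(block, x, depth):
    for _ in range(depth):
        x = block(x)
    return x

# Arbitrary toy input with one outlier channel (index 2).
x0 = [0.5, -0.3, 200.0, 0.1]
max_pre = max(abs(v) for v in run(pre_norm_block, x0, 6))
max_post = max(abs(v) for v in run(post_norm_block, x0, 6))
print(f"pre-norm max |x|:  {max_pre:.2f}")
print(f"post-norm max |x|: {max_post:.2f}")
```

In this toy setting the pre-norm stream keeps its outlier across all six blocks, while the post-norm stream is bounded by the normalization at every layer. Whether this mirrors the mechanism studied in the paper is an assumption of the sketch.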

Abstract

We study two recurring phenomena in Transformer language models: massive activations, in which a small number of tokens exhibit extreme outliers in a few channels, and attention sinks, in which certain tokens attract disproportionate attention mass regardless of semantic relevance. Prior work observes that these phenomena frequently co-occur and often involve the same tokens, but their functional roles and causal relationship remain unclear. Through systematic experiments, we show that the co-occurrence is largely an architectural artifact of modern Transformer design, and that the two phenomena serve related but distinct functions. Massive activations operate globally: they induce near-constant hidden representations that persist across layers, effectively functioning as implicit parameters of the model. Attention sinks operate locally: they modulate attention outputs across heads and bias individual heads toward short-range dependencies. We identify the pre-norm configuration as the key choice that enables the co-occurrence, and show that ablating it causes the two phenomena to decouple.
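The two phenomena defined above can be made concrete with simple diagnostics. The following is a minimal sketch on hand-made toy data, not the authors' measurement code: the max-to-median magnitude ratio, the helper names, and the choice of the first token as the candidate sink are all assumptions made for illustration.

```python
import math
from statistics import median

def massive_activation_ratio(hidden):
    """hidden[token][channel] -> (max |h| / median |h|, token index, channel index)."""
    magnitudes = [abs(v) for row in hidden for v in row]
    peak, token, channel = max(
        ((abs(v), t, c) for t, row in enumerate(hidden) for c, v in enumerate(row)),
        key=lambda item: item[0],
    )
    return peak / median(magnitudes), token, channel

def softmax(scores):
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def causal_attention(logits):
    # Query q may only attend to keys 0..q; masked positions get weight 0.
    n = len(logits)
    return [softmax(row[: q + 1]) + [0.0] * (n - q - 1) for q, row in enumerate(logits)]

def sink_mass(attn, sink_index=0):
    """Mean attention weight that queries at or after sink_index place on that key."""
    rows = attn[sink_index:]
    return sum(row[sink_index] for row in rows) / len(rows)

# Toy hidden states: one token carries an extreme value in a single channel.
hidden = [
    [1.0, -1.0, 500.0, 1.0],
    [1.0, 1.0, -1.0, 1.0],
    [-1.0, 1.0, 1.0, -1.0],
    [1.0, -1.0, 1.0, 1.0],
]
ratio, token, channel = massive_activation_ratio(hidden)

# Toy attention logits: every query scores key 0 much higher than the rest,
# compared against a flat baseline with equal logits everywhere.
sinky = causal_attention([[4.0, 0.0, 0.0, 0.0] for _ in range(4)])
flat = causal_attention([[0.0] * 4 for _ in range(4)])

print(f"max/median ratio {ratio:.0f} at token {token}, channel {channel}")
print(f"mass on token 0: sink-like {sink_mass(sinky):.3f}, flat {sink_mass(flat):.3f}")
```

Comparing against the flat baseline matters because, under a causal mask, early keys receive some attention mass from early queries even when no sink is present.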

Architecture Design (Transformers, SSMs, MoE) | Interpretability & Mechanistic Interp | Natural Language Processing

Citation Metrics

Citations: 0

Influential citations: 0

References: 81

Year: 2026

Venue: N/A
