This paper investigates massive activation outliers and attention sinks in Transformer language models, revealing their distinct roles and the architectural factors driving their co-occurrence. Through controlled experiments, the authors demonstrate that massive activations create persistent, near-constant hidden representations acting as implicit model parameters, while attention sinks bias individual heads toward short-range dependencies. They identify the pre-norm configuration as the key architectural component enabling the co-occurrence of these phenomena, showing that removing it decouples them.
Pre-normalization in Transformers is the key architectural choice behind the co-occurrence of massive activation outliers and attention sinks. Ablating it decouples the two and exposes their distinct functions: massive activations act globally as implicit parameters, while attention sinks act locally to bias individual heads toward short-range dependencies.
We study two recurring phenomena in Transformer language models: massive activations, in which a small number of tokens exhibit extreme outliers in a few channels, and attention sinks, in which certain tokens attract disproportionate attention mass regardless of semantic relevance. Prior work observes that these phenomena frequently co-occur and often involve the same tokens, but their functional roles and causal relationship remain unclear. Through systematic experiments, we show that the co-occurrence is largely an architectural artifact of modern Transformer design, and that the two phenomena serve related but distinct functions. Massive activations operate globally: they induce near-constant hidden representations that persist across layers, effectively functioning as implicit parameters of the model. Attention sinks operate locally: they modulate attention outputs across heads and bias individual heads toward short-range dependencies. We identify the pre-norm configuration as the key choice that enables the co-occurrence, and show that ablating it causes the two phenomena to decouple.
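To make the two definitions concrete, the following is a minimal NumPy sketch of how each phenomenon might be flagged from one layer's hidden states and one head's attention weights. The function names, the `ratio` and `min_mean_mass` thresholds, and the synthetic data are illustrative assumptions, not the detection criteria used in the paper.

```python
import numpy as np


def find_massive_activations(hidden, ratio=100.0):
    """Return (token, channel) pairs whose magnitude exceeds `ratio` times
    the median magnitude of the layer.

    hidden: array of shape (num_tokens, num_channels).
    `ratio` is an illustrative threshold, not a value taken from the paper.
    """
    mags = np.abs(hidden)
    threshold = ratio * np.median(mags)
    tokens, channels = np.nonzero(mags > threshold)
    return list(zip(tokens.tolist(), channels.tolist()))


def find_attention_sinks(attn, min_mean_mass=0.5):
    """Return key positions whose attention mass, averaged over all queries,
    exceeds `min_mean_mass`.

    attn: array of shape (num_queries, num_keys) whose rows sum to 1.
    `min_mean_mass` is an illustrative threshold, not a value from the paper.
    """
    mean_mass = attn.mean(axis=0)
    return np.nonzero(mean_mass > min_mean_mass)[0].tolist()


# Synthetic hidden states: unit-scale noise plus one extreme outlier
# at token 0, channel 3.
rng = np.random.default_rng(0)
hidden = rng.normal(size=(8, 16))
hidden[0, 3] = 500.0

# Synthetic causal attention for one head: every query after the first puts
# 0.8 of its mass on token 0 and spreads the rest over the later visible keys.
n = 8
attn = np.zeros((n, n))
attn[0, 0] = 1.0
for i in range(1, n):
    attn[i, 0] = 0.8
    attn[i, 1:i + 1] = 0.2 / i

massive = find_massive_activations(hidden)
sinks = find_attention_sinks(attn)
print("massive activations (token, channel):", massive)
print("attention sink positions:", sinks)
```

In this toy setup both checks flag token 0, mirroring the co-occurrence on the same tokens that the abstract describes. The two functions are independent, so each phenomenon can be measured separately when testing whether an architectural change decouples them.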