Mar 4, 2026
arXiv:2603.03993

Specialization of softmax attention heads: insights from the high-dimensional single-location model

AI Summary

This paper presents a theoretical model, based on multi-index and single-location regression, to explain the specialization of multi-head attention in transformers. It reveals that training dynamics under SGD consist of an initial unspecialized phase followed by a multi-stage specialization phase where heads align with latent signal directions. The paper also introduces Bayes-softmax attention, which achieves optimal prediction performance in the model, and demonstrates that softmax-1 reduces noise from irrelevant heads.

Key Contribution

Softmax attention heads specialize in stages during training. The softmax-1 activation reduces noise from irrelevant heads, and a novel Bayes-softmax attention achieves optimal prediction performance in the model studied.

Abstract

Multi-head attention enables transformer models to represent multiple attention patterns simultaneously. Empirically, head specialization emerges in distinct stages during training, while many heads remain redundant and learn similar representations. We propose a theoretical model capturing this phenomenon, based on the multi-index and single-location regression frameworks. In the first part, we analyze the training dynamics of multi-head softmax attention under SGD, revealing an initial unspecialized phase followed by a multi-stage specialization phase in which different heads sequentially align with latent signal directions. In the second part, we study the impact of attention activation functions on performance. We show that softmax-1 significantly reduces noise from irrelevant heads. Finally, we introduce the Bayes-softmax attention, which achieves optimal prediction performance in this setting.
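To make the softmax-1 claim concrete, here is a minimal sketch comparing standard softmax with softmax-1. It assumes the common definition of softmax-1, exp(s_i) / (1 + sum_j exp(s_j)), which this page does not spell out; the function names and the example scores are hypothetical and are not taken from the paper.

```python
import math

def softmax(scores):
    """Standard softmax: the weights always sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def softmax_1(scores):
    """Assumed softmax-1: exp(s_i) / (1 + sum_j exp(s_j)).

    The extra 1 in the denominator lets the weights sum to less than 1,
    so a head whose scores are all low can contribute almost nothing.
    """
    # Shift by max(max(scores), 0) for numerical stability; the constant 1
    # in the denominator becomes exp(-m) after the shift.
    m = max(max(scores), 0.0)
    exps = [math.exp(s - m) for s in scores]
    total = math.exp(-m) + sum(exps)
    return [e / total for e in exps]

# Hypothetical attention scores: a head that finds no matching token,
# and a head that matches strongly on the second token.
irrelevant = [-4.0, -4.2, -3.9, -4.1]
relevant = [-4.0, 3.0, -3.9, -4.1]

print(sum(softmax(irrelevant)))    # total weight forced to 1
print(sum(softmax_1(irrelevant)))  # total weight well below 1
print(sum(softmax_1(relevant)))    # total weight close to 1
```

Under this assumed definition, a head with uniformly low scores passes very little total weight to its values, whereas standard softmax forces it to distribute a full unit of attention. That is one plausible reading of how softmax-1 suppresses noise from irrelevant heads.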

Architecture Design (Transformers, SSMs, MoE)
Interpretability & Mechanistic Interp
Training Efficiency & Optimization

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
