This paper analyzes mode collapse in mean-field transformer models, in which token distributions degenerate to a single point over long inferences (many layers), a phenomenon not observed in practice. The authors show theoretically that auxiliary variables, such as positional encodings and fixed prompt insertions, prevent this collapse: the energy-maximizing distribution is a pushforward of the auxiliary variable distribution rather than a concentrated point mass. They further prove that positional encoding and prompt insertion offer universality of representation in the limit, and they validate their findings with mathematical experiments.
Positional encodings aren't just about position: they fundamentally prevent mean-field transformers from collapsing into meaningless single-point token distributions.
We use a mean-field-based transformer model to theoretically investigate how auxiliary variables, such as positional encoding, prevent mode collapse of self-attention mechanisms. Mean-field transformers have garnered significant attention in recent years as a tool for analyzing self-attention, because they allow token interactions to be analyzed comprehensively. However, analysis of this simple model suggests that mode collapse, where token distributions degenerate to a single point, occurs during long inferences (i.e., many layers), indicating a discrepancy with reality. This study investigates the mean-field transformer model and demonstrates that the introduction of auxiliary variables, such as positional encoding, acts as a counterforce against theoretical mode collapse. Specifically, we show that in the theoretical scheme, the energy-maximizing distribution does not degenerate to a single point; instead, it is characterized by a pushforward of the auxiliary variable distribution, thereby avoiding concentration on a Dirac measure. Our main examples are positional encoding and fixed prompt insertion, with the latter treated as a parallel auxiliary-variable mechanism. Furthermore, we demonstrate that positional encoding and prompt insertion possess universality of representation in the limit, meaning that the limit distribution of inference can exactly represent a wide class of distributions. We also analyze several key properties of positional encoding and metastability, and validate our theoretical results through mathematical experiments.
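
The collapse described in the abstract can be illustrated with a small particle simulation. The sketch below is not the paper's model or code: it assumes a common toy setting (tokens on the unit sphere, identity query/key/value matrices, a softmax attention drift followed by renormalization), and it models the auxiliary variable in one hypothetical way, by re-adding a fixed per-token vector `aux` at every layer with weight `alpha`. The names `attention_step`, `spread`, `run`, `beta`, `dt`, `aux`, and `alpha` are all illustrative assumptions.

```python
import numpy as np

def attention_step(x, beta, dt, aux=None, alpha=1.0):
    """One toy layer: softmax self-attention drift, then projection to the unit sphere."""
    scores = beta * (x @ x.T)                     # pairwise inner products (Q = K = I assumed)
    scores -= scores.max(axis=1, keepdims=True)   # subtract row max for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True) # row-stochastic attention matrix
    drift = weights @ x                           # attention-weighted average (V = I assumed)
    if aux is not None:
        # Hypothetical auxiliary-variable term: a fixed per-token vector
        # re-injected at every layer (an assumption, not the paper's construction).
        drift = drift + alpha * aux
    y = x + dt * drift
    return y / np.linalg.norm(y, axis=1, keepdims=True)

def spread(x):
    """Largest pairwise Euclidean distance between tokens."""
    diff = x[:, None, :] - x[None, :, :]
    return float(np.sqrt((diff ** 2).sum(axis=-1)).max())

def run(x0, layers, beta=1.0, dt=0.1, aux=None, alpha=1.0):
    """Apply the toy layer repeatedly, starting from tokens x0."""
    x = x0.copy()
    for _ in range(layers):
        x = attention_step(x, beta, dt, aux=aux, alpha=alpha)
    return x

rng = np.random.default_rng(0)
x0 = rng.normal(size=(16, 3))
x0[:, 0] = np.abs(x0[:, 0])                       # start all tokens in one hemisphere
x0 /= np.linalg.norm(x0, axis=1, keepdims=True)

plain = run(x0, layers=1000)                      # no auxiliary variable
with_aux = run(x0, layers=1000, aux=x0)           # each token keeps its own fixed vector

print("initial spread:       ", spread(x0))
print("spread without aux:   ", spread(plain))
print("spread with aux term: ", spread(with_aux))
```

In this toy setting the tokens without an auxiliary term contract toward a single point as the number of layers grows, while the tokens with a distinct fixed vector per token stay separated, since a fully collapsed configuration cannot balance different `aux` directions at once. This only mirrors the qualitative claim of the abstract; it says nothing about the paper's energy functional or its universality result.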