The paper investigates how feature correlations affect superposition in neural networks, challenging the traditional view that superposition mainly introduces interference to be suppressed. The authors introduce Bag-of-Words Superposition (BOWS), a controlled setting for encoding binary bag-of-words representations of internet text, and use it to show that interference becomes constructive when features are arranged according to their co-activation patterns. This arrangement is more prevalent in models trained with weight decay and gives rise to semantic clusters and cyclical structures, offering an account of superposition that better matches what is observed in real language models.
Forget interference as just noise: correlated features in neural networks can constructively superpose to form semantic clusters, especially with weight decay.
A central idea in mechanistic interpretability is that neural networks represent more features than they have dimensions, arranging them in superposition to form an over-complete basis. This framing has been influential, motivating dictionary learning approaches such as sparse autoencoders. However, superposition has mostly been studied in idealized settings where features are sparse and uncorrelated. In these settings, superposition is typically understood as introducing interference that must be minimized geometrically and filtered out by non-linearities such as ReLUs, yielding local structures like regular polytopes. We show that this account is incomplete for realistic data by introducing Bag-of-Words Superposition (BOWS), a controlled setting to encode binary bag-of-words representations of internet text in superposition. Using BOWS, we find that when features are correlated, interference can be constructive rather than just noise to be filtered out. This is achieved by arranging features according to their co-activation patterns, making interference between active features constructive, while still using ReLUs to avoid false positives. We show that this kind of arrangement is more prevalent in models trained with weight decay and naturally gives rise to semantic clusters and cyclical structures which have been observed in real language models yet were not explained by the standard picture of superposition. Code for this paper can be found at https://github.com/LucasPrietoAl/correlations-feature-geometry.
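The contrast drawn in the abstract can be illustrated with a small hand-built example. This is only a sketch under stated assumptions, not the BOWS setup or the authors' code: three features are stored in two dimensions with a tied-weight readout ReLU(WᵀWx + b), features 0 and 1 are assumed to always co-activate, and the directions and biases are chosen by hand rather than learned. The names `unit`, `interference`, and `reconstruct` are hypothetical helpers introduced for illustration.

```python
import math

def unit(angle_deg):
    """Unit vector in 2D at the given angle (degrees)."""
    t = math.radians(angle_deg)
    return (math.cos(t), math.sin(t))

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def interference(dirs, x, i):
    """Pre-activation that feature i receives from the other active features."""
    return sum(dot(dirs[i], dirs[j]) * x[j] for j in range(len(dirs)) if j != i)

def reconstruct(dirs, bias, x):
    """Tied-weight readout ReLU(W^T W x + b); the columns of W are `dirs`."""
    hidden = [sum(d[k] * xj for d, xj in zip(dirs, x)) for k in range(2)]
    return [max(dot(d, hidden) + b, 0.0) for d, b in zip(dirs, bias)]

# Regular-polytope arrangement: three directions 120 degrees apart,
# so every pair of features has cosine -0.5.
polytope = [unit(90), unit(210), unit(330)]

# Correlation-aware arrangement (hand-picked assumption): the co-activating
# features 0 and 1 sit 60 degrees apart, feature 2 points the opposite way.
clustered = [unit(30), unit(-30), unit(180)]
clustered_bias = [-0.5, -0.5, 0.0]  # assumed negative bias on the clustered pair

x_pair = [1.0, 1.0, 0.0]  # features 0 and 1 active together
x_solo = [0.0, 0.0, 1.0]  # feature 2 active alone

# Interference on feature 0 from its co-active partner:
# destructive in the polytope, constructive in the clustered arrangement.
poly_term = interference(polytope, x_pair, 0)
clus_term = interference(clustered, x_pair, 0)

# With the negative bias, the clustered arrangement recovers both inputs,
# and the ReLU zeroes out the inactive features.
recon_pair = reconstruct(clustered, clustered_bias, x_pair)
recon_solo = reconstruct(clustered, clustered_bias, x_solo)
```

In this toy, the partner feature contributes -0.5 to feature 0's pre-activation under the polytope arrangement and +0.5 under the clustered one. The cost of the clustered arrangement shows up only when feature 0 or 1 fires without its partner, which is rare by assumption when the two are strongly correlated.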