Stanford HAI, Harvard, Northeastern, UCL | Apr 30, 2026 | arXiv:2604.28119

Do Sparse Autoencoders Capture Concept Manifolds?

Usha Bhalla, Thomas Fel, Can Rager, Sheridan Feucht, Tal Haklay, Daniel Wurgaft, Siddharth Boppana, Matthew Kowal, Vasudev Shyam, Jack Merullo, Atticus Geiger, Ekdeep Singh Lubana

AI Summary

This paper investigates whether and how sparse autoencoders (SAEs) capture concept manifolds, challenging the common assumption that concepts align with independent linear directions. The authors develop a theoretical framework showing that SAEs can capture a manifold globally (a compact group of atoms whose linear span contains it) or locally (features that each tile a restricted region of it). Empirically, SAEs exhibit "dilution," a fragmented, suboptimal mix of the global and local strategies that makes manifold structure hard to see at the level of individual concepts.

Key Contribution

Sparse autoencoders, despite their popularity for extracting interpretable features, recover the manifold structure of concepts only suboptimally, fragmenting it across multiple diluted features.

Abstract

Sparse autoencoders (SAEs) are widely used to extract interpretable features from neural network representations, often under the implicit assumption that concepts correspond to independent linear directions. However, a growing body of evidence suggests that many concepts are instead organized along low-dimensional manifolds encoding continuous geometric relationships. This raises three basic questions: what does it mean for an SAE to capture a manifold, when do existing SAE architectures do so, and how? We develop a theoretical framework that answers these questions and show that SAEs can capture manifolds in two fundamentally different ways: globally, by allocating a compact group of atoms whose linear span contains the entire manifold, or locally, by distributing it across features that each selectively tile a restricted region of the underlying geometry. Empirically, we find that SAEs suboptimally recover continuous structures, mixing the global subspace and local tiling solutions in a fragmented regime we call dilution. This explains why manifold structure is rarely visible at the level of individual concepts and motivates post-hoc unsupervised discovery methods that search for coherent groups of atoms rather than isolated directions. More broadly, our results suggest that future representation learning methods should treat geometric objects, not just individual directions, as the basic units of interpretability.
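The two capture modes named in the abstract can be illustrated on a toy example. The sketch below is not the paper's method and trains no SAE: the circle manifold, the ambient dimension `d`, the atom count `K`, and the top-1 encoder are all assumptions chosen for illustration. It hand-builds a "global" dictionary (two atoms whose span contains the circle) and a "local" dictionary (`K` atoms that each cover one arc) and compares their reconstructions.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, K = 16, 360, 12  # ambient dim, sample count, local atom count (all assumed)

# Toy manifold: a unit circle inside a random 2-D plane of R^d.
Q, _ = np.linalg.qr(rng.standard_normal((d, 2)))
u, v = Q[:, 0], Q[:, 1]  # orthonormal basis of the plane
theta = np.linspace(0.0, 2 * np.pi, n, endpoint=False)
X = np.outer(np.cos(theta), u) + np.outer(np.sin(theta), v)  # shape (n, d)

# Global capture: a compact group of atoms whose span contains the manifold.
# Simplification: codes are signed here; a ReLU encoder would need extra atoms.
D_global = np.stack([u, v])        # (2, d)
codes_global = X @ D_global.T      # every point uses both atoms
err_global = np.linalg.norm(X - codes_global @ D_global, axis=1).max()

# Local capture: K atoms placed on the circle, each active only on its own arc.
centers = 2 * np.pi * np.arange(K) / K
D_local = np.outer(np.cos(centers), u) + np.outer(np.sin(centers), v)  # (K, d)
sims = X @ D_local.T               # cosine similarity to each atom
winner = sims.argmax(axis=1)       # top-1 encoder (an assumed, hand-built rule)
codes_local = np.zeros_like(sims)
codes_local[np.arange(n), winner] = sims[np.arange(n), winner]
err_local = np.linalg.norm(X - codes_local @ D_local, axis=1).max()

# Fraction of points on which each local atom fires (roughly 1/K each).
frac_active_per_atom = (codes_local > 0).mean(axis=0)

print(f"global: max error {err_global:.2e}, atoms used per point: 2")
print(f"local:  max error {err_local:.3f}, atoms used per point: 1")
print("local atom firing rates:", np.round(frac_active_per_atom, 3))
```

In this toy setting the global dictionary reconstructs the circle exactly but no single atom reveals the circular structure, while each local atom is selective for a narrow arc and reconstruction error shrinks only as `K` grows. In both cases the manifold is visible only when the relevant atoms are considered as a group.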

Architecture Design (Transformers, SSMs, MoE) | Interpretability & Mechanistic Interp

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
