This paper introduces a unifying framework for concept alignment in learned representations, addressing the ambiguity in existing methods that optimize different objectives under similar terminology. By decomposing alignment along two axes, what is aligned (representations vs. concepts) and at what level (instance-wise vs. distributional), the authors reveal the limitations of current approaches and establish four properties of alignment. Their proposed Coupled Sparse Autoencoder (CoSAE) demonstrates that strong instance-level alignment can be achieved with minimal paired data when leveraging distributional objectives, highlighting the multi-objective nature of concept alignment.
Optimizing concept alignment is a multi-objective challenge, and surprisingly, just 0.1% of paired data can yield strong instance-level alignment when it anchors distributional objectives.
Learned representations across models and modalities often exhibit striking structural similarities, suggesting shared underlying concept decompositions. However, concept alignment remains poorly defined: existing approaches optimize different objectives under the same terminology, obscuring what is actually aligned. We propose a unifying framework that decomposes alignment along two axes: what is aligned (representations vs. concepts) and at what level (instance-wise vs. distributional). This induces four corresponding properties (instance-wise and distributional variants of translation and concept consistency) and reveals precisely which of these guarantees existing methods provide. We further introduce InterVenchA, an intervention-based benchmark that separately measures extraction quality, translation quality, and concept consistency. Through theory and experiments, we show that commonly assumed equivalences between alignment objectives fail in practice: optimizing one property does not reliably recover the others, and purely unsupervised objectives fail to recover meaningful instance-level alignment. We then propose the Coupled Sparse Autoencoder (CoSAE), which jointly enforces complementary alignment objectives. Strong alignment emerges only in this regime. Surprisingly, as little as 0.1% paired data is sufficient to recover instance-level alignment when anchoring distributional objectives. Overall, our results show that concept alignment is fundamentally multi-objective: it must be defined, measured, and optimized as such.
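The abstract does not specify CoSAE's architecture or losses, so the following is only a minimal sketch of what jointly optimizing complementary alignment objectives could look like. It assumes two linear ReLU autoencoders, a squared-error reconstruction term, an L1 sparsity term, mean matching of concept activations as the distributional term, and squared distance between the codes of paired samples as the instance-wise term. All function names, loss choices, and weights (`lam_sparse`, `lam_dist`, `lam_inst`) are hypothetical and not taken from the paper.

```python
def encode(W, x):
    """Linear map followed by ReLU, giving non-negative concept activations."""
    return [max(0.0, sum(w * xi for w, xi in zip(row, x))) for row in W]


def decode(D, z):
    """Linear decoder; D has one row per input dimension."""
    return [sum(d * zi for d, zi in zip(row, z)) for row in D]


def sq_err(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def mean_vec(vectors):
    """Coordinate-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def coupled_loss(model_a, model_b, xs_a, xs_b, pairs,
                 lam_sparse=0.1, lam_dist=1.0, lam_inst=1.0):
    """Combine four illustrative objectives for two autoencoders.

    model_a, model_b: (encoder_matrix, decoder_matrix) tuples.
    xs_a, xs_b: unpaired samples from each model's representation space.
    pairs: a (possibly tiny) list of (x_a, x_b) samples known to correspond.
    """
    (Wa, Da), (Wb, Db) = model_a, model_b
    za = [encode(Wa, x) for x in xs_a]
    zb = [encode(Wb, x) for x in xs_b]

    # Each autoencoder should reconstruct its own inputs.
    recon = (sum(sq_err(decode(Da, z), x) for z, x in zip(za, xs_a)) / len(xs_a)
             + sum(sq_err(decode(Db, z), x) for z, x in zip(zb, xs_b)) / len(xs_b))

    # L1 penalty (activations are non-negative, so a plain sum suffices).
    sparsity = (sum(sum(z) for z in za) / len(za)
                + sum(sum(z) for z in zb) / len(zb))

    # Distributional term: match mean concept activations across unpaired data.
    distributional = sq_err(mean_vec(za), mean_vec(zb))

    # Instance-wise term: paired samples should share the same concept code.
    instance = 0.0
    if pairs:
        instance = sum(sq_err(encode(Wa, xa), encode(Wb, xb))
                       for xa, xb in pairs) / len(pairs)

    total = (recon + lam_sparse * sparsity
             + lam_dist * distributional + lam_inst * instance)
    return {"recon": recon, "sparsity": sparsity,
            "distributional": distributional, "instance": instance,
            "total": total}
```

In this sketch the distributional term uses every unpaired sample while the instance-wise term touches only `pairs`, which is one way a very small paired subset could anchor an otherwise unsupervised objective. Setting `lam_inst=0.0` yields a purely unsupervised variant for comparison.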