Brown | DEEL-IRT Saint Exupéry | Goodfire | ENS | Jun 8, 2026 | arXiv:2606.09653

A Unifying Framework for Concept-Based Representational Similarity

Grégoire Dhimoïla, Victor Boutin, Agustin Martin Picard, Thomas Fel, Thomas Serre

AI Summary

This paper introduces a unifying framework for concept alignment in learned representations, addressing the ambiguity of existing methods that optimize different objectives under the same terminology. By decomposing alignment along two axes, what is aligned (representations vs. concepts) and at what level (instance-wise vs. distributional), the authors identify four alignment properties and show which of them existing methods actually guarantee. Their proposed Coupled Sparse Autoencoder (CoSAE) shows that strong instance-level alignment can be recovered from very little paired data when that data anchors distributional objectives, underscoring that concept alignment is multi-objective.

Key Contribution

Concept alignment is a multi-objective problem: optimizing one alignment property does not reliably recover the others. Surprisingly, as little as 0.1% paired data is enough to recover strong instance-level alignment when it anchors distributional objectives.

Abstract

Learned representations across models and modalities often exhibit striking structural similarities, suggesting shared underlying concept decompositions. However, concept alignment remains poorly defined: existing approaches optimize different objectives under the same terminology, obscuring what is actually aligned. We propose a unifying framework that decomposes alignment along two axes: what is aligned (representations vs. concepts) and at what level (instance-wise vs. distributional). This induces four corresponding properties (instance-wise and distributional variants of translation and concept consistency) and reveals precisely which of these guarantees existing methods provide. We further introduce InterVenchA, an intervention-based benchmark that separately measures extraction quality, translation quality, and concept consistency. Through theory and experiments, we show that commonly assumed equivalences between alignment objectives fail in practice: optimizing one property does not reliably recover the others, and purely unsupervised objectives fail to recover meaningful instance-level alignment. We then propose the Coupled Sparse Autoencoder (CoSAE), which jointly enforces complementary alignment objectives. Strong alignment emerges only in this regime. Surprisingly, as little as 0.1% paired data is sufficient to recover instance-level alignment when anchoring distributional objectives. Overall, our results show that concept alignment is fundamentally multi-objective: it must be defined, measured, and optimized as such.
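The distinction between instance-wise and distributional alignment can be illustrated with a toy example. The sketch below is not the paper's method or benchmark: the random "concept codes", the moment-matching gap, and the function names `instance_gap` and `distribution_gap` are all assumptions made for illustration. It shuffles the pairing between two sets of codes, so the two sets have the same distribution while individual pairs no longer correspond.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical concept codes for 1000 inputs in "model A" (illustrative data only).
codes_a = rng.normal(size=(1000, 8))

# "Model B" holds exactly the same rows, but the pairing with A is shuffled.
perm = rng.permutation(len(codes_a))
codes_b = codes_a[perm]


def instance_gap(x, y):
    """Mean squared distance between paired rows (instance-wise mismatch)."""
    return float(np.mean(np.sum((x - y) ** 2, axis=1)))


def distribution_gap(x, y):
    """Squared difference of means and covariances (a crude distributional mismatch)."""
    mean_gap = np.sum((x.mean(axis=0) - y.mean(axis=0)) ** 2)
    cov_gap = np.sum((np.cov(x, rowvar=False) - np.cov(y, rowvar=False)) ** 2)
    return float(mean_gap + cov_gap)


dist_gap = distribution_gap(codes_a, codes_b)
inst_gap = instance_gap(codes_a, codes_b)
print(f"distributional gap: {dist_gap:.6f}")
print(f"instance-wise gap:  {inst_gap:.6f}")
```

In this toy setting the distributional gap is zero up to floating-point error while the instance-wise gap stays large, which is one simple way to see why a distributional objective alone need not recover instance-level correspondence.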

Multimodal Models | Scalable Oversight & Alignment Theory

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
