Mar 31, 2026 · arXiv:2603.30002

Tracking Equivalent Mechanistic Interpretations Across Neural Networks

AI Summary

This paper tackles the challenge of scaling mechanistic interpretability (MI) by defining and formalizing "interpretive equivalence": determining whether two models share a common interpretation without explicitly describing that interpretation. The authors propose that two interpretations are equivalent if all of their possible implementations are equivalent, and they develop an algorithm to estimate this equivalence. They also provide theoretical guarantees linking algorithmic interpretations, circuits, and representations, offering a foundation for more rigorous MI evaluation.

Key Contribution

Rather than painstakingly reverse-engineering each model on its own, this work offers a way to estimate whether two different neural nets share the same interpretation, without having to spell out what that interpretation is.

Abstract

Mechanistic interpretability (MI) is an emerging framework for interpreting neural networks. Given a task and model, MI aims to discover a succinct algorithmic process, an interpretation, that explains the model's decision process on that task. However, MI is difficult to scale and generalize. This stems in part from two key challenges: there is no precise notion of a valid interpretation; and, generating interpretations is often an ad hoc process. In this paper, we address these challenges by defining and studying the problem of interpretive equivalence: determining whether two different models share a common interpretation, without requiring an explicit description of what that interpretation is. At the core of our approach, we propose and formalize the principle that two interpretations of a model are equivalent if all of their possible implementations are also equivalent. We develop an algorithm to estimate interpretive equivalence and case study its use on Transformer-based models. To analyze our algorithm, we introduce necessary and sufficient conditions for interpretive equivalence based on models' representation similarity. We provide guarantees that simultaneously relate a model's algorithmic interpretations, circuits, and representations. Our framework lays a foundation for the development of more rigorous evaluation methods of MI and automated, generalizable interpretation discovery methods.
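The abstract ties interpretive equivalence to the similarity of models' representations but does not say which similarity measure is used. As a rough illustration only, the sketch below computes linear CKA (centered kernel alignment), a common representation-similarity score, between two activation matrices collected on the same inputs. The choice of linear CKA, the function name `linear_cka`, and the synthetic activations are all assumptions made for this example; none of them is taken from the paper, and this is not the authors' algorithm.

```python
import numpy as np


def linear_cka(acts_x: np.ndarray, acts_y: np.ndarray) -> float:
    """Linear CKA between two activation matrices of shape (n_inputs, n_features).

    Assumption: this is a stand-in similarity measure for illustration,
    not the measure used in the paper. Both matrices must be computed on
    the same n_inputs examples; the feature dimensions may differ.
    """
    # Center each feature so the score ignores constant offsets.
    x = acts_x - acts_x.mean(axis=0, keepdims=True)
    y = acts_y - acts_y.mean(axis=0, keepdims=True)

    # Squared Frobenius norm of the cross-covariance, normalized by the
    # Frobenius norms of the two self-covariances.
    cross = np.linalg.norm(x.T @ y, ord="fro") ** 2
    norm_x = np.linalg.norm(x.T @ x, ord="fro")
    norm_y = np.linalg.norm(y.T @ y, ord="fro")
    return float(cross / (norm_x * norm_y))


# Hypothetical activations: model B is a rotated, rescaled copy of model A,
# while model C is unrelated random data.
rng = np.random.default_rng(0)
acts_a = rng.normal(size=(200, 8))
rotation, _ = np.linalg.qr(rng.normal(size=(8, 8)))  # random orthogonal matrix
acts_b = 2.0 * acts_a @ rotation
acts_c = rng.normal(size=(200, 8))

score_same = linear_cka(acts_a, acts_b)  # close to 1: same representation up to rotation and scale
score_diff = linear_cka(acts_a, acts_c)  # much lower: unrelated representations
```

Linear CKA is invariant to orthogonal transformations and isotropic scaling of the features, which is why the rotated copy scores near 1. A high score of this kind is only a heuristic signal; the paper's own conditions relating representation similarity to interpretive equivalence should be taken from the paper itself.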

Interpretability & Mechanistic Interp

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
