Samsung | Warsaw | Jun 1, 2026 | arXiv:2606.02061

Ablating Archetypes: The Stability of Archetypal SAEs is an Artifact of Initialization and Metric Design

Michał Brzozowski, Neo Christopher Chung

AI Summary

This paper critically examines the stability claims of archetypal sparse autoencoders (SAEs) by demonstrating that their purported reliability is primarily an artifact of identical initialization across multiple runs. The authors clarify the distinction between stability and stabilization in mechanistic interpretability, revealing that the stability observed in archetypal SAEs does not hold when this initialization is varied. Their findings underscore the necessity for rigorous trajectory diagnostics and initialization ablations in evaluating the interpretability of features extracted by SAEs in natural language processing.

Key Contribution

The supposed stability of archetypal SAEs evaporates when initialization is randomized, challenging the reliability of their concept extraction claims.

Abstract

Dictionary learning with sparse autoencoders (SAEs) produces overcomplete bases from neural network activations that are often interpretable and reduce polysemanticity. However, features from SAEs vary substantially across random seeds -- a problem known as instability. Archetypal SAEs (Fel et al., 2025) were proposed as a general dictionary-learning intervention for more reliable concept extraction, and were reported to yield more stable dictionaries at the end of training. We demonstrate that the stability claimed for archetypal SAEs results from using an identical initialization across multiple runs. Through our analyses, we attempt to clarify two distinct notions in mechanistic interpretability that are sometimes used ambiguously: stability is agreement between two independently trained models, whereas stabilization is the convergence of independently initialized runs toward a common solution. This distinction is critical for mechanistic interpretability in natural language processing (NLP), where feature stability is increasingly used as evidence that SAE features are reusable units of analysis. The archetypal SAE experiments share a deterministic k-means decoder initialization, which sets the inter-run dictionary distance to zero before training begins. When this initialization is removed, the archetypal constraint provides no stabilization advantage in our setting. We further identify a preprocessing-dependent cosine geometry issue that complicates interpretation of endpoint stability metrics. Overall, our study supports the value of studying SAEs within the larger dictionary-learning tradition while showing that stability claims require trajectory diagnostics and initialization ablations.
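The initialization point in the abstract can be illustrated with a small sketch. It is not the authors' code: the distance metric (one minus the mean best-match cosine similarity), the toy k-means routine, the synthetic Gaussian "activations", and every function name below are assumptions chosen for illustration. No training takes place; the sketch only compares the inter-run dictionary distance at step zero under a shared deterministic k-means initialization and under independent random initializations.

```python
import numpy as np


def normalize_rows(D):
    """Scale each dictionary atom (row) to unit L2 norm."""
    return D / np.linalg.norm(D, axis=1, keepdims=True)


def dictionary_distance(D_a, D_b):
    """Hypothetical metric: 1 - mean best-match cosine similarity.

    For each atom in D_a, take its highest cosine similarity to any
    atom in D_b, then average. Identical dictionaries give ~0.
    """
    sims = normalize_rows(D_a) @ normalize_rows(D_b).T
    return 1.0 - sims.max(axis=1).mean()


def kmeans_init(X, k, n_iter=10):
    """Toy deterministic Lloyd's k-means (assumed, not the paper's routine).

    Centroids start from the first k rows of X, so two calls on the
    same data return exactly the same decoder initialization.
    """
    centroids = X[:k].copy()
    for _ in range(n_iter):
        d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
        labels = d2.argmin(axis=1)
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:  # keep the old centroid if a cluster empties
                centroids[j] = members.mean(axis=0)
    return centroids


def random_init(k, dim, seed):
    """Independent Gaussian decoder initialization for one run."""
    rng = np.random.default_rng(seed)
    return rng.standard_normal((k, dim))


# Synthetic stand-in for network activations (assumption).
rng = np.random.default_rng(0)
X = rng.standard_normal((512, 16))
k = 32

# Two "runs" that share the deterministic k-means initialization.
shared_a = kmeans_init(X, k)
shared_b = kmeans_init(X, k)
dist_shared = dictionary_distance(shared_a, shared_b)

# Two "runs" with independent random initializations.
indep_a = random_init(k, 16, seed=1)
indep_b = random_init(k, 16, seed=2)
dist_indep = dictionary_distance(indep_a, indep_b)

print(f"shared k-means init: distance = {dist_shared:.6f}")
print(f"independent random init: distance = {dist_indep:.6f}")
```

Under these assumptions, the shared-initialization pair starts at a distance of essentially zero, so a small endpoint distance would not by itself show that training pulled the runs together. Separating the two notions would require tracking the distance over training for independently initialized runs.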

Interpretability & Mechanistic Interp

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
