The paper investigates whether sparse autoencoders (SAEs) recover meaningful features from neural network activations, a task for which they are increasingly used. Through experiments on synthetic data with ground-truth features, the authors find that SAEs recover only a small fraction of true features despite high explained variance. Comparing SAE performance to random baselines on real activations, the study demonstrates that these baselines achieve comparable performance in interpretability, sparse probing, and causal editing, suggesting that current SAEs do not reliably decompose models' internal mechanisms.
Sparse autoencoders, hyped as a key interpretability tool, may not be learning much more than random feature sets, casting doubt on their ability to decompose model internals.
Sparse Autoencoders (SAEs) have emerged as a promising tool for interpreting neural networks by decomposing their activations into sparse sets of human-interpretable features. Recent work has introduced multiple SAE variants and successfully scaled them to frontier models. Despite much excitement, a growing number of negative results on downstream tasks casts doubt on whether SAEs recover meaningful features. To investigate this directly, we perform two complementary evaluations. On a synthetic setup with known ground-truth features, we demonstrate that SAEs recover only $9\%$ of true features despite achieving $71\%$ explained variance, showing that they fail at their core task even when reconstruction is strong. To evaluate SAEs on real activations, we introduce three baselines that constrain SAE feature directions or their activation patterns to random values. Through extensive experiments across multiple SAE architectures, we show that our baselines match fully-trained SAEs in interpretability (0.87 vs 0.90), sparse probing (0.69 vs 0.72), and causal editing (0.73 vs 0.72). Together, these results suggest that SAEs in their current state do not reliably decompose models' internal mechanisms.
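To make the setup concrete, here is a minimal sketch of an SAE forward pass alongside a random-direction baseline in the spirit the abstract describes: decoder feature directions frozen at random unit vectors rather than learned. All names, dimensions, and the toy data are illustrative assumptions, not the paper's actual architecture or training procedure.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_sae = 16, 64  # assumed toy sizes, not the paper's settings

# Stand-in for a model's internal activations (e.g. residual stream).
X = rng.normal(size=(256, d_model))

def sae_forward(X, W_enc, b_enc, W_dec):
    """One SAE forward pass: ReLU encoder -> sparse codes -> linear decoder."""
    codes = np.maximum(X @ W_enc + b_enc, 0.0)  # sparse feature activations
    recon = codes @ W_dec                       # reconstruction from features
    return codes, recon

# A trained SAE would learn W_enc and W_dec; here both are random
# initializations, so this doubles as a random-direction baseline:
# decoder rows are frozen random unit-norm feature directions.
W_enc = rng.normal(size=(d_model, d_sae)) / np.sqrt(d_model)
b_enc = np.zeros(d_sae)
W_dec = rng.normal(size=(d_sae, d_model))
W_dec /= np.linalg.norm(W_dec, axis=1, keepdims=True)

codes, recon = sae_forward(X, W_enc, b_enc, W_dec)

# Fraction of variance explained by the reconstruction, the metric the
# abstract contrasts with feature recovery.
explained_var = 1.0 - np.var(X - recon) / np.var(X)
active_frac = (codes > 0).mean()
```

The point of such baselines is that downstream metrics (interpretability scores, probing, editing) can then be computed on `codes` exactly as for a trained SAE, isolating how much the learned directions actually contribute.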