Feb 16, 2026 · arXiv:2602.14687

SynthSAEBench: Evaluating Sparse Autoencoders on Scalable Realistic Synthetic Data

AI Summary

The paper introduces SynthSAEBench, a toolkit for generating large-scale synthetic datasets with realistic feature characteristics (correlation, hierarchy, superposition) to evaluate Sparse Autoencoders (SAEs). It also provides a standardized benchmark model, SynthSAEBench-16k, which enables direct comparison of SAE architectures and reproduces previously observed LLM SAE phenomena. Using this benchmark, the authors identify a new failure mode in which Matching Pursuit SAEs exploit superposition noise, highlighting the risk of overfitting with more expressive encoders.

Key Contribution

Matching Pursuit SAEs can exploit superposition noise to improve reconstruction without learning ground-truth features, revealing a critical failure mode in SAE training.
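For readers unfamiliar with the encoder family named above, the following is a minimal sketch of classic greedy matching pursuit over a dictionary of unit-norm atoms. It is not the paper's Matching Pursuit SAE: the function name `matching_pursuit`, the fixed step count, and the use of signed coefficients are assumptions made for illustration. It shows only the general idea that each atom is selected against the current residual rather than against the original input.

```python
def matching_pursuit(x, dictionary, num_steps):
    """Classic greedy matching pursuit (illustrative sketch, not the paper's encoder).

    x          -- input vector as a list of floats
    dictionary -- list of atoms, each a unit-norm list of floats (assumed)
    num_steps  -- number of greedy selection steps (hypothetical parameter)
    Returns (coefficients, residual).
    """
    residual = list(x)
    coeffs = [0.0] * len(dictionary)
    for _ in range(num_steps):
        # Score every atom by its inner product with the current residual.
        scores = [sum(r * a for r, a in zip(residual, atom)) for atom in dictionary]
        # Pick the atom that explains the most of what is still unexplained.
        best = max(range(len(dictionary)), key=lambda j: abs(scores[j]))
        coeffs[best] += scores[best]
        # Remove that atom's contribution before the next selection.
        residual = [r - scores[best] * a for r, a in zip(residual, dictionary[best])]
    return coeffs, residual
```

Because later selections depend on the residual left by earlier ones, such an encoder can fit whatever structure remains in the input, which may include interference between overlapping feature directions as well as genuine features.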

Abstract

Improving Sparse Autoencoders (SAEs) requires benchmarks that can precisely validate architectural innovations. However, current SAE benchmarks on LLMs are often too noisy to differentiate architectural improvements, and current synthetic data experiments are too small-scale and unrealistic to provide meaningful comparisons. We introduce SynthSAEBench, a toolkit for generating large-scale synthetic data with realistic feature characteristics including correlation, hierarchy, and superposition, and a standardized benchmark model, SynthSAEBench-16k, enabling direct comparison of SAE architectures. Our benchmark reproduces several previously observed LLM SAE phenomena, including the disconnect between reconstruction and latent quality metrics, poor SAE probing results, and a precision-recall trade-off mediated by L0. We further use our benchmark to identify a new failure mode: Matching Pursuit SAEs exploit superposition noise to improve reconstruction without learning ground-truth features, suggesting that more expressive encoders can easily overfit. SynthSAEBench complements LLM benchmarks by providing ground-truth features and controlled ablations, enabling researchers to precisely diagnose SAE failure modes and validate architectural improvements before scaling to LLMs.
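To make "synthetic data with ground-truth features" concrete, here is a minimal sketch of how such data could be generated with superposition and a simple hierarchy. It does not use the SynthSAEBench toolkit or reproduce its API: the function names, the parameters `firing_prob` and `parent_of`, and the magnitude distribution are all hypothetical, and correlation between features is omitted for brevity.

```python
import math
import random


def make_feature_directions(num_features, hidden_dim, rng):
    """Random unit vectors. With num_features > hidden_dim they cannot all be
    orthogonal, so features overlap (superposition)."""
    directions = []
    for _ in range(num_features):
        v = [rng.gauss(0.0, 1.0) for _ in range(hidden_dim)]
        norm = math.sqrt(sum(x * x for x in v))
        directions.append([x / norm for x in v])
    return directions


def sample_activation(directions, firing_prob, parent_of, rng):
    """Sample one activation vector and its ground-truth feature magnitudes.

    parent_of maps a child feature index to its parent index; this sketch
    assumes every parent index is smaller than its child's index.
    """
    num_features = len(directions)
    hidden_dim = len(directions[0])
    magnitudes = [0.0] * num_features
    for i in range(num_features):
        # Hierarchy: a child feature may fire only if its parent fired.
        parent = parent_of.get(i)
        if parent is not None and magnitudes[parent] == 0.0:
            continue
        if rng.random() < firing_prob:
            magnitudes[i] = abs(rng.gauss(1.0, 0.25))
    # The activation is the magnitude-weighted sum of feature directions.
    activation = [
        sum(magnitudes[i] * directions[i][d] for i in range(num_features))
        for d in range(hidden_dim)
    ]
    return activation, magnitudes
```

Because the magnitudes are known for every sample, an SAE trained on the activations can be scored directly against the features that generated them, which is the kind of ground-truth comparison the abstract describes.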

Data Curation & Synthetic Data · Eval Frameworks & Benchmarks · Interpretability & Mechanistic Interp

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
