This study examines how pretrained Sparse Autoencoders (SAEs) affect robustness when inserted into the residual streams of transformer models at inference time, without modifying model weights or blocking gradients. Evaluated across four model families (Gemma, LLaMA, Mistral, Qwen) against two white-box attacks (GCG, BEAST) and three black-box benchmarks, SAE-augmented models reduce jailbreak success rates by up to 5x relative to undefended baselines and also reduce cross-model attack transferability. Ablations show a monotonic dose-response relationship between L0 sparsity and attack success rate, along with a layer-dependent tradeoff between defense and utility in which intermediate layers balance robustness and clean performance.
Inserting pretrained Sparse Autoencoders into transformer residual streams at inference time reduces jailbreak success rates by up to 5x relative to undefended baselines.
Large Language Models (LLMs) remain vulnerable to optimization-based jailbreak attacks that exploit internal gradient structure. While Sparse Autoencoders (SAEs) are widely used for interpretability, their robustness implications remain underexplored. We present a study of integrating pretrained SAEs into transformer residual streams at inference time, without modifying model weights or blocking gradients. Across four model families (Gemma, LLaMA, Mistral, Qwen) and two strong white-box attacks (GCG, BEAST) plus three black-box benchmarks, SAE-augmented models achieve up to a 5x reduction in jailbreak success rate relative to the undefended baseline and reduce cross-model attack transferability. Parametric ablations reveal (i) a monotonic dose-response relationship between L0 sparsity and attack success rate, and (ii) a layer-dependent defense-utility tradeoff, where intermediate layers balance robustness and clean performance. These findings are consistent with a representational bottleneck hypothesis: sparse projection reshapes the optimization geometry exploited by jailbreak attacks.
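To make the idea of a sparse projection with a controllable L0 concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the abstract does not specify the SAE architecture, so the top-k activation rule, the function names (`matvec`, `topk_sae`, `l0`), the dimensions, and the random weights are all assumptions chosen for illustration. The sketch replaces a single residual-stream vector with its SAE reconstruction and reports how many latent units remain active.

```python
import random

def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def topk_sae(resid, W_enc, b_enc, W_dec, b_dec, k):
    """Return (reconstruction, latents) for one residual vector.

    Assumption: a ReLU encoder followed by top-k selection, so the
    latent code has at most k nonzero entries (L0 <= k).
    """
    pre = [a + b for a, b in zip(matvec(W_enc, resid), b_enc)]
    acts = [max(0.0, p) for p in pre]  # ReLU
    # Indices of the k largest activations; everything else is zeroed.
    order = sorted(range(len(acts)), key=lambda i: acts[i], reverse=True)
    keep = set(order[:k])
    latents = [a if i in keep else 0.0 for i, a in enumerate(acts)]
    recon = [r + b for r, b in zip(matvec(W_dec, latents), b_dec)]
    return recon, latents

def l0(latents):
    """Count the nonzero latent activations."""
    return sum(1 for a in latents if a != 0.0)

# Hypothetical toy sizes and random weights, for illustration only.
random.seed(0)
d_model, d_sae, k = 8, 32, 4
W_enc = [[random.gauss(0, 1) for _ in range(d_model)] for _ in range(d_sae)]
W_dec = [[random.gauss(0, 1) for _ in range(d_sae)] for _ in range(d_model)]
b_enc = [0.0] * d_sae
b_dec = [0.0] * d_model

resid = [random.gauss(0, 1) for _ in range(d_model)]
recon, latents = topk_sae(resid, W_enc, b_enc, W_dec, b_dec, k)
print("active latents:", l0(latents), "of", d_sae)
```

In this sketch, `k` plays the role of the sparsity knob that a parametric ablation over L0 would vary, and the downstream layers would receive `recon` in place of `resid`. In an autodiff framework the same operations remain differentiable with respect to the input through the selected latents, which matches the setting described above where gradients are not blocked.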