The paper investigates the use of Sparse Autoencoders (SAEs) to improve the monosemanticity of neurons in Vision-Language Models (VLMs) such as CLIP, aiming to enhance interpretability and steerability. The authors introduce a benchmark derived from a large-scale user study to evaluate neuron-level monosemanticity in visual representations. Results show that SAEs, particularly those with high sparsity and wide latent layers, significantly improve monosemanticity, and that interventions on the CLIP vision encoder can steer the outputs of multimodal LLMs.
Sparse autoencoders unlock VLM interpretability and let you steer multimodal LLMs like LLaVA directly by intervening on CLIP's vision encoder.
Sparse Autoencoders (SAEs) have recently gained attention as a means to improve the interpretability and steerability of Large Language Models (LLMs), both of which are essential for AI safety. In this work, we extend the application of SAEs to Vision-Language Models (VLMs), such as CLIP, and introduce a comprehensive framework for evaluating monosemanticity at the neuron level in visual representations. To ensure that our evaluation aligns with human perception, we propose a benchmark derived from a large-scale user study. Our experimental results reveal that SAEs trained on VLMs significantly enhance the monosemanticity of individual neurons, with high sparsity and wide latent layers being the most influential factors. Furthermore, we demonstrate that applying SAE interventions to CLIP's vision encoder directly steers multimodal LLM outputs (e.g., LLaVA) without any modifications to the underlying language model. These findings emphasize the practicality and efficacy of SAEs as an unsupervised tool for enhancing both the interpretability and control of VLMs. Code and benchmark data are available at https://github.com/ExplainableML/sae-for-vlm.
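To make the intervention idea concrete, here is a minimal, hypothetical sketch of an SAE-style latent intervention: encode an activation into a sparse code, clamp one latent unit, and decode back. All names, dimensions, and weights below are illustrative stand-ins (randomly initialized, not trained), not the paper's actual implementation; in practice the modified activation would replace the original CLIP vision-encoder output fed to the language model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: a 16-d stand-in for a CLIP activation, 64 SAE latents.
d_model, d_sae = 16, 64

# Randomly initialized SAE weights; a trained SAE would learn these.
W_enc = rng.normal(scale=0.1, size=(d_model, d_sae))
b_enc = np.zeros(d_sae)
W_dec = rng.normal(scale=0.1, size=(d_sae, d_model))
b_dec = np.zeros(d_model)

def sae_encode(x):
    # ReLU encoder yields sparse, non-negative latent codes.
    return np.maximum(x @ W_enc + b_enc, 0.0)

def sae_decode(z):
    # Linear decoder reconstructs the original activation space.
    return z @ W_dec + b_dec

def steer(x, unit, strength):
    """Clamp one SAE latent to a fixed value and decode back,
    producing a modified activation to feed downstream."""
    z = sae_encode(x)
    z = z.copy()
    z[unit] = strength
    return sae_decode(z)

x = rng.normal(size=d_model)             # stand-in for a vision activation
x_steered = steer(x, unit=3, strength=5.0)
print(x_steered.shape)                   # same shape as the input activation
```

The key design point is that the intervention happens purely in the vision encoder's activation space, which is why no change to the language model is required.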