Jun 15, 2026

arXiv:2606.16193

Cascaded Sparse Autoencoders Learn Multi-Level Visual Concepts in Multimodal LLMs

Yusong Zhao, Hengyi Wang, Tanuja Ganu, Akshay Nambi

AI Summary

This paper introduces cascaded sparse autoencoders (CSAEs) to make the visual representations of multimodal large language models (MLLMs) more interpretable by learning hierarchical visual concepts. Unlike existing sparse autoencoders, which primarily recover flat feature dictionaries, CSAEs train a second-level SAE on the decoder weights of the first-level SAE, so that low-level feature directions are organized into higher-level abstractions. Experiments across Qwen3-VL, Gemma-3, and LLaVA on multiple visual datasets show that CSAEs improve hierarchical concept coherence over state-of-the-art SAE baselines, and that the learned concept groups support effective group-level interventions in MLLM outputs.

Key Contribution

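Cascaded sparse autoencoders learn hierarchical visual concepts by training a second-level SAE on the decoder weights of a first-level SAE, yielding concept groups that can be used to interpret MLLM outputs and to intervene on them at the group level.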

Abstract

Multimodal Large Language Models (MLLMs) have demonstrated strong performance on vision-language tasks, yet their internal visual representations remain difficult to interpret. Sparse Autoencoders (SAEs) provide a scalable way to decompose dense model activations into sparse, interpretable features. However, existing SAE architectures primarily recover flat feature dictionaries and are less suited for explicit multi-level concept organization. In this paper, we introduce cascaded sparse autoencoders (CSAEs) for learning hierarchical visual concepts in MLLMs. Rather than nesting or stacking SAE sparse activation codes, CSAEs train a second-level SAE directly on the decoder weights of the first-level SAE, treating learned low-level feature directions as inputs for higher-level abstraction. This design enables CSAEs to learn "concepts of concepts" while avoiding both the shared-prefix coupling of nested, Matryoshka-style hierarchies and the bottlenecks of naively stacked SAEs. Experiments across Qwen3-VL, Gemma-3, and LLaVA on multiple visual datasets show that CSAEs improve interpretability in terms of hierarchical concept coherence over state-of-the-art SAE baselines. Results on concept steering further demonstrate that the learned concept groups support effective group-level interventions in MLLM outputs.
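The cascading idea described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the tiny ReLU SAE, the L1 penalty, the full-batch gradient descent, the row normalization of decoder weights, and the argmax grouping rule are all assumptions chosen for brevity, and the function names `train_sae`, `encode`, and `cascade` are hypothetical. The only element taken from the abstract is that the second-level SAE is trained on the first-level decoder weights rather than on the first-level sparse codes.

```python
import numpy as np


def train_sae(x, n_features, l1=1e-3, lr=0.02, steps=300, seed=0):
    """Fit a tiny ReLU sparse autoencoder by full-batch gradient descent.

    Returns (w_enc, b_enc, w_dec, b_dec, losses). All hyperparameters are
    illustrative assumptions, not values from the paper.
    """
    rng = np.random.default_rng(seed)
    n, d = x.shape
    w_enc = rng.normal(0.0, 0.1, size=(d, n_features))
    b_enc = np.zeros(n_features)
    w_dec = rng.normal(0.0, 0.1, size=(n_features, d))
    b_dec = np.zeros(d)
    losses = []
    for _ in range(steps):
        # Forward pass: non-negative sparse code z, then linear reconstruction.
        z = np.maximum(x @ w_enc + b_enc, 0.0)
        err = z @ w_dec + b_dec - x
        # Mean squared reconstruction error plus L1 penalty (z >= 0, so |z| = z).
        losses.append(float((err ** 2).sum(axis=1).mean()
                            + l1 * z.sum(axis=1).mean()))
        # Backward pass: compute all gradients before updating any parameter.
        d_xhat = 2.0 * err / n
        d_pre = (d_xhat @ w_dec.T + l1 / n) * (z > 0)
        g_w_dec, g_b_dec = z.T @ d_xhat, d_xhat.sum(axis=0)
        g_w_enc, g_b_enc = x.T @ d_pre, d_pre.sum(axis=0)
        w_dec -= lr * g_w_dec
        b_dec -= lr * g_b_dec
        w_enc -= lr * g_w_enc
        b_enc -= lr * g_b_enc
    return w_enc, b_enc, w_dec, b_dec, losses


def encode(x, w_enc, b_enc):
    """Sparse codes for inputs x under a trained encoder."""
    return np.maximum(x @ w_enc + b_enc, 0.0)


def cascade(activations, n_low=32, n_high=4):
    """Train a first-level SAE, then a second-level SAE on its decoder rows."""
    # Level 1: decompose model activations into low-level feature directions.
    sae1 = train_sae(activations, n_low, seed=0)
    w_dec1 = sae1[2]  # shape (n_low, d): one direction per low-level feature
    # Assumption: unit-normalize directions so feature scale does not dominate.
    norms = np.linalg.norm(w_dec1, axis=1, keepdims=True)
    directions = w_dec1 / (norms + 1e-8)
    # Level 2: each low-level direction is one training sample.
    sae2 = train_sae(directions, n_high, seed=1)
    codes = encode(directions, sae2[0], sae2[1])  # shape (n_low, n_high)
    # Assumption: assign each low-level feature to its strongest high-level
    # feature. A row with no active code falls back to group 0.
    groups = codes.argmax(axis=1)
    return sae1, sae2, groups


if __name__ == "__main__":
    # Synthetic stand-in for MLLM visual activations (assumption).
    acts = np.random.default_rng(0).normal(size=(256, 16))
    _, _, groups = cascade(acts)
    print(np.bincount(groups, minlength=4))
```

In this sketch the second-level dataset has only as many samples as there are first-level features, so the second-level dictionary is kept much smaller than the first. A real setup would replace the synthetic activations with visual-token activations from the model under study.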

Interpretability & Mechanistic Interp, Multimodal Models

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
