Fraunhofer | Institute for the Foundations of Learning | TU Berlin | Apr 13, 2026 | arXiv:2604.11467

From Attribution to Action: A Human-Centered Application of Activation Steering

Tobias Labarta, Maximilian Dreyer, Katharina Weitz, Wojciech Samek, Sebastian Lapuschkin

AI Summary

This paper introduces an interactive workflow combining Sparse Autoencoder (SAE)-based attribution with activation steering to enable instance-level analysis and debugging of concept usage in vision models. Through expert interviews (N=8) using a web-based tool applied to CLIP, the study examines how practitioners reason about, trust, and apply activation steering for debugging tasks. The key finding is that activation steering facilitates a shift from passive inspection to active intervention-based hypothesis testing, with trust primarily grounded in observed model responses.

Key Contribution

Activation steering turns interpretability into a hands-on debugging tool, though participants flagged ripple effects and limited generalization of instance-level corrections as risks.

Abstract

Explainable AI (XAI) methods reveal which features influence model predictions, yet provide limited means for practitioners to act on these explanations. Activation steering of components identified via XAI offers a path toward actionable explanations, although its practical utility remains understudied. We introduce an interactive workflow combining SAE-based attribution with activation steering for instance-level analysis of concept usage in vision models, implemented as a web-based tool. Based on this workflow, we conduct semi-structured expert interviews (N=8) with debugging tasks on CLIP to investigate how practitioners reason about, trust, and apply activation steering. We find that steering enables a shift from inspection to intervention-based hypothesis testing (8/8 participants), with most grounding trust in observed model responses rather than explanation plausibility alone (6/8). Participants adopted systematic debugging strategies dominated by component suppression (7/8) and highlighted risks including ripple effects and limited generalization of instance-level corrections. Overall, activation steering renders interpretability more actionable while raising important considerations for safe and effective use.
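The workflow described in the abstract (attribute a prediction to SAE components, then suppress one and observe the model's response) can be illustrated with a minimal sketch. This is not the authors' implementation: the SAE weights are random stand-ins, the linear readout `w_out` is a hypothetical replacement for the rest of the model, and activation-times-gradient is assumed as the attribution rule purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: model activation width and SAE dictionary size.
d_model, d_sae = 8, 16

# Toy SAE weights (random stand-ins; a real SAE would be trained).
W_enc = rng.normal(size=(d_model, d_sae))
b_enc = np.zeros(d_sae)
W_dec = rng.normal(size=(d_sae, d_model))
b_dec = np.zeros(d_model)

# Toy linear readout standing in for the layers after the steered activation.
w_out = rng.normal(size=d_model)


def sae_encode(x):
    """Map an activation vector to non-negative SAE latents (ReLU encoder)."""
    return np.maximum(x @ W_enc + b_enc, 0.0)


def sae_decode(z):
    """Map SAE latents back to activation space."""
    return z @ W_dec + b_dec


def attribute(x):
    """Activation-times-gradient score of each latent for the toy output."""
    z = sae_encode(x)
    # For the linear readout, d(output)/d(z_k) = W_dec[k] . w_out.
    return z * (W_dec @ w_out)


def steer(x, latent, scale):
    """Rescale one latent (scale=0 suppresses it), keeping the SAE residual."""
    z = sae_encode(x)
    residual = x - sae_decode(z)  # part of x the SAE does not reconstruct
    z_steered = z.copy()
    z_steered[latent] *= scale
    return sae_decode(z_steered) + residual


x = rng.normal(size=d_model)
scores = attribute(x)
top = int(np.argmax(np.abs(scores)))   # most influential latent for this input

x_suppressed = steer(x, top, 0.0)      # component suppression
before = float(x @ w_out)
after = float(x_suppressed @ w_out)
print(f"latent {top}: output {before:.3f} -> {after:.3f}")
```

Adding the residual back means that steering with `scale=1.0` leaves the activation unchanged, so any change in the output is due to the intervention and not to SAE reconstruction error. In this linear toy setting the drop in output equals the attribution score of the suppressed latent; in a real nonlinear model such as CLIP that equality would hold only approximately, which is one reason to check the observed response rather than rely on the attribution alone.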

Computer Vision Interpretability & Mechanistic Interp

Citation Metrics

Citations: 0
Influential citations: 0
References: 0
Year: 2026
Venue: N/A
