This paper introduces an interactive workflow combining Sparse Autoencoder (SAE)-based attribution with activation steering to enable instance-level analysis and debugging of concept usage in vision models. Through expert interviews (N=8) using a web-based tool applied to CLIP, the study examines how practitioners reason about, trust, and apply activation steering for debugging tasks. The key finding is that activation steering facilitates a shift from passive inspection to active intervention-based hypothesis testing, with trust primarily grounded in observed model responses.
Activation steering turns interpretability into a hands-on debugging tool, but suppressing a component can have ripple effects, and instance-level fixes may not generalize.
Explainable AI (XAI) methods reveal which features influence model predictions, yet provide limited means for practitioners to act on these explanations. Activation steering of components identified via XAI offers a path toward actionable explanations, although its practical utility remains understudied. We introduce an interactive workflow combining SAE-based attribution with activation steering for instance-level analysis of concept usage in vision models, implemented as a web-based tool. Based on this workflow, we conduct semi-structured expert interviews (N=8) with debugging tasks on CLIP to investigate how practitioners reason about, trust, and apply activation steering. We find that steering enables a shift from inspection to intervention-based hypothesis testing (8/8 participants), with most grounding trust in observed model responses rather than explanation plausibility alone (6/8). Participants adopted systematic debugging strategies dominated by component suppression (7/8) and highlighted risks including ripple effects and limited generalization of instance-level corrections. Overall, activation steering renders interpretability more actionable while raising important considerations for safe and effective use.
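To make the suppression strategy mentioned in the abstract more concrete, here is a minimal sketch of steering a single SAE latent. It is not the authors' implementation: the abstract does not specify the SAE architecture or the steering formula, so the ReLU encoder, the linear decoder, the residual correction, the random weights, and every name (`sae_encode`, `sae_decode`, `steer_latent`) are assumptions made for illustration. The most active latent stands in for whatever concept an attribution method would select.

```python
import numpy as np


def sae_encode(x, w_enc, b_enc):
    # Assumed ReLU encoder: non-negative concept activations.
    return np.maximum(0.0, x @ w_enc + b_enc)


def sae_decode(z, w_dec, b_dec):
    # Assumed linear decoder back to the model's activation space.
    return z @ w_dec + b_dec


def steer_latent(x, w_enc, b_enc, w_dec, b_dec, latent, scale):
    """Rescale one SAE latent and map the result back to activation space.

    scale=0.0 suppresses the concept, scale=1.0 leaves x unchanged,
    and scale>1.0 amplifies it.
    """
    z = sae_encode(x, w_enc, b_enc)
    # Keep the part of x the SAE fails to reconstruct, so that only
    # the chosen latent changes.
    residual = x - sae_decode(z, w_dec, b_dec)
    z_steered = z.copy()
    z_steered[..., latent] *= scale
    return sae_decode(z_steered, w_dec, b_dec) + residual


# Toy setup with random weights (hypothetical; a real workflow would load
# a trained SAE and read activations from the vision model).
rng = np.random.default_rng(0)
d_model, d_sae = 8, 32
w_enc = rng.normal(size=(d_model, d_sae))
b_enc = np.zeros(d_sae)
w_dec = rng.normal(size=(d_sae, d_model))
b_dec = np.zeros(d_model)
x = rng.normal(size=d_model)

z = sae_encode(x, w_enc, b_enc)
# Stand-in for attribution: pick the most active latent.
top = int(np.argmax(z))
x_suppressed = steer_latent(x, w_enc, b_enc, w_dec, b_dec, latent=top, scale=0.0)
x_unchanged = steer_latent(x, w_enc, b_enc, w_dec, b_dec, latent=top, scale=1.0)
```

In a debugging session, the steered activation would be written back into the model and the prediction compared before and after, which is the intervention-based hypothesis test the participants describe. Because the edit touches a direction that other inputs may share, the same suppression can change unrelated predictions, which is one way to read the ripple-effect risk the abstract raises.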