Feb 10, 2026

arXiv:2602.09870

Steer2Edit: From Activation Steering to Component-Level Editing

Chung-En Sun, Ge Yan, Zimo Wang, Tsui-Wei Weng

AI Summary

The paper introduces Steer2Edit, a training-free framework that turns steering vectors, typically used for inference-time activation interventions in large language models, into diagnostic signals for component-level rank-1 weight editing. Rather than applying one global activation change, it selectively redistributes behavioral influence across individual attention heads and MLP neurons, which addresses the unfavorable attribute-utility trade-offs that global interventions often induce under strong control. Across safety alignment, hallucination mitigation, and reasoning efficiency, Steer2Edit achieves more favorable attribute-utility trade-offs: at matched downstream performance, it improves safety by up to 17.2%, increases truthfulness by 9.8%, and reduces reasoning length by 12.2% on average.

Key Contribution

Steer2Edit replaces global activation steering with targeted rank-1 weight edits to the specific attention heads and MLP neurons that govern a behavior.

Abstract

Steering methods influence Large Language Model behavior by identifying semantic directions in hidden representations, but are typically realized through inference-time activation interventions that apply a fixed, global modification to the model's internal states. While effective, such interventions often induce unfavorable attribute-utility trade-offs under strong control, as they ignore the fact that many behaviors are governed by a small and heterogeneous subset of model components. We propose Steer2Edit, a theoretically grounded, training-free framework that transforms steering vectors from inference-time control signals into diagnostic signals for component-level rank-1 weight editing. Instead of uniformly injecting a steering direction during generation, Steer2Edit selectively redistributes behavioral influence across individual attention heads and MLP neurons, yielding interpretable edits that preserve the standard forward pass and remain compatible with optimized parallel inference. Across safety alignment, hallucination mitigation, and reasoning efficiency, Steer2Edit consistently achieves more favorable attribute-utility trade-offs: at matched downstream performance, it improves safety by up to 17.2%, increases truthfulness by 9.8%, and reduces reasoning length by 12.2% on average. Overall, Steer2Edit provides a principled bridge between representation steering and weight editing by translating steering signals into interpretable, training-free parameter updates.
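To make the idea of a component-level rank-1 weight edit concrete, here is a minimal NumPy sketch for MLP neurons. It is an illustration under stated assumptions, not the paper's algorithm: the abstract does not specify the scoring or update rule, so this sketch assumes that each neuron is scored by the cosine similarity between its output direction and the steering vector, and that the top-scoring neurons have their component along the steering direction rescaled. The function name `rank1_edit_neurons` and the parameters `alpha` and `top_k` are hypothetical.

```python
import numpy as np

def rank1_edit_neurons(W_out, v, alpha=1.0, top_k=2):
    """Illustrative rank-1 edits of an MLP output matrix (not the paper's rule).

    W_out : (d_model, n_neurons) array; column j is neuron j's output direction.
    v     : (d_model,) steering vector, used here only as a diagnostic signal.
    alpha : edit strength (assumed); alpha > 0 amplifies the component along v,
            alpha = -1 removes it from the selected neurons.
    top_k : number of neurons to edit (assumed selection rule: largest |cosine|).
    """
    v_hat = v / np.linalg.norm(v)
    proj = v_hat @ W_out                      # projection of each column onto v_hat
    col_norms = np.linalg.norm(W_out, axis=0)
    cos = proj / np.maximum(col_norms, 1e-12)  # cosine alignment per neuron

    chosen = np.argsort(-np.abs(cos))[:top_k]  # most aligned neurons first
    W_new = W_out.copy()
    for j in chosen:
        e_j = np.zeros(W_out.shape[1])
        e_j[j] = 1.0
        # Rank-1 update: only column j changes, and only along v_hat.
        W_new += alpha * proj[j] * np.outer(v_hat, e_j)
    return W_new, chosen, cos

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    W = rng.normal(size=(8, 5))   # toy sizes, chosen for illustration
    v = rng.normal(size=8)
    W_edit, chosen, cos = rank1_edit_neurons(W, v, alpha=0.5, top_k=2)
    print("edited neurons:", chosen)
    print("rank of total update:", np.linalg.matrix_rank(W_edit - W))
```

Because the change is written into the weights, the edited matrix is used in an ordinary forward pass with no hook or extra addition at generation time, which is the property the abstract highlights. An analogous per-head edit could be sketched for attention output projections, again as an assumption rather than the authors' method.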

Interpretability & Mechanistic Interp, Natural Language Processing

Citation Metrics

Citations: 0

Influential citations: 0

References: 26

Year: 2026

Venue: N/A
