B-it · UMD · Jun 10, 2026 · arXiv:2606.11599

When is Your LLM Steerable?

Chenrui Fan, Yize Cheng, Ming Li, S. Feizi, Soheil Feizi, Tianyi Zhou

AI Summary

This study investigates whether a language model's steerability can be predicted from its early internal states during generation, which would avoid the costly grid searches otherwise needed to find effective steering configurations. Using the ASTEER testbed of 1.4 million steered generations across 150 concepts, the authors analyze how hidden states change under steering across layers and initial decoding steps. A Gradient Boosting Decision Trees classifier trained on these features achieves a macro-F1 score of approximately 0.7 on unseen concepts, indicating that early hidden states carry useful information for predicting steering outcomes and for choosing steering strength at reduced computational expense.
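As a rough illustration of features that "compare hidden states before and after steering," the sketch below computes two quantities per decoding step and layer: the cosine similarity between the unsteered and steered hidden state, and the relative norm of the shift. Both the feature choice and the names `cosine` and `shift_features` are assumptions made for illustration; the summary does not specify the exact features used.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu > 0 and nv > 0 else 0.0

def shift_features(base, steered):
    """Flatten per-(step, layer) comparison features into one vector.

    base, steered: nested lists indexed [step][layer][dim], holding hidden
    states from an unsteered and a steered run (hypothetical layout).
    """
    feats = []
    for step_b, step_s in zip(base, steered):
        for h_b, h_s in zip(step_b, step_s):
            delta = [s - b for b, s in zip(h_b, h_s)]
            base_norm = math.sqrt(sum(b * b for b in h_b))
            delta_norm = math.sqrt(sum(d * d for d in delta))
            feats.append(cosine(h_b, h_s))  # direction change
            feats.append(delta_norm / base_norm if base_norm > 0 else 0.0)  # relative shift size
    return feats

# Toy input: 1 decoding step, 2 layers, hidden size 2.
base = [[[1.0, 0.0], [0.0, 2.0]]]
steered = [[[1.0, 0.0], [2.0, 0.0]]]
feats = shift_features(base, steered)
```

A vector like `feats`, computed over the first few decoding steps, is the kind of input one could hand to any gradient-boosted tree classifier together with the steering outcome label.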

Key Contribution

Early hidden states of LLMs can predict steering success (around 0.7 macro-F1 on unseen concepts), enabling efficient steering without exhaustive rollouts.

Abstract

Activation steering offers a lightweight approach to controlling language models' behavior at inference time, but whether it succeeds or fails depends heavily on the prompt, concept, model, and steering configuration. Finding the regime and boundaries of successful steering typically requires expensive grid searches and post-hoc evaluation of full autoregressive rollouts. In this work, we investigate whether steerability can be predicted from the model's internal states at the beginning of the generation process, e.g., after generating the first few tokens, and how to leverage such a predictor to improve the steering success rate. To this end, we first introduce ASTEER, a testbed of 1.4M steered generations spanning 150 concepts, each labeled as a steering success or failure. Leveraging this testbed, we analyze the model's early decoding dynamics by extracting features that compare hidden states before and after steering across layers and initial decoding steps. These features help us understand how steering's effects propagate across layers and token positions, and they provide key information for steerability prediction. We then train a Gradient Boosting Decision Trees (GBDT) classifier on these features to predict whether an intervention will under-steer, succeed, or over-steer without requiring a full rollout. Our predictor achieves a macro-F1 score of around 0.7 on unseen concepts, demonstrating that early hidden states encode substantial, structured information about eventual steering efficacy. We further use this steerability predictor to guide the search over steering strength, achieving near-optimal performance at a small fraction of the decoding cost.
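One way to picture predictor-guided strength search is a bisection over the steering strength, driven by the three-way prediction (under-steer, succeed, over-steer). The sketch below is only an assumption about how such a search could work: the abstract does not describe the actual procedure, the function `search_strength` and the label strings are hypothetical, and the bisection relies on the assumed property that outcomes move from under-steer to success to over-steer as strength grows.

```python
def search_strength(predict, lo, hi, max_probes=8):
    """Bisect over steering strength using a cheap early-state predictor.

    predict(strength) is a hypothetical callable returning "under",
    "success", or "over". Assumes outcomes are monotone in strength.
    Returns (strength or None, number of predictor calls used).
    """
    probes = 0
    while probes < max_probes:
        mid = (lo + hi) / 2.0
        label = predict(mid)
        probes += 1
        if label == "success":
            return mid, probes
        if label == "under":
            lo = mid  # too weak: search stronger settings
        else:
            hi = mid  # too strong: search weaker settings
    return None, probes

def toy_predictor(strength):
    """Stand-in for a trained classifier: success only within [4, 6]."""
    if strength < 4:
        return "under"
    if strength > 6:
        return "over"
    return "success"

found, used = search_strength(toy_predictor, 0.0, 16.0)
missed, spent = search_strength(lambda s: "under", 0.0, 16.0)
```

Each probe here costs only the few decoding steps the predictor needs, rather than a full rollout per candidate strength, which is where the savings over a grid search would come from.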

Eval Frameworks & Benchmarks · Interpretability & Mechanistic Interp

Citation Metrics

Citations: 0

Influential citations: 0

References: 59

Year: 2026

Venue: N/A
