Fraunhofer | KAIST | Northeastern | Jun 9, 2026 | arXiv:2606.11172

Predicting Future Behaviors in Reasoning Models Enables Better Steering

Evgenii Kortukov, Piotr Komorowski, Florian Klein, Paula Engl, Gabriele Sarti, Seong Joon Oh, Sebastian Lapuschkin, Wojciech Samek

AI Summary

This paper argues that existing test-time steering methods for large reasoning models (LRMs) implicitly rely on internal features that detect behavior in already generated text, and shows that such detection features are poor predictors of future behavior. The authors instead train activation probes that predict future behavior likelihoods from intermediate reasoning steps; these probes identify the most likely behavior with 64%-91% accuracy. Building on the probes, they introduce a text-level steering method, Future Probe Controlled Generation (FPCG), which samples multiple candidate sentences and keeps the one the probe scores best. FPCG steers with almost no degradation in output quality and succeeds in several evaluations where activation steering fails.

Key Contribution

Activation probes predict the most likely future behavior of reasoning models with 64%-91% accuracy, enabling text-level steering with almost no degradation in output quality.

Abstract

Deployed large reasoning models (LRMs) often behave unexpectedly. Test-time steering controls LRM outputs by intervening on their hidden representations, but it can degrade output quality. We argue that prior steering work implicitly relies on internal features that detect behavior in already generated text. We show that these detection features are poor predictors of future behavioral outcomes, and thus not the natural intervention target. Instead, we train activation probes to predict future behavior likelihoods from intermediate reasoning steps. These probes predict the most likely behavior with 64%-91% accuracy, revealing a separate type of internal prediction features. Building on these prediction features, we introduce a text-level steering method, Future Probe Controlled Generation. FPCG samples multiple candidate sentences and chooses the best one according to a probe predicting the future behavior likelihood. This enables steering with almost no output quality degradation. FPCG also enables steering in several evaluations where activation steering fails. These results show that distinguishing detection and prediction features enables a more nuanced approach to controlling LRM behaviors.
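The sample-and-select loop described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`probe_likelihood`, `fpcg_generate`), the logistic form of the probe, the choice to maximize the probe score, and the toy sampler and keyword-based "activations" are all assumptions made for illustration. In a real setting the sampler would be the LRM and the activations would be its hidden states.

```python
import itertools
import math
from typing import Callable, List, Sequence


def probe_likelihood(weights: Sequence[float], bias: float,
                     activation: Sequence[float]) -> float:
    """Hypothetical linear probe: sigmoid(w . h + b), read here as the
    predicted likelihood of the target future behavior."""
    z = sum(w * a for w, a in zip(weights, activation)) + bias
    return 1.0 / (1.0 + math.exp(-z))


def fpcg_generate(prompt: str,
                  sample_sentence: Callable[[List[str]], str],
                  get_activation: Callable[[List[str]], Sequence[float]],
                  probe: Callable[[Sequence[float]], float],
                  num_candidates: int = 4,
                  max_sentences: int = 3) -> List[str]:
    """At each step, sample several candidate sentences and keep the one
    whose resulting state the probe scores highest (assumed objective)."""
    trace = [prompt]
    for _ in range(max_sentences):
        candidates = [sample_sentence(trace) for _ in range(num_candidates)]
        best = max(candidates, key=lambda s: probe(get_activation(trace + [s])))
        trace.append(best)
    return trace


# --- Toy stand-ins (assumptions; a real setup would query the model) ---
_POOL = itertools.cycle([
    "Let me double-check this.",
    "The answer is 4.",
    "I will just guess.",
    "So 2 + 2 = 4.",
])


def toy_sample_sentence(trace: List[str]) -> str:
    # Deterministic placeholder for sampling a sentence from the model.
    return next(_POOL)


def toy_activation(trace: List[str]) -> List[float]:
    # Placeholder features computed from the last sentence only.
    last = trace[-1].lower()
    return [float("check" in last), float("guess" in last)]


def toy_probe(activation: Sequence[float]) -> float:
    # Rewards "checking" and penalizes "guessing" in this toy example.
    return probe_likelihood([2.0, -2.0], 0.0, activation)


trace = fpcg_generate("What is 2 + 2?", toy_sample_sentence,
                      toy_activation, toy_probe,
                      num_candidates=4, max_sentences=2)
```

Because selection happens over sampled text rather than by editing hidden states, every chosen sentence is one the model could have produced on its own, which is consistent with the abstract's claim of almost no quality degradation.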

Interpretability & Mechanistic Interp | Reasoning & Chain-of-Thought

Year: 2026

Venue: N/A
