Bairong Inc. | Feb 15, 2026 | arXiv:2602.14143

ROAST: Rollout-based On-distribution Activation Steering Technique

Xuanbo Su, Hao Luo, Yingfang Zhang, Lijun Zhang

AI Summary

The paper introduces ROAST, a novel activation steering technique for LLMs that leverages on-distribution rollouts to estimate steering directions, mitigating the brittleness of off-distribution supervision. ROAST employs ROC (Rollout-based On-distribution Calibration) to estimate steering directions and addresses the issue of disproportionate activation magnitudes by using Continuous Soft Scaling (CSS) and Grouped Mean Normalization. Experiments across various models (0.6B to 32B) demonstrate that ROAST consistently improves performance on tasks like GSM8K and TruthfulQA, while CSS preserves activation energy more effectively.

Key Contribution

ROAST replaces brittle, off-distribution steering with directions estimated from the model's own on-distribution rollouts and normalized so that high-magnitude activations do not dominate, yielding gains such as +9.7% on GSM8K (Qwen3-0.6B) and +12.1% on TruthfulQA (GLM4-32B).

Abstract

Activation steering provides parameter-efficient control over large language models (LLMs) at inference time, but many methods rely on off-distribution supervision and discrete masking, leading to brittle interventions. We propose ROAST (Rollout-based On-distribution Activation Steering Technique), which estimates steering directions from the model's own on-distribution rollouts via ROC and avoids hard sparsification via Continuous Soft Scaling (CSS) and Grouped Mean Normalization. Our empirical analysis reveals that while activation magnitude correlates moderately with directional consistency, the variance in magnitude is significant and often disproportionate to semantic quality. This suggests that high-magnitude activations risk dominating the global steering direction if not properly normalized. To address this, ROAST employs grouped normalization to balance contributions across samples, ensuring a more robust estimation of the consensus steering direction. Across models (0.6B to 32B), ROAST consistently improves performance on diverse tasks (e.g., +9.7% on GSM8K for Qwen3-0.6B and +12.1% on TruthfulQA for GLM4-32B), and analyses show that CSS better preserves activation energy.
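The normalization argument in the abstract can be illustrated with a small sketch. Everything below is an assumption made for illustration, since the abstract does not give formulas: the function names are hypothetical, the per-prompt direction is taken as a difference of mean activations between successful and unsuccessful rollouts, "grouped" is read as one group per prompt, and the soft weighting (each dimension scaled by its magnitude relative to the largest one) is only a stand-in for Continuous Soft Scaling, not the paper's definition.

```python
import math

EPS = 1e-8  # guards against division by zero


def norm(v):
    return math.sqrt(sum(x * x for x in v))


def unit(v):
    n = norm(v)
    return [x / (n + EPS) for x in v]


def mean(vectors):
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


def grouped_direction(groups):
    """Estimate a consensus steering direction (hypothetical helper).

    groups: list of (pos_acts, neg_acts) pairs, one pair per prompt.
    Each element is a list of activation vectors taken from the model's
    own rollouts, split by whether the rollout succeeded (assumption).
    """
    per_group = []
    for pos_acts, neg_acts in groups:
        if not pos_acts or not neg_acts:
            continue  # no contrast available for this prompt
        diff = [p - q for p, q in zip(mean(pos_acts), mean(neg_acts))]
        # Normalize each group's difference so a prompt with very large
        # activations cannot dominate the average.
        per_group.append(unit(diff))
    if not per_group:
        raise ValueError("need at least one group with both outcomes")
    return unit(mean(per_group))


def soft_scale(direction):
    """Continuous per-dimension weighting instead of a hard 0/1 mask.

    Stand-in rule (assumption): weight = |x| / max |x|.
    """
    peak = max(abs(x) for x in direction)
    if peak == 0:
        return list(direction)
    return [(abs(x) / peak) * x for x in direction]


def steer(hidden, direction, alpha=1.0):
    """Add the scaled steering direction to a hidden state."""
    return [h + alpha * d for h, d in zip(hidden, direction)]
```

With two prompts whose raw differences are (100, 0) and (0, 1), an unnormalized average would point almost entirely along the first axis; `grouped_direction` instead returns a direction that weights both prompts equally, which is the balancing effect the abstract describes.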

Interpretability & Mechanistic Interp | Natural Language Processing

Year: 2026

Venue: N/A
