Mar 17, 2026 · arXiv:2603.16335

Behavioral Steering in a 35B MoE Language Model via SAE-Decoded Probe Vectors: One Agency Axis, Not Five Traits

AI Summary

The authors train sparse autoencoders (SAEs) on the residual stream of Qwen 3.5-35B-A3B, a 35B MoE model, and derive steering vectors for agentic behaviors by projecting linear probe weights back through the SAE decoder. They find that the steering vectors for five purported agentic traits primarily modulate a single "agency axis": independent action versus deference to the user. They also show that steering applied only during autoregressive decoding has no measurable effect, which suggests that behavioral commitments are made during prefill in GatedDeltaNet architectures.
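The projection step described above can be sketched in a few lines. This is a toy illustration under stated assumptions, not the authors' implementation: the decoder matrix `W_dec`, the probe weights `probe_w`, the unit-normalization step, and the function names are all hypothetical, and the way the paper scales its "multiplier" may differ.

```python
import math

def decode_probe(probe_w, W_dec):
    """Project probe weights through the SAE decoder.

    probe_w: one weight per SAE latent (length n_latents).
    W_dec:   decoder rows, one d_model-sized direction per latent.
    Returns v = sum_i probe_w[i] * W_dec[i], a vector in activation space.
    """
    d_model = len(W_dec[0])
    v = [0.0] * d_model
    for w_i, row in zip(probe_w, W_dec):
        for j, x in enumerate(row):
            v[j] += w_i * x
    return v

def unit(v):
    """Scale v to unit L2 norm (an assumption; the paper may scale differently)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def steer(h, v, multiplier):
    """Add the steering vector to one residual-stream activation h."""
    return [h_j + multiplier * v_j for h_j, v_j in zip(h, v)]

# Toy numbers only: 3 latents, d_model = 2.
W_dec = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
probe_w = [3.0, 4.0, 0.0]

v = unit(decode_probe(probe_w, W_dec))   # approximately [0.6, 0.8]
h_steered = steer([1.0, 1.0], v, 2.0)    # approximately [2.2, 2.6]
```

Because the resulting vector is a dense weighted sum of decoder directions, its strength can be varied continuously through `multiplier` rather than by switching individual SAE latents on or off.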

Key Contribution

Forget fine-tuning: steer a 35B MoE's agency on the fly with SAE-decoded vectors, revealing a surprisingly simple, one-dimensional control knob.

Abstract

We train nine sparse autoencoders (SAEs) on the residual stream of Qwen 3.5-35B-A3B, a 35-billion-parameter Mixture-of-Experts model with a hybrid GatedDeltaNet/attention architecture, and use them to identify and steer five agentic behavioral traits. Our method trains linear probes on SAE latent activations, then projects the probe weights back through the SAE decoder to obtain continuous steering vectors in the model's native activation space. This bypasses the SAE's top-k discretization, enabling fine-grained behavioral intervention at inference time with no retraining. Across 1,800 agent rollouts (50 scenarios × 36 conditions), we find that autonomy steering at multiplier 2 achieves Cohen's d = 1.01 (p < 0.0001), shifting the model from asking the user for help 78% of the time to proactively executing code and searching the web. Cross-trait analysis, however, reveals that all five steering vectors primarily modulate a single dominant agency axis (the disposition to act independently versus defer to the user), with trait-specific effects appearing only as secondary modulations in tool-type composition and dose-response shape. The tool-use vector steers behavior (d = 0.39); the risk-calibration vector produces only suppression. We additionally show that steering only during autoregressive decoding has zero effect (p > 0.35), providing causal evidence that behavioral commitments are computed during prefill in GatedDeltaNet architectures.
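For readers unfamiliar with the effect sizes quoted in the abstract, the sketch below computes Cohen's d using the common pooled-standard-deviation form. The abstract does not say which variant of d the authors use, so the pooled form is an assumption, and the scores are made-up toy values, not data from the paper.

```python
import math
import statistics

def cohens_d(treated, control):
    """Cohen's d with a pooled standard deviation (assumed variant)."""
    n1, n2 = len(treated), len(control)
    m1, m2 = statistics.mean(treated), statistics.mean(control)
    # statistics.variance is the sample variance (n - 1 denominator).
    v1, v2 = statistics.variance(treated), statistics.variance(control)
    pooled_sd = math.sqrt(((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2))
    return (m1 - m2) / pooled_sd

# Hypothetical per-rollout behavior scores for a steered and an unsteered condition.
steered = [2.0, 4.0, 6.0]
baseline = [1.0, 3.0, 5.0]
d = cohens_d(steered, baseline)  # means 4 and 3, pooled SD 2, so d = 0.5
```

By the usual rule of thumb, d near 0.2 is a small effect, 0.5 medium, and 0.8 or more large, which places the reported autonomy result (d = 1.01) in the large range and the tool-use result (d = 0.39) between small and medium.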

Architecture Design (Transformers, SSMs, MoE) · Interpretability & Mechanistic Interp · Tool Use & Agents

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
