The authors train sparse autoencoders (SAEs) on the residual stream of a 35B MoE model (Qwen 3.5-35B-A3B) and derive steering vectors for agentic behaviors by projecting linear probe weights back through the SAE decoder. They find that the steering vectors for all five purported agentic traits primarily modulate a single "agency axis": the disposition to act independently versus defer to the user. They also show that steering applied only during autoregressive decoding has no detectable effect, suggesting that behavioral commitments are made during prefill in GatedDeltaNet architectures.
Forget fine-tuning: steer a 35B MoE's agency on the fly with SAE-decoded vectors, revealing a surprisingly simple, one-dimensional control knob.
We train nine sparse autoencoders (SAEs) on the residual stream of Qwen 3.5-35B-A3B, a 35-billion-parameter Mixture-of-Experts model with a hybrid GatedDeltaNet/attention architecture, and use them to identify and steer five agentic behavioral traits. Our method trains linear probes on SAE latent activations, then projects the probe weights back through the SAE decoder to obtain continuous steering vectors in the model's native activation space. This bypasses the SAE's top-k discretization, enabling fine-grained behavioral intervention at inference time with no retraining. Across 1,800 agent rollouts (50 scenarios × 36 conditions), we find that autonomy steering at multiplier 2 achieves Cohen's d = 1.01 (p < 0.0001), shifting the model from asking the user for help 78% of the time to proactively executing code and searching the web. Cross-trait analysis, however, reveals that all five steering vectors primarily modulate a single dominant agency axis (the disposition to act independently versus defer to the user), with trait-specific effects appearing only as secondary modulations in tool-type composition and dose-response shape. The tool-use vector steers behavior (d = 0.39); the risk-calibration vector produces only suppression. We additionally show that steering only during autoregressive decoding has no detectable effect (p > 0.35), providing causal evidence that behavioral commitments are computed during prefill in GatedDeltaNet architectures.
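The probe-to-steering-vector projection described above can be illustrated with a minimal NumPy sketch. This is not the authors' code. It assumes the SAE decoder is a plain matrix `W_dec` of shape `(n_latents, d_model)`, that the probe is a single weight vector over SAE latents, that the resulting vector is unit-normalized before scaling, and that steering is a simple additive shift on residual activations. The function names `steering_vector` and `apply_steering`, the normalization step, and the `positions` argument (used here to restrict steering to prompt tokens) are all hypothetical.

```python
import numpy as np


def steering_vector(probe_w, W_dec, normalize=True):
    """Project probe weights over SAE latents into the model's activation space.

    probe_w: (n_latents,) linear-probe weights trained on SAE latent activations.
    W_dec:   (n_latents, d_model) SAE decoder matrix (assumed layout).
    Returns a (d_model,) vector; unit-normalized if `normalize` is True
    (the normalization is an assumption of this sketch).
    """
    probe_w = np.asarray(probe_w, dtype=np.float64)
    W_dec = np.asarray(W_dec, dtype=np.float64)
    if W_dec.ndim != 2 or probe_w.shape != (W_dec.shape[0],):
        raise ValueError("probe_w must have one weight per SAE latent (row of W_dec)")
    # Weighted sum of decoder directions: no top-k selection is involved,
    # so the result is a dense, continuous direction in activation space.
    v = probe_w @ W_dec
    if normalize:
        norm = np.linalg.norm(v)
        if norm > 0:
            v = v / norm
    return v


def apply_steering(resid, v, multiplier, positions=None):
    """Return a copy of `resid` with `multiplier * v` added at `positions`.

    resid:     (seq_len, d_model) residual-stream activations.
    positions: slice or index array of token positions to steer;
               None steers every position.
    """
    out = np.array(resid, dtype=np.float64, copy=True)
    if positions is None:
        out += multiplier * v
    else:
        out[positions] += multiplier * v
    return out


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n_latents, d_model, seq_len, prompt_len = 64, 16, 10, 6
    W_dec = rng.normal(size=(n_latents, d_model))   # stand-in decoder
    probe_w = rng.normal(size=n_latents)            # stand-in probe weights
    v = steering_vector(probe_w, W_dec)
    resid = rng.normal(size=(seq_len, d_model))
    # Steer only the prompt (prefill) positions, leaving later tokens untouched.
    steered = apply_steering(resid, v, multiplier=2.0, positions=slice(0, prompt_len))
    print(steered.shape)
```

In a real model the additive shift would be applied inside the forward pass at the chosen layer, for example via a hook, rather than on a stored array. The `positions` argument here only mimics the distinction between steering during prefill and steering during decoding.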