Stanford HAI, Harvard, Northeastern, UCL | May 6, 2026 | arXiv:2605.05115

Manifold Steering Reveals the Shared Geometry of Neural Network Representation and Behavior

Daniel Wurgaft, Can Rager, Matthew Kowal, Vasudev Shyam, Sheridan Feucht, Usha Bhalla, Tal Haklay, Eric Bigelow, Raphael Sarfati, Thomas McGrath, Owen Lewis, Jack Merullo, Noah Goodman, Thomas Fel, Atticus Geiger, Ekdeep Singh Lubana

AI Summary

This paper investigates the causal relationship between the geometry of neural network representations and the resulting behavior by intervening in activation space. The authors fit manifolds to both activations ($M_h$) and output probability distributions ($M_y$) and show that steering along $M_h$ produces behavioral trajectories that follow $M_y$, whereas linear steering does not. Conversely, optimizing interventions to produce paths along $M_y$ recovers activation trajectories that follow the curvature of $M_h$, indicating a bidirectional relationship that holds across language and video models.

Key Contribution

Steering neural networks through the intrinsic geometry of their activations unlocks more natural and controllable behaviors than traditional linear interventions.

Abstract

Neural representations carry rich geometric structure; but does that structure causally shape behavior? To address this question, we intervene along paths through activation space defined by different geometries, and measure the behavioral trajectories they induce. In particular, we test whether interventions that respect the geometry of activation space will yield behaviors close to those the model exhibits naturally. Concretely, we first fit an activation manifold $M_h$ to representations and a behavior manifold $M_y$ to output probability distributions. We then test the link $M_h \leftrightarrow M_y$ via interventions: we find that steering along $M_h$, which we term manifold steering, yields behavioral trajectories that follow $M_y$, while linear steering -- which assumes a Euclidean geometry -- cuts through off-manifold regions and hence produces unnatural outputs. Moreover, optimizing interventions in activation space to produce paths along $M_y$ recovers activation trajectories that trace the curvature of $M_h$. We demonstrate this bidirectional relationship between the geometry of representation and behavior across tasks and modalities. In language models, we use reasoning tasks with cyclic and sequential geometries as well as in-context learning tasks with more complex graph geometries. In a video world model, we use a task with geometry corresponding to physical dynamics. Overall, our work shows that geometry in neural representation is not merely incidental, but is in fact the proper object for enabling principled control via intervention on internals. This recasts the core problem of steering from finding the right direction to finding the right geometry.
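The contrast between linear and manifold steering can be illustrated with a toy sketch. This is not the paper's method or code: it assumes, purely for illustration, that the activation manifold is a unit circle in a 2D space (loosely echoing the cyclic geometries mentioned above), and all function names are hypothetical. It compares a straight-line path between two points with a path that stays on the circle, and measures how far each path strays from the manifold.

```python
import numpy as np

# Toy assumption: the "activation manifold" is the unit circle in 2D.
# All names below are hypothetical and chosen only for this illustration.

def linear_path(h_start, h_end, n_steps):
    """Straight-line (Euclidean) interpolation between two activation vectors."""
    t = np.linspace(0.0, 1.0, n_steps)[:, None]
    return (1.0 - t) * h_start + t * h_end

def circle_path(theta_start, theta_end, n_steps, radius=1.0):
    """Path that stays on the circle by interpolating the angle instead."""
    theta = np.linspace(theta_start, theta_end, n_steps)
    return radius * np.stack([np.cos(theta), np.sin(theta)], axis=1)

def distance_to_circle(points, radius=1.0):
    """Distance of each point from the circle of the given radius."""
    return np.abs(np.linalg.norm(points, axis=1) - radius)

# Steer between two opposite points on the circle.
theta_a, theta_b = 0.0, np.pi
h_a = np.array([np.cos(theta_a), np.sin(theta_a)])
h_b = np.array([np.cos(theta_b), np.sin(theta_b)])

lin = linear_path(h_a, h_b, n_steps=5)
man = circle_path(theta_a, theta_b, n_steps=5)

lin_dev = distance_to_circle(lin).max()  # largest off-manifold deviation, linear path
man_dev = distance_to_circle(man).max()  # largest off-manifold deviation, on-circle path
print(f"linear path max deviation:   {lin_dev:.3f}")
print(f"manifold path max deviation: {man_dev:.3f}")
```

In this toy setting both paths share the same endpoints, but the midpoint of the straight line is the origin, a full radius away from the circle, while the angular path never leaves it. A fitted manifold in a real model would replace the closed-form circle used here.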

Interpretability & Mechanistic Interp

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
