Apr 20, 2026 · arXiv:2604.17698

The Geometric Canary: Predicting Steerability and Detecting Drift via Representational Stability

AI Summary

This study uses the geometric stability of language model representations, meaning the consistency of their pairwise distance structure, to predict steerability and to detect internal drift, two capabilities that share a geometric foundation. Supervised Shesha variants predict linear steerability closely (ρ = 0.89-0.97) across 35-69 embedding models and three NLP tasks, while unsupervised stability fails at that task on real-world data (ρ ≈ 0.10) but is the more sensitive drift detector. The result is a dissociation: task-aligned stability is needed to predict controllability, whereas unsupervised stability is better suited to flagging structural change during post-training alignment and in post-deployment monitoring.

Key Contribution

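Supervised and unsupervised geometric stability serve as complementary diagnostics: the supervised variant predicts linear steerability before deployment (ρ = 0.89-0.97), and the unsupervised variant detects representational drift with greater sensitivity than CKA and a lower false alarm rate than Procrustes.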

Abstract

Reliable deployment of language models requires two capabilities that appear distinct but share a common geometric foundation: predicting whether a model will accept targeted behavioral control, and detecting when its internal structure degrades. We show that geometric stability, the consistency of a representation's pairwise distance structure, addresses both. Supervised Shesha variants that measure task-aligned geometric stability predict linear steerability with near-perfect accuracy (ρ = 0.89-0.97) across 35-69 embedding models and three NLP tasks, capturing unique variance beyond class separability (partial ρ = 0.62-0.76). A critical dissociation emerges: unsupervised stability fails entirely for steering on real-world tasks (ρ ≈ 0.10), revealing that task alignment is essential for controllability prediction. However, unsupervised stability excels at drift detection, measuring nearly 2× greater geometric change than CKA during post-training alignment (up to 5.23× in Llama) while providing earlier warning in 73% of models and maintaining a 6× lower false alarm rate than Procrustes. Together, supervised and unsupervised stability form complementary diagnostics for the LLM deployment lifecycle: one for pre-deployment controllability assessment, the other for post-deployment monitoring.
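To make "consistency of a representation's pairwise distance structure" concrete, here is a minimal sketch of one possible unsupervised stability score. It assumes a split-half formulation: the feature dimensions are randomly split in two, pairwise distances are computed in each half, and the two distance vectors are rank-correlated. This is an illustrative assumption, not the paper's Shesha definition, which the abstract does not spell out; all function names (`split_half_stability`, `pairwise_distances`, and so on) are hypothetical.

```python
import math
import random


def pairwise_distances(rows):
    """Upper-triangle Euclidean distances between all pairs of rows."""
    n = len(rows)
    return [math.dist(rows[i], rows[j]) for i in range(n) for j in range(i + 1, n)]


def ranks(values):
    """1-based ranks; tied values share their mean rank."""
    order = sorted(range(len(values)), key=values.__getitem__)
    out = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            out[order[k]] = mean_rank
        i = j + 1
    return out


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def split_half_stability(embeddings, n_splits=20, seed=0):
    """Assumed stability score: mean rank correlation between the pairwise
    distance structures of two random halves of the feature dimensions."""
    rng = random.Random(seed)
    dim = len(embeddings[0])
    scores = []
    for _ in range(n_splits):
        idx = list(range(dim))
        rng.shuffle(idx)
        half_a, half_b = idx[: dim // 2], idx[dim // 2:]
        view_a = [[row[k] for k in half_a] for row in embeddings]
        view_b = [[row[k] for k in half_b] for row in embeddings]
        scores.append(
            spearman(pairwise_distances(view_a), pairwise_distances(view_b))
        )
    return sum(scores) / len(scores)


# Toy data (synthetic, not from the paper): one set of embeddings driven by a
# shared latent variable, and one set of pure noise.
rng = random.Random(1)
latent = [rng.gauss(0, 3) for _ in range(30)]
structured = [[z + rng.gauss(0, 0.1) for _ in range(16)] for z in latent]
noise = [[rng.gauss(0, 1) for _ in range(16)] for _ in range(30)]

print("structured:", round(split_half_stability(structured), 3))
print("noise:", round(split_half_stability(noise), 3))
```

On this toy data the structured embeddings should score far higher than the noise, since both halves of their dimensions encode the same distances. Under the same assumption, a supervised variant would restrict or weight the comparison by task labels, and a drift monitor would track how the score or the distance structure changes between model checkpoints.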

Interpretability & Mechanistic Interp · Scalable Oversight & Alignment Theory

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
