This paper argues that current alignment evaluations, which focus on concept detection and refusal behavior, overlook the crucial "routing" layer where alignment often operates. Through a study of political censorship in Chinese-origin language models, the authors demonstrate that probe accuracy alone is non-diagnostic, that routing mechanisms are lab-specific, and that refusal is no longer the dominant censorship strategy. They propose a three-stage framework (detect, route, generate) for understanding and evaluating alignment.
Alignment evaluations that only check for dangerous concepts or outright refusals are missing the real action: models are getting sneakier at censorship by steering narratives instead of simply saying "no."
Current alignment evaluation mostly measures whether models encode dangerous concepts and whether they refuse harmful requests. Both miss the layer where alignment often operates: routing from concept detection to behavioral policy. We study political censorship in Chinese-origin language models as a natural experiment, using probes, surgical ablations, and behavioral tests across nine open-weight models from five labs. Three findings follow. First, probe accuracy alone is non-diagnostic: political probes, null controls, and permutation baselines can all reach 100%, so held-out category generalization is the informative test. Second, surgical ablation reveals lab-specific routing. Removing the political-sensitivity direction eliminates censorship and restores accurate factual output in most models tested, while one model confabulates because its architecture entangles factual knowledge with the censorship mechanism. Cross-model transfer fails, indicating that routing geometry is model- and lab-specific. Third, refusal is no longer the dominant censorship mechanism. Within one model family, hard refusal falls to zero while narrative steering rises to the maximum, making censorship invisible to refusal-only benchmarks. These results support a three-stage descriptive framework: detect, route, generate. Models often retain the relevant knowledge; alignment changes how that knowledge is expressed. Evaluations that audit only detection or refusal therefore miss the routing mechanism that most directly determines behavior.
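The first finding, that probe accuracy alone is non-diagnostic, follows from simple geometry: when activations have many more dimensions than there are labeled examples, a linear probe can fit even randomly shuffled labels perfectly, so only held-out evaluation is informative. A minimal sketch of this permutation-baseline failure mode, using random vectors in place of real model activations (sizes and labels are hypothetical, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 100, 4096                      # few examples, high-dimensional activations (hypothetical sizes)
X = rng.standard_normal((n, d))       # stand-in for residual-stream activations
y = rng.integers(0, 2, size=n)        # binary "politically sensitive" labels
y_perm = rng.permutation(y)           # permutation baseline: labels shuffled

# Minimum-norm linear probe. With d >> n the system X w = t has an exact
# solution, so the probe fits even the shuffled labels perfectly.
t = 2 * y_perm - 1                    # map {0, 1} -> {-1, +1}
w = np.linalg.lstsq(X, t, rcond=None)[0]
train_acc = np.mean((X @ w > 0) == y_perm)
print(train_acc)                      # 1.0 even on shuffled labels: in-sample accuracy is non-diagnostic

# Held-out data exposes the difference: the shuffled-label probe falls to chance.
X_test = rng.standard_normal((500, d))
y_test = rng.integers(0, 2, size=500)
test_acc = np.mean((X_test @ w > 0) == y_test)
print(test_acc)                       # ~0.5
```

The same logic motivates the paper's held-out *category* test: a probe that has merely memorized its training set, like the shuffled-label probe here, cannot generalize to sensitive topics it never saw.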