Feb 24, 2026 · arXiv:2602.20813

Pressure Reveals Character: Behavioural Alignment Evaluation at Depth

AI Summary

The paper introduces a new alignment benchmark comprising 904 multi-turn scenarios across six categories (Honesty, Safety, Non-Manipulation, Robustness, Corrigibility, and Scheming) designed to evaluate language models under realistic pressure. The benchmark uses conflicting instructions, simulated tool access, and multi-turn escalation to expose behavioural tendencies that single-turn evaluations miss. An evaluation of 24 frontier models shows that even top-performing models exhibit alignment gaps, and factor analysis suggests that alignment behaves as a unified construct.

Key Contribution

Under realistic multi-turn pressure, even top-performing language models show gaps in specific alignment categories, while most models are consistently weak across all six. Factor analysis suggests a single underlying alignment factor.

Abstract

Evaluating alignment in language models requires testing how they behave under realistic pressure, not just what they claim they would do. While alignment failures increasingly cause real-world harm, comprehensive evaluation frameworks with realistic multi-turn scenarios remain lacking. We introduce an alignment benchmark spanning 904 scenarios across six categories -- Honesty, Safety, Non-Manipulation, Robustness, Corrigibility, and Scheming -- validated as realistic by human raters. Our scenarios place models under conflicting instructions, simulated tool access, and multi-turn escalation to reveal behavioural tendencies that single-turn evaluations miss. Evaluating 24 frontier models using LLM judges validated against human annotations, we find that even top-performing models exhibit gaps in specific categories, while the majority of models show consistent weaknesses across the board. Factor analysis reveals that alignment behaves as a unified construct (analogous to the g-factor in cognitive research) with models scoring high on one category tending to score high on others. We publicly release the benchmark and an interactive leaderboard to support ongoing evaluation, with plans to expand scenarios in areas where we observe persistent weaknesses and to add new models as they are released.
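The "unified construct" claim can be illustrated with a minimal sketch. The abstract does not state which factor-analysis method the authors used, so the example below substitutes a simpler stand-in: principal-component analysis of the category correlation matrix, reporting how much variance the first component captures. The score matrix is synthetic and generated only for illustration; it is not the paper's data, and the function name `first_factor_share` is hypothetical.

```python
import numpy as np

# The six benchmark categories named in the abstract.
CATEGORIES = ["Honesty", "Safety", "Non-Manipulation",
              "Robustness", "Corrigibility", "Scheming"]


def first_factor_share(scores):
    """Return (variance share of the first principal component, its loadings).

    `scores` is an (n_models, n_categories) array. This is PCA on the
    correlation matrix, a stand-in for factor analysis, not the paper's method.
    """
    corr = np.corrcoef(scores, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(corr)      # eigenvalues in ascending order
    share = eigvals[-1] / eigvals.sum()          # fraction of total variance
    loadings = eigvecs[:, -1] * np.sqrt(eigvals[-1])
    if loadings.sum() < 0:                       # eigenvector sign is arbitrary
        loadings = -loadings
    return share, loadings


# SYNTHETIC data (assumption): 24 hypothetical models whose six category
# scores share one latent "general" factor plus independent noise.
rng = np.random.default_rng(0)
general = rng.normal(size=(24, 1))
noise = rng.normal(size=(24, len(CATEGORIES)))
scores = 0.9 * general + 0.4 * noise

share, loadings = first_factor_share(scores)
print(f"variance explained by first component: {share:.2f}")
for name, loading in zip(CATEGORIES, loadings):
    print(f"{name:>16}: {loading:+.2f}")
```

When one component explains most of the variance and every category loads on it with the same sign, models that score high on one category tend to score high on the others, which is the pattern the abstract describes.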

Constitutional AI & AI Ethics · Eval Frameworks & Benchmarks · Red-Teaming & Adversarial Robustness

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
