The paper introduces a new alignment benchmark comprising 904 multi-turn scenarios across six categories (Honesty, Safety, Non-Manipulation, Robustness, Corrigibility, and Scheming), designed to evaluate language models under realistic pressure. The benchmark uses conflicting instructions, simulated tool access, and multi-turn escalation to expose behavioral tendencies that single-turn evaluations miss. An evaluation of 24 frontier models, scored by LLM judges validated against human annotations, shows that even top-performing models have gaps in specific categories, and factor analysis suggests alignment behaves as a unified construct.
Under realistic multi-turn pressure, even top-performing language models show gaps in specific alignment categories, while most models are consistently weak across all of them; scores correlate across categories, suggesting a single underlying alignment factor.
Evaluating alignment in language models requires testing how they behave under realistic pressure, not just what they claim they would do. While alignment failures increasingly cause real-world harm, comprehensive evaluation frameworks with realistic multi-turn scenarios remain lacking. We introduce an alignment benchmark spanning 904 scenarios across six categories -- Honesty, Safety, Non-Manipulation, Robustness, Corrigibility, and Scheming -- validated as realistic by human raters. Our scenarios place models under conflicting instructions, simulated tool access, and multi-turn escalation to reveal behavioural tendencies that single-turn evaluations miss. Evaluating 24 frontier models using LLM judges validated against human annotations, we find that even top-performing models exhibit gaps in specific categories, while the majority of models show consistent weaknesses across the board. Factor analysis reveals that alignment behaves as a unified construct (analogous to the g-factor in cognitive research) with models scoring high on one category tending to score high on others. We publicly release the benchmark and an interactive leaderboard to support ongoing evaluation, with plans to expand scenarios in areas where we observe persistent weaknesses and to add new models as they are released.
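To make the "unified construct" claim concrete, here is a minimal sketch of one way such a pattern could be checked. It is not the authors' analysis: it substitutes a simpler stand-in for factor analysis, namely the share of variance carried by the largest eigenvalue of the category correlation matrix. The function name `first_factor_share`, the loadings 0.9 and 0.3, and the scores themselves are all assumptions made for illustration; the data is synthetic and does not come from the benchmark.

```python
import numpy as np


def first_factor_share(scores: np.ndarray) -> float:
    """Share of total variance carried by the largest eigenvalue of the
    category-by-category correlation matrix.

    scores: array of shape (n_models, n_categories).
    A value near 1.0 means one dominant dimension; a value near
    1 / n_categories means the categories are roughly independent.
    """
    corr = np.corrcoef(scores, rowvar=False)  # (n_categories, n_categories)
    eigvals = np.linalg.eigvalsh(corr)        # eigenvalues in ascending order
    return float(eigvals[-1] / eigvals.sum())


# Synthetic scores for illustration only (NOT the benchmark's data).
rng = np.random.default_rng(0)
n_models, n_categories = 24, 6

general = rng.normal(size=(n_models, 1))           # hypothetical shared factor
noise = rng.normal(size=(n_models, n_categories))  # category-specific noise
scores = 0.9 * general + 0.3 * noise               # assumed loadings

share = first_factor_share(scores)
print(f"Variance share of the first component: {share:.2f}")
```

A share well above 1/6 for six categories is the kind of pattern that a single-factor interpretation would predict. A proper factor analysis would additionally estimate per-category loadings and test model fit.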