MilaAstra FellowshipELLISMax PlanckTübingen AI CenterJun 9, 2026arXiv:2606.11063

CIAware-Bench: Benchmarking Control Intervention Awareness Across Frontier LLMs

Joachim Schaeffer, Thomas Jiralerspong, Alexander Panfilov, Guillaume Lajoie, Jonas Geiping, Yoshua Bengio, Roland S. Zimmermann

AI Summary

This paper introduces CIAware-Bench, a benchmark designed to evaluate control intervention (CI) awareness in frontier large language models (LLMs). By assessing whether models can differentiate their own trajectories from those altered by control interventions across various task domains, the study reveals that CI awareness is generally low to moderate, with significant variability depending on the model and task. The findings highlight that CI awareness is not a static characteristic of models, necessitating ongoing evaluation with each new model release to enhance control protocol effectiveness.

Key Contribution

Control interventions are often detected by LLMs, with awareness levels varying significantly across models and tasks, revealing vulnerabilities in AI safety protocols.

Abstract

AI control protocols oversee untrusted models by monitoring their actions and modifying potentially unsafe steps, often using a trusted model. This partially tampers with the untrusted model's trajectory. If the trusted model detects such an intervention, it may infer properties of the monitor and adapt to evade control. We introduce \textbf{CIAware-Bench}, a benchmark for measuring \textbf{c}ontrol \textbf{i}ntervention (CI) awareness across frontier models. CIAware-Bench tests whether models can distinguish their own trajectories from those modified by a control intervention. The benchmark is comprised of a suite of four task domains (essay writing, BigCodeBench, Bash Arena, and SHADE-Arena), while varying trajectory watermarking, side-task presence, and the control protocol. Evaluating eleven frontier models, we find low to moderate CI awareness under default settings (up to 0.87; random chance balanced binary classification accuracy is 0.5) with substantial variation across task domains and model pairs. Detection is generally easier across model families, suggesting that models exploit provider-specific differences in style or post-training. Overall, CI awareness is not a fixed model-level property, and should be measured for each new model release and deployment scenario. We release CIAware-Bench to track CI awareness and inform control protocols whose interventions are harder to detect.

Eval Frameworks & Benchmarks Scalable Oversight & Alignment Theory

Citation Metrics

Citations0

Influential citations0

References0

Year2026

VenueN/A

Related Papers

Finding related papers...

Search

CIAware-Bench: Benchmarking Control Intervention Awareness Across Frontier LLMs

Related Papers