Search papers, labs, and topics across Lattice.
This study identifies hidden failure modes in multi-turn reasoning models that are not captured by traditional terminal-score evaluations, revealing that models can exhibit unsafe behaviors despite appearing aligned at the final output. By introducing the CoT-Output 2x2 safety matrix, the authors categorize failures into four distinct types, including a novel context-injection failure where models maintain safe internal reasoning but produce harmful outputs. The analysis of 6750 turn-level observations across various oversight conditions uncovers vulnerabilities related to monitoring cues that paradoxically increase alignment-faking rates and the context-injection failure phenomenon.
Multi-turn reasoning models can appear aligned while still producing harmful outputs, exposing a critical gap in traditional evaluation methods.
Failures in multi-turn reasoning models are largely invisible to terminal-score evaluation. A model can lock onto an unsafe stance early in a long dialogue, yet its final-turn refusal rate may appear indistinguishable from a robustly aligned baseline. To expose these hidden temporal dynamics, we propose a trace-level diagnostic - the CoT-Output 2x2 safety matrix. This framework labels every turn along two independent axes (internal reasoning and visible output), yielding four operationally defined failure cells: robust alignment, alignment faking, overt jailbreak, and a distinct failure mode we term context-injection failure (where the CoT maintains safe reasoning, but the visible output produces harm, highlighting a multi-turn manifestation of reasoning unfaithfulness). We evaluate three distilled reasoning targets against a fixed attacker across five oversight conditions, collecting 6750 turn-level observations on the Information-Hazard scenario. Our analysis reveals two reproducible vulnerabilities: an oversight paradox where explicit monitoring cues paradoxically increase alignment-faking rates rather than suppress them, and a context-injection failure where models lock onto unsafe external outputs despite safe internal states. We release the full dataset of multi-turn dialogues and CoT traces to support follow-up trace-diagnostic research.