DeepMind | Mar 31, 2026 | arXiv:2603.30036

Aligned, Orthogonal or In-conflict: When can we safely optimize Chain-of-Thought?

Max Kaufmann, David Lindner, Roland S. Zimmermann, and Rohin Shah

AI Summary

This paper introduces a framework for predicting how LLM training affects Chain-of-Thought (CoT) monitorability when the reward function decomposes into an output-dependent term and a CoT-dependent term. The framework classifies the two terms as "aligned," "orthogonal," or "in-conflict" before training, and predicts that training will improve, leave unchanged, or reduce CoT monitorability, respectively. Experiments across a set of RL environments find that training with "in-conflict" reward terms reduces CoT monitorability and that in-conflict reward terms are difficult to optimize.

Key Contribution

Training LLMs with a reward whose output-dependent term and CoT-dependent term are in conflict reduces the monitorability of the Chain-of-Thought, making oversight more difficult.

Abstract

Chain-of-Thought (CoT) monitoring, in which automated systems monitor the CoT of an LLM, is a promising approach for effectively overseeing AI systems. However, the extent to which a model's CoT helps us oversee the model - the monitorability of the CoT - can be affected by training, for instance by the model learning to hide important features of its reasoning. We propose and empirically validate a conceptual framework for predicting when and why this occurs. We model LLM post-training as an RL environment where the reward decomposes into two terms: one term depending on final outputs and another term depending on the CoT. Our framework allows us to classify these two terms as "aligned", "orthogonal", or "in-conflict" before training. We predict that training with in-conflict terms will reduce monitorability, orthogonal terms will not affect it, and aligned terms will improve it. To validate our framework, we use it to classify a set of RL environments, train LLMs within those environments, and evaluate how training affects CoT monitorability. We find that (1) training with "in-conflict" reward terms reduces CoT monitorability and (2) optimizing in-conflict reward terms is difficult.
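The reward decomposition described in the abstract can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the `Rollout` type, the exact-match output term, the banned-phrase CoT term, the additive combination, and the `cot_weight` parameter are all hypothetical. The sketch only shows what it means for one term to depend on the final output and the other on the CoT; it does not implement the paper's classification into aligned, orthogonal, or in-conflict terms.

```python
from dataclasses import dataclass


@dataclass
class Rollout:
    cot: str     # the model's chain-of-thought text
    output: str  # the model's final answer


def output_reward(rollout: Rollout, target: str) -> float:
    """Hypothetical output-dependent term: 1.0 for an exact-match answer."""
    return 1.0 if rollout.output.strip() == target.strip() else 0.0


def cot_reward(rollout: Rollout, banned_phrase: str) -> float:
    """Hypothetical CoT-dependent term: -1.0 if a banned phrase appears in the CoT."""
    return -1.0 if banned_phrase.lower() in rollout.cot.lower() else 0.0


def total_reward(
    rollout: Rollout,
    target: str,
    banned_phrase: str,
    cot_weight: float = 1.0,  # assumed weighting, not from the paper
) -> float:
    """Assumed additive combination of the two terms."""
    return output_reward(rollout, target) + cot_weight * cot_reward(rollout, banned_phrase)
```

For example, a rollout with a correct answer whose CoT contains the banned phrase scores 1.0 on the output term and -1.0 on the CoT term. Keeping the two terms as separate functions makes it possible to inspect each one on its own before any training is run.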

Topics: Reasoning & Chain-of-Thought; RLHF & Preference Learning; Scalable Oversight & Alignment Theory

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
