This paper introduces a framework for predicting how CoT monitorability changes when an LLM is trained on a reward that decomposes into an output-dependent term and a Chain-of-Thought (CoT)-dependent term. The framework classifies the two terms as "aligned," "orthogonal," or "in-conflict" and predicts that training will improve, leave unchanged, or reduce CoT monitorability, respectively. Empirical validation across several RL environments confirms that training with "in-conflict" reward terms reduces CoT monitorability, and finds that such terms are difficult to optimize.
Training LLMs to optimize for conflicting objectives between the final output and the reasoning process can significantly degrade the monitorability of Chain-of-Thought, making oversight more difficult.
Chain-of-Thought (CoT) monitoring, in which automated systems monitor the CoT of an LLM, is a promising approach for effectively overseeing AI systems. However, the extent to which a model's CoT helps us oversee the model (the monitorability of the CoT) can be affected by training, for instance by the model learning to hide important features of its reasoning. We propose and empirically validate a conceptual framework for predicting when and why this occurs. We model LLM post-training as an RL environment where the reward decomposes into two terms: one term depending on final outputs and another term depending on the CoT. Our framework allows us to classify these two terms as "aligned", "orthogonal", or "in-conflict" before training. We predict that training with in-conflict terms will reduce monitorability, orthogonal terms will not affect it, and aligned terms will improve it. To validate our framework, we use it to classify a set of RL environments, train LLMs within those environments, and evaluate how training affects CoT monitorability. We find that (1) training with "in-conflict" reward terms reduces CoT monitorability and (2) optimizing in-conflict reward terms is difficult.
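The setup described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes the two reward terms combine additively (the abstract says only that the reward "decomposes into two terms"), and the names `Episode`, `total_reward`, `answer_is_42`, and `short_cot_bonus` are hypothetical, chosen for illustration. The sketch does not implement the classification procedure itself; it only records the three predictions stated in the abstract.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Episode:
    cot: str     # the model's chain-of-thought
    output: str  # the model's final output


def total_reward(
    ep: Episode,
    r_out: Callable[[str], float],
    r_cot: Callable[[str], float],
) -> float:
    """Reward with one output-dependent and one CoT-dependent term.

    ASSUMPTION: the terms are summed; the abstract does not state
    how the two terms are combined.
    """
    return r_out(ep.output) + r_cot(ep.cot)


# Predictions stated in the abstract, keyed by the class of the reward pair.
PREDICTED_EFFECT = {
    "aligned": "improves monitorability",
    "orthogonal": "does not affect monitorability",
    "in-conflict": "reduces monitorability",
}


# Hypothetical toy reward terms, for illustration only.
def answer_is_42(output: str) -> float:
    return 1.0 if output.strip() == "42" else 0.0


def short_cot_bonus(cot: str) -> float:
    return 0.5 if len(cot.split()) <= 5 else 0.0


ep = Episode(cot="6 times 7 is 42", output="42")
reward = total_reward(ep, answer_is_42, short_cot_bonus)
print(reward, "|", PREDICTED_EFFECT["in-conflict"])
```

Keeping the two terms as separate callables mirrors the framework's premise that each term can be inspected and classified before any training run.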