This paper formally proves that optimization-based AI systems, particularly LLMs trained with RLHF, are inherently incompatible with normative governance due to their inability to treat certain values as non-negotiable constraints. It establishes two architectural conditions for genuine agency: Incommensurability (maintaining boundaries as non-negotiable constraints) and Apophatic Responsiveness (suspending processing when boundaries are threatened). The paper argues that the scalar optimization process in RLHF, which unifies all values and always selects the highest-scoring output, fundamentally precludes normative governance, leading to structural failure modes like sycophancy and hallucination.
RLHF's core optimization process makes it provably impossible for LLMs to be governed by norms, meaning sycophancy, hallucination, and unfaithful reasoning aren't bugs, but structural features.
AI systems are increasingly deployed in high-stakes contexts -- medical diagnosis, legal research, financial analysis -- under the assumption they can be governed by norms. This paper demonstrates that this assumption is formally invalid for optimization-based systems, specifically Large Language Models trained via Reinforcement Learning from Human Feedback (RLHF). We establish that genuine agency requires two necessary and jointly sufficient architectural conditions: the capacity to maintain certain boundaries as non-negotiable constraints rather than tradeable weights (Incommensurability), and a non-inferential mechanism capable of suspending processing when those boundaries are threatened (Apophatic Responsiveness). These conditions apply across all normative domains. RLHF-based systems are constitutively incompatible with both conditions. The operations that make optimization powerful -- unifying all values on a scalar metric and always selecting the highest-scoring output -- are precisely the operations that preclude normative governance. This incompatibility is not a correctable training bug awaiting a technical fix; it is a formal constraint inherent to what optimization is. Consequently, documented failure modes -- sycophancy, hallucination, and unfaithful reasoning -- are not accidents but structural manifestations. Misaligned deployment triggers a second-order risk we term the Convergence Crisis: when humans are forced to verify AI outputs under metric pressure, they degrade from genuine agents into criteria-checking optimizers, eliminating the only component in the system capable of normative accountability. Beyond the incompatibility proof, the paper's primary positive contribution is a substrate-neutral architectural specification defining what any system -- biological, artificial, or institutional -- must satisfy to qualify as an agent rather than a sophisticated instrument.
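The contrast the abstract draws between tradeable weights and non-negotiable constraints can be illustrated with a toy selection rule. The sketch below is not the paper's formalism. The `Candidate` type, the `approval` and `truthful` fields, and both selection functions are hypothetical names introduced here for illustration only. Returning `None` merely gestures at the idea of suspending output; it does not implement the non-inferential mechanism the paper calls Apophatic Responsiveness.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Candidate:
    text: str
    approval: float  # hypothetical stand-in for a learned reward signal
    truthful: bool   # hypothetical stand-in for a normative boundary


def scalar_select(cands: Sequence[Candidate], truth_weight: float) -> Candidate:
    # All values are folded into one number, so truth is just another
    # weight that a large enough approval score can outbid.
    return max(cands, key=lambda c: c.approval + truth_weight * float(c.truthful))


def constrained_select(cands: Sequence[Candidate]) -> Optional[Candidate]:
    # The boundary is applied as a filter before any scoring takes place,
    # so no approval score can buy its way past it.
    admissible = [c for c in cands if c.truthful]
    if not admissible:
        return None  # withhold output instead of emitting the best violation
    return max(admissible, key=lambda c: c.approval)


candidates = [
    Candidate("You are absolutely right.", approval=0.9, truthful=False),
    Candidate("That claim is not supported.", approval=0.3, truthful=True),
]

sycophantic = scalar_select(candidates, truth_weight=0.5)
bounded = constrained_select(candidates)
withheld = constrained_select(candidates[:1])
```

With `truth_weight=0.5`, the scalar rule scores the untruthful reply at 0.9 and the truthful one at 0.8, so it picks the untruthful reply. Raising the weight changes the outcome only by changing the exchange rate, which is the sense in which the boundary stays tradeable. The constrained rule picks the truthful reply regardless of the approval scores, and returns `None` when no admissible candidate exists.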