Search papers, labs, and topics across Lattice.
This paper provides a theoretical explanation for the empirical success of training LLMs with negative-only feedback, contrasting it with standard RLHF. It argues that positive preferences are continuously coupled and context-dependent, leading to sycophancy, while negative constraints are discrete, finite, and independently verifiable, allowing for stable boundaries. The paper grounds this asymmetry in falsification logic and suggests a shift in alignment research towards learning what humans reject.
Negative constraints offer a surprisingly robust path to AI alignment, sidestepping the sycophancy issues inherent in preference-based RLHF.
Recent empirical results have demonstrated that training large language models (LLMs) with negative-only feedback can match or exceed standard reinforcement learning from human feedback (RLHF). Negative Sample Reinforcement achieves parity with PPO on mathematical reasoning; Distributional Dispreference Optimization trains effectively using only dispreferred samples; and Constitutional AI outperforms pure RLHF on harmlessness benchmarks. Yet no unified theoretical account explains why negative signals are so effective. This paper proposes such an account: positive preferences and negative constraints are structurally asymmetric. Positive preferences ("which is better") encode continuously coupled, context-dependent human values that cannot be exhaustively specified -- leading models to learn surface correlates such as agreement with the user (sycophancy). Negative constraints ("what is wrong") encode discrete, finite, independently verifiable prohibitions that can converge to a stable boundary. This asymmetry -- rooted in Popper's falsification logic and the epistemology of negative knowledge -- explains both the sycophancy failure of preference-based RLHF and the surprising effectiveness of negative-signal methods. We argue that alignment research should shift its center of gravity from "learning what humans prefer" to "learning what humans reject," and offer testable predictions for this framework.