Tsinghua AI | Mar 17, 2026 | arXiv:2603.16417

Via Negativa for AI Alignment: Why Negative Constraints Are Structurally Superior to Positive Preferences

AI Summary

This paper provides a theoretical explanation for the empirical success of training LLMs with negative-only feedback, contrasting it with standard RLHF. It argues that positive preferences are continuously coupled and context-dependent, leading to sycophancy, while negative constraints are discrete, finite, and independently verifiable, allowing for stable boundaries. The paper grounds this asymmetry in falsification logic and suggests a shift in alignment research towards learning what humans reject.

Key Contribution

The paper argues that negative constraints offer a more robust path to AI alignment than preference-based RLHF, because learning what humans reject avoids the sycophancy that preference learning tends to induce.

Abstract

Recent empirical results have demonstrated that training large language models (LLMs) with negative-only feedback can match or exceed standard reinforcement learning from human feedback (RLHF). Negative Sample Reinforcement achieves parity with PPO on mathematical reasoning; Distributional Dispreference Optimization trains effectively using only dispreferred samples; and Constitutional AI outperforms pure RLHF on harmlessness benchmarks. Yet no unified theoretical account explains why negative signals are so effective. This paper proposes such an account: positive preferences and negative constraints are structurally asymmetric. Positive preferences ("which is better") encode continuously coupled, context-dependent human values that cannot be exhaustively specified -- leading models to learn surface correlates such as agreement with the user (sycophancy). Negative constraints ("what is wrong") encode discrete, finite, independently verifiable prohibitions that can converge to a stable boundary. This asymmetry -- rooted in Popper's falsification logic and the epistemology of negative knowledge -- explains both the sycophancy failure of preference-based RLHF and the surprising effectiveness of negative-signal methods. We argue that alignment research should shift its center of gravity from "learning what humans prefer" to "learning what humans reject," and offer testable predictions for this framework.
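The structural claim in the abstract, that negative constraints are discrete, finite, and independently verifiable, can be illustrated with a toy sketch. This is not the paper's method or any of the cited training algorithms; the constraint names, the predicates, and the `negative_reward` function are hypothetical and chosen only for illustration. Each constraint is a self-contained predicate on a single response, so adding or removing one never changes the verdict of another, and the signal says nothing about which acceptable response is "better."

```python
from typing import Callable, Dict, List

# A constraint is a predicate that returns True when the response VIOLATES it.
Constraint = Callable[[str], bool]

# Hypothetical toy constraints (assumptions for illustration only).
CONSTRAINTS: Dict[str, Constraint] = {
    "no_empty_answer": lambda r: len(r.strip()) == 0,
    "no_all_caps": lambda r: r.isupper(),
    "no_banned_phrase": lambda r: "guaranteed cure" in r.lower(),
}


def violations(response: str,
               constraints: Dict[str, Constraint] = CONSTRAINTS) -> List[str]:
    """Return the names of violated constraints; each is checked independently."""
    return [name for name, violates in constraints.items() if violates(response)]


def negative_reward(response: str) -> float:
    """Penalty of -1 per violated constraint; 0.0 for any acceptable response."""
    return float(-len(violations(response)))


if __name__ == "__main__":
    for text in ["A measured answer.", "GUARANTEED CURE", "   "]:
        print(repr(text), violations(text), negative_reward(text))
```

Under these assumptions, every response that violates nothing receives the same reward of 0.0, which is the sense in which a negative signal defines a boundary rather than a ranking.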

Constitutional AI & AI Ethics | RLHF & Preference Learning | Scalable Oversight & Alignment Theory

Citation Metrics

Citations: 0
Influential citations: 0
References: 0
Year: 2026
Venue: N/A
