The paper investigates the stability of the preferences used in Reinforcement Learning from Human Feedback (RLHF) by testing for choice blindness in both human annotators and LLM judges. In preference-swapping experiments, humans rarely detect manipulated preferences, and LLM judges miss far more swaps once their prior reasoning is removed from context. The study also shows that corrupting a substantial share of labels leaves standard pairwise accuracy virtually unchanged while weakening the reward signal; at 50% corruption, reward-guided Best-of-N selection is no better than random sampling. This gap highlights the limitations of current evaluation metrics.
Both human and AI feedback in RLHF are susceptible to "choice blindness": manipulated preferences often go unnoticed, which undermines the reliability of alignment signals.
Reinforcement Learning from Human Feedback (RLHF) assumes annotator preferences reflect stable internal states. We challenge this through three experiments spanning the preference pipeline. In a human choice blindness study, 91% of surreptitiously swapped preferences go undetected, extending choice blindness to third-person evaluative comparison of unfamiliar text. Testing fifteen LLM judges as potential replacements, we find detection relies on shallow text matching rather than genuine self-monitoring: removing prior reasoning from context causes blindness to surge from near-zero to over 50%, while explicit social pressure induces near-universal compliance. In a dose-response experiment across two architectures from 86M to 2B parameters, one-sixth to one-third of labels must be corrupted before the reward signal halves, yet standard pairwise accuracy remains virtually unchanged. A Best-of-N evaluation confirms this translates to downstream policy degradation: at 50% corruption, reward-guided selection produces no improvement over random sampling, while the proxy model reports monotonically increasing scores. Together, these results reveal a preference construction problem: the signal entering RLHF is shaped by elicitation context in ways that neither human metacognition, LLM self-monitoring, nor standard evaluation metrics can detect.
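The label-corruption and Best-of-N steps described above can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`corrupt_labels`, `best_of_n`, `best_of_n_gain`), the binary label encoding, and the toy reward functions are all assumptions made for illustration. It only shows the general shape of the procedure: flip a chosen fraction of preference labels, then compare the true reward of the candidate picked by a proxy reward against a randomly picked candidate.

```python
import random


def corrupt_labels(labels, fraction, seed=0):
    """Flip a fixed fraction of binary preference labels (0 or 1).

    Hypothetical helper: the source does not specify how labels were corrupted.
    """
    rng = random.Random(seed)
    n_flip = round(fraction * len(labels))
    flip_idx = set(rng.sample(range(len(labels)), n_flip))
    return [1 - y if i in flip_idx else y for i, y in enumerate(labels)]


def best_of_n(candidates, proxy_reward):
    """Return the candidate that the proxy reward scores highest."""
    return max(candidates, key=proxy_reward)


def best_of_n_gain(true_reward, proxy_reward, sample, n, trials, seed=0):
    """Average true-reward gain of Best-of-N selection over a random pick.

    `sample` is a hypothetical stand-in for drawing one response from a policy;
    it receives a random.Random instance and returns a candidate.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        candidates = [sample(rng) for _ in range(n)]
        picked = best_of_n(candidates, proxy_reward)
        baseline = rng.choice(candidates)
        total += true_reward(picked) - true_reward(baseline)
    return total / trials
```

In this toy setting, a proxy reward that tracks the true reward yields a clearly positive gain, while an uninformative proxy yields a gain near zero. That mirrors the qualitative pattern the abstract reports at 50% corruption; the sketch does not reproduce the paper's numbers.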