Mar 9, 2026 | arXiv:2603.08412

Aligning to Illusions: Choice Blindness in Human and AI Feedback

AI Summary

The paper tests whether the preferences used in Reinforcement Learning from Human Feedback (RLHF) are stable by studying choice blindness in human annotators and LLM judges. In preference-swapping experiments, 91% of swapped human preferences go undetected, and LLM judges miss more than 50% of swaps once their prior reasoning is removed from context. In reward-model experiments, corrupting one-sixth to one-third of labels halves the reward signal while pairwise accuracy stays virtually unchanged. At 50% corruption, Best-of-N selection is no better than random sampling, which shows that standard evaluation metrics can hide this degradation.

Key Contribution

Both human and AI feedback in RLHF are susceptible to "choice blindness": manipulated preferences often go unnoticed, which undermines the reliability of alignment signals.

Abstract

Reinforcement Learning from Human Feedback (RLHF) assumes annotator preferences reflect stable internal states. We challenge this through three experiments spanning the preference pipeline. In a human choice blindness study, 91% of surreptitiously swapped preferences go undetected, extending choice blindness to third-person evaluative comparison of unfamiliar text. Testing fifteen LLM judges as potential replacements, we find detection relies on shallow text matching rather than genuine self-monitoring: removing prior reasoning from context causes blindness to surge from near-zero to over 50%, while explicit social pressure induces near-universal compliance. In a dose-response experiment across two architectures from 86M to 2B parameters, one-sixth to one-third of labels must be corrupted before the reward signal halves, yet standard pairwise accuracy remains virtually unchanged. A Best-of-N evaluation confirms this translates to downstream policy degradation: at 50% corruption, reward-guided selection produces no improvement over random sampling, while the proxy model reports monotonically increasing scores. Together, these results reveal a preference construction problem: the signal entering RLHF is shaped by elicitation context in ways that neither human metacognition, LLM self-monitoring, nor standard evaluation metrics can detect.
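
The dose-response and Best-of-N setup described in the abstract can be illustrated with a minimal sketch. This is not the authors' code: the names `corrupt_labels`, `pairwise_accuracy`, and `best_of_n`, the `(response_a, response_b, label)` pair layout, and the choice to flip an exact fraction of labels are all assumptions made for illustration.

```python
import random
from typing import Callable, List, Sequence, Tuple

# Hypothetical pair layout: label 0 means response_a is preferred, 1 means response_b.
Pair = Tuple[str, str, int]


def corrupt_labels(pairs: Sequence[Pair], rate: float, seed: int = 0) -> List[Pair]:
    """Flip the preference label on a fixed fraction `rate` of the pairs."""
    if not 0.0 <= rate <= 1.0:
        raise ValueError("rate must be between 0 and 1")
    rng = random.Random(seed)
    n_flip = round(rate * len(pairs))
    flip_idx = set(rng.sample(range(len(pairs)), n_flip))
    return [
        (a, b, 1 - y) if i in flip_idx else (a, b, y)
        for i, (a, b, y) in enumerate(pairs)
    ]


def pairwise_accuracy(reward: Callable[[str], float], pairs: Sequence[Pair]) -> float:
    """Fraction of pairs where the reward function ranks the labeled winner higher."""
    correct = 0
    for a, b, y in pairs:
        predicted = 0 if reward(a) >= reward(b) else 1
        correct += int(predicted == y)
    return correct / len(pairs)


def best_of_n(reward: Callable[[str], float], candidates: Sequence[str]) -> str:
    """Return the candidate that the (proxy) reward function scores highest."""
    return max(candidates, key=reward)
```

In a study of this kind, a reward model would be trained on the output of `corrupt_labels` at several rates and then scored with both `pairwise_accuracy` on held-out pairs and the true quality of the `best_of_n` selections. The abstract's claim is that the first metric can stay flat while the second degrades.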

Constitutional AI & AI Ethics | Eval Frameworks & Benchmarks | RLHF & Preference Learning

Citation Metrics

Citations: 0

Influential citations: 0

References: 60

Year: 2026

Venue: N/A
