UofT · Apr 28, 2026 · arXiv:2604.25895

Three Models of RLHF Annotation: Extension, Evidence, and Authority

AI Summary

This paper identifies three distinct normative models underlying RLHF annotation: extension (annotators extend designer preferences), evidence (annotators provide independent evidence), and authority (annotators represent broader population preferences). It argues that RLHF pipelines often conflate these models, leading to failures in preference elicitation and aggregation. The paper recommends decomposing annotation into separable dimensions and tailoring pipelines to the most appropriate model for each dimension to improve RLHF effectiveness.

Key Contribution

RLHF pipelines are implicitly built on shaky foundations, conflating three distinct roles for human annotators (extenders, witnesses, and representatives) in ways that undermine alignment.

Abstract

Preference-based alignment methods, most prominently Reinforcement Learning with Human Feedback (RLHF), use the judgments of human annotators to shape large language model behaviour. However, the normative role of these judgments is rarely made explicit. I distinguish three conceptual models of that role. The first is extension: annotators extend the system designers' own judgments about what outputs should be. The second is evidence: annotators provide independent evidence about some facts, whether moral, social or otherwise. The third is authority: annotators have some independent authority (as representatives of the broader population) to determine system outputs. I argue that these models have implications for how RLHF pipelines should solicit, validate and aggregate annotations. I survey landmark papers in the literature on RLHF and related methods to illustrate how they implicitly draw on these models, describe failure modes that come from unintentionally or intentionally conflating them, and offer normative criteria for choosing among them. My central recommendation is that RLHF pipeline designers should decompose annotation into separable dimensions and tailor each pipeline to the model most appropriate for that dimension, rather than seeking a single unified pipeline.

Constitutional AI & AI Ethics · RLHF & Preference Learning

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
