KRAFTON · Jun 2, 2025 · arXiv:2506.01523

Alignment as Distribution Learning: Your Preference Model is Explicitly a Language Model

Jihun Yun, Juno Kim, Jongho Park, Junhyuck Kim, Jongha Jon Ryu, Jaewoong Cho, Kwang-Sung Jun

AI Summary

The paper reframes language model alignment as distribution learning from pairwise preference feedback. It argues that the standard RLHF objective, and DPO along with it, lacks theoretical justification and incentivizes degenerate, deterministic solutions. The authors propose three principled learning objectives: preference maximum likelihood estimation, preference distillation, and reverse KL minimization. All three are shown theoretically to converge to the target language model at a non-asymptotic rate of O(1/n). Empirically, the distribution learning framework, and preference distillation in particular, matches or outperforms RLHF and DPO across diverse tasks and models.

Key Contribution

Forget RLHF's quirks: aligning LLMs is fundamentally a distribution learning problem, and preference distillation offers a theoretically sound and empirically strong alternative.

Abstract

Alignment via reinforcement learning from human feedback (RLHF) has become the dominant paradigm for controlling the quality of outputs from large language models (LLMs). However, when viewed as "loss + regularization," the standard RLHF objective lacks theoretical justification and incentivizes degenerate, deterministic solutions, an issue that variants such as Direct Preference Optimization (DPO) also inherit. In this paper, we rethink alignment by framing it as distribution learning from pairwise preference feedback by explicitly modeling how information about the target language model bleeds through the preference data. This explicit modeling leads us to propose three principled learning objectives: preference maximum likelihood estimation, preference distillation, and reverse KL minimization. We theoretically show that all three approaches enjoy strong non-asymptotic $O(1/n)$ convergence to the target language model, naturally avoiding degeneracy and reward overfitting. Finally, we empirically demonstrate that our distribution learning framework, especially preference distillation, consistently outperforms or matches the performance of RLHF and DPO across various tasks and models.
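The degeneracy concern can be illustrated with a toy calculation. The sketch below is not taken from the paper. It assumes the textbook KL-regularized formulation of RLHF, in which the policy maximizes expected reward minus `beta` times the KL divergence to a reference policy, and uses the well-known closed-form optimum, which is proportional to `ref_probs * exp(rewards / beta)`. The three-response set, the reward values, and the function name `kl_regularized_optimum` are all hypothetical. As `beta` shrinks, the optimum collapses onto the single highest-reward response.

```python
import math


def kl_regularized_optimum(ref_probs, rewards, beta):
    """Closed-form maximizer of E_pi[r] - beta * KL(pi || pi_ref) on a finite response set.

    Assumption: this is the standard KL-regularized objective, used here only
    for illustration; it is not the paper's proposed objective.
    """
    # Work in log space and subtract the max logit for numerical stability.
    logits = [math.log(p) + r / beta for p, r in zip(ref_probs, rewards)]
    max_logit = max(logits)
    weights = [math.exp(logit - max_logit) for logit in logits]
    total = sum(weights)
    return [w / total for w in weights]


# Hypothetical toy setup: three candidate responses to one prompt.
ref_probs = [0.5, 0.3, 0.2]   # reference policy
rewards = [1.0, 1.2, 0.8]     # made-up reward values

for beta in (10.0, 1.0, 0.1, 0.01):
    policy = kl_regularized_optimum(ref_probs, rewards, beta)
    print(f"beta={beta}: {[round(p, 4) for p in policy]}")
```

For large `beta` the printed policy stays close to the reference distribution. For small `beta` nearly all probability mass moves to the second response, which has the highest reward: the deterministic behavior the abstract warns about when the regularization weight is driven toward zero.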

Natural Language Processing · RLHF & Preference Learning · Scalable Oversight & Alignment Theory

Citation Metrics

Citations: 2

Influential citations: 1

References: 71

Year: 2025

Venue: arXiv.org
