The paper investigates the impact of reward function shape on the performance of Direct Alignment Algorithms (DAAs) for aligning large language models (LLMs). It identifies that existing DAAs such as DPO and SimPO suffer from likelihood displacement as a consequence of their reward function shapes. To address this, the authors introduce AlphaPO, a new DAA that adds an $\alpha$-parameter controlling the shape of the reward function. Tuning $\alpha$ mitigates likelihood displacement and over-optimization, yielding improved alignment performance on Mistral-7B and Llama3-8B.
AlphaPO delivers 7-10% relative alignment gains over SimPO (and 15-50% over DPO) by showing that reward function *shape* is a surprisingly powerful lever in Direct Alignment Algorithms.
Reinforcement Learning with Human Feedback (RLHF) and its variants have made huge strides toward the effective alignment of large language models (LLMs) to follow instructions and reflect human values. More recently, Direct Alignment Algorithms (DAAs) have emerged in which the reward modeling stage of RLHF is skipped by characterizing the reward directly as a function of the policy being learned. Popular examples of DAAs include Direct Preference Optimization (DPO) and Simple Preference Optimization (SimPO). These methods often suffer from likelihood displacement, a phenomenon by which the probabilities of preferred responses are undesirably reduced. In this paper, we argue that, for DAAs, the reward (function) shape matters. We introduce \textbf{AlphaPO}, a new DAA method that leverages an $\alpha$-parameter to help change the shape of the reward function beyond the standard log reward. AlphaPO helps maintain fine-grained control over likelihood displacement and over-optimization. Compared to SimPO, one of the best performing DAAs, AlphaPO leads to about 7\% to 10\% relative improvement in alignment performance for the instruct versions of Mistral-7B and Llama3-8B, while achieving 15\% to 50\% relative improvement over DPO on the same models. The analysis and results presented highlight the importance of the reward shape and how one can systematically change it to affect training dynamics and improve alignment performance.
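To make the "reward shape" idea concrete, the sketch below shows one common way to generalize the log reward with an $\alpha$-parameter: a Box-Cox-style transform $(p^{\alpha}-1)/\alpha$, which recovers $\log p$ in the limit $\alpha \to 0$. This is an illustrative form chosen for this example, not necessarily the exact parameterization used in AlphaPO; the `pairwise_loss` below is likewise a generic SimPO-style margin loss on the shaped rewards, with `beta` and `gamma` as hypothetical scale/margin hyperparameters.

```python
import math


def alpha_reward(logp: float, alpha: float, beta: float = 1.0) -> float:
    """Illustrative alpha-shaped reward (Box-Cox-style transform).

    For alpha != 0:  beta * (p**alpha - 1) / alpha, where p = exp(logp).
    As alpha -> 0 this converges to the standard log reward beta * logp,
    so alpha smoothly interpolates away from the log shape.
    """
    if abs(alpha) < 1e-12:
        return beta * logp  # log-reward limit
    p = math.exp(logp)
    return beta * (p ** alpha - 1.0) / alpha


def pairwise_loss(
    logp_chosen: float,
    logp_rejected: float,
    alpha: float,
    beta: float = 1.0,
    gamma: float = 0.0,
) -> float:
    """SimPO-style pairwise loss on shaped rewards: -log sigmoid(r_w - r_l - gamma)."""
    margin = (
        alpha_reward(logp_chosen, alpha, beta)
        - alpha_reward(logp_rejected, alpha, beta)
        - gamma
    )
    # Numerically stable -log(sigmoid(margin)) = log(1 + exp(-margin)).
    return math.log1p(math.exp(-margin))
```

In a real DAA training loop, `logp_chosen`/`logp_rejected` would be (length-normalized) sequence log-probabilities under the policy being trained, and the choice of `alpha` changes how sharply the loss penalizes probability mass moving away from the preferred response.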