This paper identifies a "reward-generation gap" in Direct Alignment Algorithms (DAAs) stemming from a mismatch between the importance of prefix tokens during generation and their representation in the implicit reward functions of DAAs. To address this, the authors propose Prefix-Oriented Equal-length Training (POET), which truncates preferred and dispreferred responses to equal lengths during training, implicitly constraining optimization across all timesteps of the token-level MDP. Experiments with DPO and SimPO demonstrate that POET improves performance on AlpacaEval 2 and downstream tasks, highlighting the importance of aligning reward optimization with generation performance.
LLMs can be coaxed into better alignment with human preferences by simply truncating training responses to equal lengths, forcing the model to focus on the crucial prefix tokens often overlooked by standard Direct Alignment Algorithms.
Direct Alignment Algorithms (DAAs), such as Direct Preference Optimization (DPO) and Simple Preference Optimization (SimPO), have emerged as efficient alternatives to Reinforcement Learning from Human Feedback (RLHF) algorithms for aligning large language models (LLMs) with human preferences. However, DAAs suffer from a fundamental limitation we identify as the "reward-generation gap": a misalignment between optimization objectives during training and actual generation performance during inference. In this paper, we find that one contributor to the reward-generation gap is the mismatch between the inherent importance of prefix tokens during the LLM generation process and how this importance is reflected in the implicit reward functions of DAAs. To bridge the gap, we adopt a token-level MDP perspective of DAAs to analyze their limitations and introduce a simple yet effective approach called Prefix-Oriented Equal-length Training (POET), which truncates both the preferred and dispreferred responses to match the shorter one's length. When training with POET, where the two responses in each sample are truncated to equal length (resulting in diverse truncated lengths across samples), the optimization of the DAA objective is implicitly constrained to converge across all timesteps of the token-level MDP, thus paying more attention to prefix tokens than standard DAAs do. We conduct experiments with DPO and SimPO, two representative DAAs, demonstrating that POET improves over their standard implementations, achieving gains of up to 15.6 points on AlpacaEval 2 and overall improvements across downstream tasks. Our results highlight the importance of addressing the misalignment between reward optimization and generation performance in DAAs.
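As a rough illustration of the truncation step described above, the sketch below shows how a preference pair might be preprocessed before the standard DPO/SimPO loss is computed. The function name `poet_truncate` and the list-of-token-ids layout are assumptions for illustration, not the authors' released code.

```python
def poet_truncate(chosen_ids, rejected_ids):
    """Truncate a preference pair to the length of its shorter response.

    chosen_ids / rejected_ids: lists of response token ids (prompt tokens excluded).
    Returns the equal-length prefixes on which the DAA loss (e.g. DPO or SimPO)
    is then computed; the truncation length varies from sample to sample.
    """
    k = min(len(chosen_ids), len(rejected_ids))
    return chosen_ids[:k], rejected_ids[:k]


# Example: the longer (rejected) response is cut to the chosen response's length.
chosen, rejected = poet_truncate([11, 42, 7], [13, 9, 21, 5, 8])
assert chosen == [11, 42, 7] and rejected == [13, 9, 21]
```

Because each pair is cut to its own shorter response, truncated lengths differ across the dataset, which is what implicitly spreads the optimization pressure over early timesteps of the token-level MDP rather than concentrating it on full-length responses.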