The paper addresses the length bias issue in Direct Preference Optimization (DPO) by introducing a temporal decay factor that prioritizes earlier tokens during preference learning. The authors argue that earlier tokens are more crucial for alignment and that treating rewards uniformly across a sequence is suboptimal. The proposed method, Decay-enhanced DPO (D2PO), weights rewards by token position, mitigating overfitting and improving responsiveness to human preferences, which yields significant gains on the AlpacaEval and Arena-Hard benchmarks.
By recognizing that not all tokens are created equal, D2PO offers a simple temporal weighting fix that boosts DPO alignment scores by up to 9.7 points.
Direct Preference Optimization (DPO) has gained attention as an efficient alternative to reinforcement learning from human feedback (RLHF) for aligning large language models (LLMs) with human preferences. Despite its advantages, DPO suffers from a length bias, generating responses longer than those from the reference model. Existing solutions such as SimPO and SamPO address this issue but treat reward contributions uniformly across the sequence, overlooking temporal dynamics. To this end, we propose an enhanced preference optimization method that incorporates a temporal decay factor controlled by a gamma parameter. This dynamic weighting mechanism adjusts the influence of each reward based on its position in the sequence, prioritizing earlier tokens that are more critical for alignment. By adaptively focusing on more relevant feedback, our approach mitigates overfitting to less pertinent data and remains responsive to evolving human preferences. Experimental results show that our approach consistently outperforms vanilla DPO by 5.9-8.8 points on AlpacaEval 2 and 3.3-9.7 points on Arena-Hard across different model architectures and sizes. Furthermore, additional experiments on mathematical and reasoning benchmarks (MMLU, GSM8K, and MATH) confirm that our method enhances performance without compromising general capabilities. Our codebase will be available at \url{https://github.com/LotuSrc/D2PO}.
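To make the idea concrete, here is a minimal PyTorch sketch of a DPO-style loss with position-dependent decay weights. The exact weighting and normalization used by D2PO are not specified in this abstract, so the `gamma ** t` form, the normalization of weights (which also counteracts length bias), and all function and parameter names below are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def decay_weighted_dpo_loss(policy_chosen_logps, policy_rejected_logps,
                            ref_chosen_logps, ref_rejected_logps,
                            beta=0.1, gamma=0.98):
    """Hypothetical DPO-style loss with temporal decay weighting.

    Each argument is a 1-D tensor of per-token log-probabilities for one
    response. Tokens at position t get weight gamma**t, so earlier tokens
    (argued to matter more for alignment) dominate the preference signal.
    This is a sketch of the general idea, not the paper's exact objective.
    """
    def weighted_reward(policy_logps, ref_logps):
        t = torch.arange(policy_logps.shape[-1], dtype=policy_logps.dtype)
        w = gamma ** t          # decay: weight 1.0 at t=0, shrinking after
        w = w / w.sum()         # normalize so total weight is length-invariant
        return (w * (policy_logps - ref_logps)).sum(-1)

    # Standard DPO margin between chosen and rejected responses,
    # but with decay-weighted implicit rewards instead of plain sums.
    margin = (weighted_reward(policy_chosen_logps, ref_chosen_logps)
              - weighted_reward(policy_rejected_logps, ref_rejected_logps))
    return -F.logsigmoid(beta * margin)
```

Setting `gamma = 1.0` recovers a uniform (length-normalized) weighting, so the decay factor can be read as an interpolation knob between standard sequence-level rewards and a strong early-token focus.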