Apr 22, 2025 | arXiv:2504.15843

Pre-DPO: Improving Data Utilization in Direct Preference Optimization Using a Guiding Reference Model

Junshu Pan, Wei Shen, Shulin Huang, Qiji Zhou, Yue Zhang

AI Summary

The paper introduces Pre-DPO, a novel training paradigm built upon Direct Preference Optimization (DPO), designed to improve data utilization by employing a guiding reference model. This guiding reference model anticipates the optimal policy state based on preference data, enabling adaptive weighting of training samples to prioritize those most beneficial for model improvement. Experiments on AlpacaEval 2.0 and Arena-Hard v0.1 demonstrate that Pre-DPO consistently enhances the performance of DPO and SimPO without requiring external models or data.

Key Contribution

Replacing DPO's usual reference model, which is initialized identically to the policy, with a guiding reference model lets training weight preference samples adaptively and improves on both standard DPO and SimPO.

Abstract

Direct Preference Optimization (DPO) simplifies reinforcement learning from human feedback (RLHF) for large language models (LLMs) by directly optimizing human preferences without an explicit reward model. We find that during DPO training, the reference model plays the role of a data weight adjuster. However, the common practice of initializing the policy and reference models identically in DPO can lead to inefficient data utilization and impose a performance ceiling. Meanwhile, the lack of a reference model in Simple Preference Optimization (SimPO) reduces training robustness and necessitates stricter conditions to prevent catastrophic forgetting. In this work, we propose Pre-DPO, a simple yet effective DPO-based training paradigm that enhances preference optimization performance by leveraging a guiding reference model. This reference model provides foresight into the optimal policy state achievable through the training preference data, serving as a guiding mechanism that adaptively assigns higher weights to samples more suitable for the model and lower weights to those less suitable. Extensive experiments on AlpacaEval 2.0 and Arena-Hard v0.1 benchmarks demonstrate that Pre-DPO consistently improves the performance of both DPO and SimPO, without relying on external models or additional data.
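The claim that the reference model acts as a data weight adjuster can be illustrated with the standard DPO objective. The sketch below is not the authors' code: it computes the per-pair DPO loss and the scalar that multiplies the gradient for that pair, which (up to the constant beta) is the sigmoid of the negated implicit reward margin. The function names, the value of beta, and all log-probabilities are made up for illustration, and how Pre-DPO actually obtains its guiding reference model is not described in this listing.

```python
import math


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def dpo_margin(pol_w, pol_l, ref_w, ref_l, beta=0.1):
    """Implicit reward margin of one preference pair.

    pol_* are policy log-probs, ref_* are reference log-probs;
    *_w is the chosen response, *_l the rejected one.
    """
    return beta * ((pol_w - ref_w) - (pol_l - ref_l))


def dpo_loss(pol_w, pol_l, ref_w, ref_l, beta=0.1):
    """Standard DPO loss for one pair: -log sigmoid(margin)."""
    return -math.log(sigmoid(dpo_margin(pol_w, pol_l, ref_w, ref_l, beta)))


def dpo_sample_weight(pol_w, pol_l, ref_w, ref_l, beta=0.1):
    """Scalar multiplying this pair's gradient (up to beta): sigmoid(-margin)."""
    return sigmoid(-dpo_margin(pol_w, pol_l, ref_w, ref_l, beta))


# Hypothetical policy log-probs for one (chosen, rejected) pair.
pol_w, pol_l = -12.0, -15.0

# Reference identical to the policy: the margin is 0, so the weight is 0.5.
w_identical = dpo_sample_weight(pol_w, pol_l, ref_w=pol_w, ref_l=pol_l)

# Hypothetical reference that prefers the chosen response more strongly
# than the policy does: the weight rises above 0.5.
w_agrees = dpo_sample_weight(pol_w, pol_l, ref_w=-9.0, ref_l=-18.0)

# Hypothetical reference that prefers the rejected response:
# the weight falls below 0.5.
w_disagrees = dpo_sample_weight(pol_w, pol_l, ref_w=-16.0, ref_l=-13.0)

print(w_identical, w_agrees, w_disagrees)
```

When the reference equals the policy, every pair starts with the same weight of 0.5 regardless of its content, which is one way to read the abstract's point about inefficient data utilization under identical initialization. A reference that differs from the policy spreads the weights across pairs from the first step.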

Data Curation & Synthetic Data | RLHF & Preference Learning | Training Efficiency & Optimization

Citation Metrics

Citations: 6

Influential citations: 1

References: 41

Year: 2025

Venue: arXiv.org
