Jun 10, 2025 · arXiv:2506.08681

Mitigating Reward Over-optimization in Direct Alignment Algorithms with Importance Sampling

Phuc Minh Nguyen, Ngoc-Hieu Nguyen, D. M. Nguyen, Anji Liu, An Mai, Binh T. Nguyen, Daniel Sonntag, Khoa D. Doan

AI Summary

The paper addresses the over-optimization problem in Direct Alignment Algorithms (DAAs) such as DPO, in which the policy drifts away from the reference policy, by introducing an importance-sampling approach (IS-DAAs). IS-DAAs re-weight the DAA objective by an importance ratio between the current and reference policies, clipped at a maximum value to reduce variance. Experiments show that IS-DAAs mitigate over-optimization, particularly under low regularization strength, and outperform other methods designed to address this problem.

Key Contribution

Direct Preference Optimization (DPO) can be rescued from performance collapse with a simple importance sampling fix, especially when regularization is weak.

Abstract

Direct Alignment Algorithms (DAAs) such as Direct Preference Optimization (DPO) have emerged as alternatives to the standard Reinforcement Learning from Human Feedback (RLHF) for aligning large language models (LLMs) with human values. However, these methods are more susceptible to over-optimization, in which the model drifts away from the reference policy, leading to degraded performance as training progresses. This paper proposes a novel importance-sampling approach to mitigate the over-optimization problem of offline DAAs. This approach, called IS-DAAs, multiplies the DAA objective by an importance ratio that accounts for the reference policy distribution. IS-DAAs additionally avoid the high-variance issue associated with importance sampling by clipping the importance ratio to a maximum value. Our extensive experiments demonstrate that IS-DAAs can effectively mitigate over-optimization, especially under low regularization strength, and achieve better performance than other methods designed to address this problem. Our implementations are provided publicly at this link.
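
The re-weighting described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. It assumes the DAA objective is the standard DPO loss, that the importance ratio is the product of the policy-to-reference probability ratios of the chosen and rejected responses, and that the ratio is treated as a constant weight. The function name `is_dpo_loss` and the parameters `beta` and `clip_max` (with their default values) are hypothetical.

```python
import numpy as np

def log_sigmoid(z):
    # Numerically stable log(sigmoid(z)) = -log(1 + exp(-z)).
    return -np.logaddexp(0.0, -z)

def is_dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1, clip_max=1.0):
    """Sketch of a DPO loss re-weighted by a clipped importance ratio.

    Each argument is an array of sequence log-probabilities:
    logp_* under the current policy, ref_logp_* under the reference policy,
    for the chosen (w) and rejected (l) responses of each preference pair.
    """
    logp_w, logp_l, ref_logp_w, ref_logp_l = (
        np.asarray(a, dtype=float)
        for a in (logp_w, logp_l, ref_logp_w, ref_logp_l)
    )

    # Standard DPO term: negative log-sigmoid of the scaled log-ratio margin.
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    per_pair_loss = -log_sigmoid(margin)

    # ASSUMPTION: the importance ratio is pi_theta / pi_ref evaluated on both
    # responses of the pair, computed in log space for numerical stability.
    log_ratio = (logp_w - ref_logp_w) + (logp_l - ref_logp_l)

    # Clip the ratio to a maximum value to limit variance.
    weight = np.minimum(np.exp(log_ratio), clip_max)

    # In an autodiff framework the weight would typically be detached from
    # the gradient (an assumption here; NumPy has no gradients to stop).
    return float(np.mean(weight * per_pair_loss)), weight

# Example: when the policy equals the reference, every weight is 1 and the
# loss equals the unweighted DPO loss at zero margin, log(2).
loss, w = is_dpo_loss([-10.0], [-12.0], [-10.0], [-12.0])
```

Under these assumptions, pairs whose combined probability under the current policy has dropped relative to the reference receive a weight below 1, while the cap at `clip_max` keeps any single pair from dominating the batch.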

RLHF & Preference Learning · Scalable Oversight & Alignment Theory · Training Efficiency & Optimization

Citation Metrics

Citations: 0

Influential citations: 0

References: 37

Year: 2025

Venue: arXiv.org
