This paper introduces Policy-labeled Preference Learning (PPL), a novel RLHF method that addresses the likelihood mismatch problem by modeling human preferences with regret, which incorporates information about the behavior policy. By scoring preferences with regret, PPL avoids the assumption that trajectories are generated by an optimal policy, an assumption that is often inaccurate in standard RLHF. The authors show that PPL, augmented with a contrastive KL regularization derived from regret principles, significantly improves offline RLHF performance and remains effective in online settings on high-dimensional continuous control tasks.
Correcting for suboptimal behavior during preference learning unlocks substantial gains in offline RLHF and improves online performance in continuous control tasks.
To design rewards that align with human goals, Reinforcement Learning from Human Feedback (RLHF) has emerged as a prominent technique for learning reward functions from human preferences and optimizing policies via reinforcement learning algorithms. However, existing RLHF methods often misinterpret trajectories as being generated by an optimal policy, causing inaccurate likelihood estimation and suboptimal learning. Inspired by the Direct Preference Optimization (DPO) framework, which directly learns an optimal policy without an explicit reward, we propose Policy-labeled Preference Learning (PPL), which resolves the likelihood mismatch by modeling human preferences with regret, reflecting information about the behavior policy. We also introduce a contrastive KL regularization, derived from regret-based principles, to enhance RLHF in sequential decision-making. Experiments on high-dimensional continuous control tasks demonstrate PPL's significant improvements in offline RLHF performance and its effectiveness in online settings.
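To make the regret-based preference model concrete, here is a minimal sketch of a Bradley-Terry preference likelihood scored by regret rather than by summed reward. This is an illustrative reconstruction, not the paper's implementation: the function names, the scalar `beta` temperature, and the use of per-segment scalar regrets are all assumptions made for this example.

```python
import math

def regret_preference_prob(regret_a, regret_b, beta=1.0):
    """Bradley-Terry probability that segment A is preferred over segment B
    when segments are scored by regret: lower regret -> more likely preferred.
    `beta` is an illustrative inverse-temperature, not the paper's notation."""
    return 1.0 / (1.0 + math.exp(-beta * (regret_b - regret_a)))

def preference_nll(pairs, beta=1.0):
    """Average negative log-likelihood of labeled preferences.
    Each pair is (regret_of_preferred, regret_of_rejected)."""
    total = 0.0
    for r_pref, r_rej in pairs:
        total -= math.log(regret_preference_prob(r_pref, r_rej, beta))
    return total / len(pairs)
```

The contrast with standard reward-sum preference models is the key point: because regret is measured relative to the policy that actually generated the segment, two segments with equal return but different behavior policies receive different scores, which is what lets PPL-style methods sidestep the optimal-policy assumption.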