This paper investigates the impact of preference data quality on reward model training for Reinforcement Learning from Human Feedback (RLHF) in Large Language Models (LLMs). The authors demonstrate, both theoretically and empirically, that preference data with larger and more consistent reward differences leads to lower Mean Square Error (MSE) between the expected and empirical loss of reward models. They introduce "Preference Difference" (PD) as a metric for filtering preference data, showing that reward models trained on PD-filtered data achieve higher calibrated accuracy and improved RLHF alignment performance, and they extend these findings to Direct Preference Optimization (DPO).
Stop wasting compute on noisy preference data: filtering your RLHF datasets by "Preference Difference" boosts reward model accuracy and alignment performance.
Reinforcement Learning from Human Feedback (RLHF) is a commonly used alignment method for Large Language Models (LLMs). This method relies on a reward model trained on a preference dataset to provide scalar rewards. However, human-annotated preference data is often sparse, noisy, and costly to obtain, necessitating more efficient utilization. This paper proposes a new metric for better preference data utilization, motivated from both theoretical and empirical perspectives. Starting with the Bradley-Terry model, we compute the Mean Square Error (MSE) between the expected loss and empirical loss of the reward model. Our findings reveal that data with larger and more consistent reward differences result in lower MSE. We therefore propose the Preference Difference (PD), the reward difference between two samples, as a filter for preference data. Experimental results on three open-source models show that reward models trained on PD-filtered data achieve higher calibrated accuracy, as well as better RLHF alignment performance. The conclusion remains consistent when we extend the experiments and theoretical derivations to implicit reward alignment algorithms, such as Direct Preference Optimization (DPO).
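The filtering step described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: `reward` here is a toy stand-in for a trained reward model's scalar output, and the function and threshold names are assumptions chosen for clarity.

```python
# Hypothetical sketch of Preference Difference (PD) filtering.
# reward() stands in for a scalar reward model; here it is a toy
# scorer (response length) purely so the example runs end to end.
def reward(text: str) -> float:
    return float(len(text))

def pd_filter(pairs, threshold):
    """Keep preference pairs whose reward difference (PD) meets threshold.

    pairs: list of (chosen, rejected) response strings.
    PD = reward(chosen) - reward(rejected); larger, more consistent
    differences are the data the paper argues yields lower MSE.
    """
    kept = []
    for chosen, rejected in pairs:
        pd = reward(chosen) - reward(rejected)
        if pd >= threshold:
            kept.append((chosen, rejected))
    return kept

pairs = [
    ("a detailed, helpful answer", "ok"),      # large PD, kept
    ("short", "also short"),                   # small/negative PD, dropped
]
print(len(pd_filter(pairs, threshold=5.0)))    # → 1
```

In practice the scores would come from an existing reward model (or, for the DPO extension, from the implicit reward), and the threshold would be tuned so that enough training pairs survive filtering.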