This paper investigates the impact of preference data quality on reward model training for Reinforcement Learning from Human Feedback (RLHF) in Large Language Models (LLMs). The authors demonstrate, both theoretically and empirically, that preference data with larger and more consistent reward differences leads to lower Mean Square Error (MSE) between the expected and empirical loss of reward models. They introduce "Preference Difference" (PD) as a metric for filtering preference data, showing that reward models trained on PD-filtered data achieve higher calibrated accuracy and improved RLHF alignment performance, and they extend these findings to Direct Preference Optimization (DPO).
Stop wasting compute on noisy preference data: filtering your RLHF datasets by "Preference Difference" boosts reward model accuracy and alignment performance.
Reinforcement Learning from Human Feedback (RLHF) is a commonly used alignment method for Large Language Models (LLMs). This method relies on a reward model trained on a preference dataset to provide scalar rewards. However, human-annotated preference data is often sparse, noisy, and costly to obtain, necessitating more efficient utilization. This paper proposes a new metric for better preference data utilization, motivated from both theoretical and empirical perspectives. Starting with the Bradley-Terry model, we compute the Mean Square Error (MSE) between the expected loss and empirical loss of the reward model. Our findings reveal that data with larger and more consistent reward differences result in lower MSE. We therefore propose the Preference Difference (PD), the reward difference between two samples, as a filter for preference data. Experimental results on three open-source models show that reward models trained on PD-filtered data achieve higher calibrated accuracy, as well as better RLHF alignment performance. The conclusion remains consistent when we extend the experiments and theoretical derivations to implicit reward alignment algorithms, such as Direct Preference Optimization (DPO).
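The filtering step described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: `reward` here is a toy stand-in for a trained reward model's scalar output, and the function and threshold names are assumptions chosen for clarity.

```python
# Hypothetical sketch of Preference Difference (PD) filtering.
# reward() stands in for a scalar reward model; here it is a toy
# scorer (response length) purely so the example runs end to end.
def reward(text: str) -> float:
    return float(len(text))

def pd_filter(pairs, threshold):
    """Keep preference pairs whose reward difference (PD) meets threshold.

    pairs: list of (chosen, rejected) response strings.
    PD = reward(chosen) - reward(rejected); larger, more consistent
    differences are the data the paper argues yields lower MSE.
    """
    kept = []
    for chosen, rejected in pairs:
        pd = reward(chosen) - reward(rejected)
        if pd >= threshold:
            kept.append((chosen, rejected))
    return kept

pairs = [
    ("a detailed, helpful answer", "ok"),      # large PD, kept
    ("short", "also short"),                   # small/negative PD, dropped
]
print(len(pd_filter(pairs, threshold=5.0)))    # → 1
```

In practice the scores would come from an existing reward model (or, for the DPO extension, from the implicit reward), and the threshold would be tuned so that enough training pairs survive filtering.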