The paper introduces Sample Importance Weight-based Human Preference Optimization (IWPO), a novel offline learning framework for LLMs that addresses limitations in existing Direct Preference Optimization (DPO) methods. IWPO incorporates Importance Resampling-based Preferred Sample Sampling (IRPSS) to better approximate standard Maximum Likelihood Estimation (MLE) and a Multi-Cluster-based Reward Model (MCRM) to mitigate reward hacking caused by overlapping preference levels. Experiments demonstrate that IWPO, built on standard DPO, improves performance on the AlpacaEval and MT-Bench benchmarks when applied to the Mistral-7B model.
IWPO tackles reward hacking and suboptimal policy distribution in DPO by weighting samples based on their adherence to the optimal policy, leading to significant gains in LLM performance.
Direct preference optimization (DPO) methods for Large Language Models (LLMs) have emerged as an efficient alternative to Reinforcement Learning from Human Feedback (RLHF), owing to their lightweight training pipeline and lower computational cost. However, these methods still face significant challenges in enhancing language models' capacity for generalized text generation. First, offline approaches remove the reward model (RM) and learn directly from preferences; while this simplifies training, it also eliminates an effective check on whether samples follow the optimal policy distribution, thereby deviating from standard Maximum Likelihood Estimation (MLE). Second, the preference levels across tasks or attributes often overlap in mixed datasets, which can induce catastrophic reward hacking. To address these issues, we introduce a new offline learning framework that explicitly accounts for how strongly each sample conforms to the optimal policy. We propose an Importance Resampling-based Preferred Sample Sampling (IRPSS) algorithm to recover standard MLE for estimating the optimal policy, and introduce a Multi-Cluster-based Reward Model (MCRM) that leverages feature clustering to mitigate reward distribution overlap during training. Combining these components, we present a Sample Importance Weight-based Human Preference Optimization (IWPO) method, which emphasizes the importance of target samples and can be plugged into mainstream preference optimization methods. Built on standard DPO, we evaluate IWPO across multiple base models and datasets. On the Mistral-7B model, IWPO improves the AlpacaEval win rate against GPT-4 Turbo and text-davinci-003 by 6.3% and the MT-Bench multi-turn score by 5.1%.
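To make the sample-weighting idea concrete, the following is a minimal sketch, not the paper's exact formulation: the function names, toy log-probabilities, and the way weights are obtained are all illustrative assumptions. It shows how a per-sample importance weight (intended to reflect how strongly a preferred sample conforms to the estimated optimal policy) could scale the standard DPO loss for each preference pair.

```python
import math

def dpo_loss(logp_w_policy, logp_l_policy, logp_w_ref, logp_l_ref, beta=0.1):
    """Standard DPO loss for one (preferred, dispreferred) pair:
    -log sigmoid(beta * (policy log-ratio of winner minus loser))."""
    margin = beta * ((logp_w_policy - logp_w_ref) - (logp_l_policy - logp_l_ref))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

def weighted_preference_loss(pairs, weights, beta=0.1):
    """Hypothetical importance-weighted variant: each pair's DPO loss is
    scaled by a nonnegative importance weight, then normalized.
    `pairs` is a list of (logp_w_policy, logp_l_policy, logp_w_ref, logp_l_ref)
    tuples; `weights` are the per-sample importance weights."""
    total = sum(w * dpo_loss(*p, beta=beta) for p, w in zip(pairs, weights))
    return total / sum(weights)

# Toy example: two preference pairs with opposite policy log-ratio margins.
pairs = [(-1.0, -2.0, -1.5, -1.5),   # policy already prefers the winner
         (-2.0, -1.0, -1.5, -1.5)]   # policy prefers the loser
uniform = weighted_preference_loss(pairs, [1.0, 1.0])
# Up-weighting the first (better-conforming) pair lowers the average loss.
upweighted = weighted_preference_loss(pairs, [2.0, 1.0])
```

Under uniform weights this reduces exactly to the per-pair mean of the standard DPO loss; the weighting only changes which samples dominate the gradient, which is the general mechanism the abstract describes.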