NJU | UCF | Apr 27, 2026 | arXiv:2604.24952

Learning from Noisy Preferences: A Semi-Supervised Learning Approach to Direct Preference Optimization

Xinxing Liu, Xinxin Liu, Ming Li, Zonglin Lyu, Yuzhang Shang, Chen Chen

AI Summary

The paper shows that single-value preference labels in DPO datasets create conflicting gradient signals, because human preferences are multi-dimensional. To mitigate this, the authors introduce Semi-DPO, a semi-supervised approach that separates preference pairs into a clean (consistent) set and a noisy (conflicting) set. Semi-DPO first trains on the clean set, then uses the resulting model to generate pseudo-labels for the noisy set, iteratively refining the model's alignment with complex human preferences.
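The clean/noisy split described above can be illustrated with a minimal sketch. It assumes each image already has per-dimension scores and that a pair counts as "clean" only when the labeled winner is better on every dimension. The dimension names, the scores, and the `split_by_consensus` helper are hypothetical; the paper's actual consensus filter may differ.

```python
from typing import Dict, List, Tuple

# One preference pair: per-dimension scores for the labeled winner and loser.
# The dimension names and score values are hypothetical placeholders.
Pair = Dict[str, Dict[str, float]]

def split_by_consensus(pairs: List[Pair]) -> Tuple[List[Pair], List[Pair]]:
    """Split pairs into (clean, noisy) by whether every dimension agrees with the label."""
    clean, noisy = [], []
    for pair in pairs:
        winner, loser = pair["winner"], pair["loser"]
        # Assumed rule: the label is consistent only if the winner scores
        # higher than the loser on every dimension.
        consistent = all(winner[dim] > loser[dim] for dim in winner)
        (clean if consistent else noisy).append(pair)
    return clean, noisy

pairs = [
    # Winner is better on all three dimensions: consistent label.
    {"winner": {"aesthetics": 0.9, "detail": 0.8, "alignment": 0.7},
     "loser":  {"aesthetics": 0.4, "detail": 0.5, "alignment": 0.6}},
    # Winner is worse on detail: the single label hides a conflict.
    {"winner": {"aesthetics": 0.9, "detail": 0.3, "alignment": 0.7},
     "loser":  {"aesthetics": 0.4, "detail": 0.8, "alignment": 0.6}},
]
clean, noisy = split_by_consensus(pairs)
print(len(clean), len(noisy))
```

The second pair shows the kind of case the summary calls conflicting: the labeled winner is deficient in one dimension, so a single binary label cannot describe it faithfully.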

Key Contribution

Compressing multi-dimensional human preferences into single binary labels severely degrades DPO training, but a semi-supervised approach can recover state-of-the-art performance without additional human annotation.

Abstract

Human visual preferences are inherently multi-dimensional, encompassing aesthetics, detail fidelity, and semantic alignment. However, existing datasets provide only single, holistic annotations, resulting in severe label noise: images that excel in some dimensions but are deficient in others are simply marked as winner or loser. We theoretically demonstrate that compressing multi-dimensional preferences into binary labels generates conflicting gradient signals that misguide Diffusion Direct Preference Optimization (DPO). To address this, we propose Semi-DPO, a semi-supervised approach that treats consistent pairs as clean labeled data and conflicting ones as noisy unlabeled data. Our method starts by training on a consensus-filtered clean subset, then uses this model as an implicit classifier to generate pseudo-labels for the noisy set for iterative refinement. Experimental results demonstrate that Semi-DPO achieves state-of-the-art performance and significantly improves alignment with complex human preferences, without requiring additional human annotation or explicit reward models during training. We will release our code and models at: https://github.com/L-CodingSpace/semi-dpo
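The "implicit classifier" step can be sketched as follows, assuming the standard DPO implicit reward, beta * (log pi_theta(y|x) - log pi_ref(y|x)), and a Bradley-Terry comparison of the two candidates. In the diffusion setting the log-likelihoods would typically be approximated from denoising errors; plain floats stand in for them here. The confidence `threshold`, the function names, and the rule of leaving low-confidence pairs unlabeled are assumptions for illustration, not details taken from the paper.

```python
import math
from typing import Optional

def implicit_reward(logp_policy: float, logp_ref: float, beta: float = 1.0) -> float:
    """Standard DPO implicit reward: beta * (log pi_theta(y|x) - log pi_ref(y|x))."""
    return beta * (logp_policy - logp_ref)

def pseudo_label(reward_a: float, reward_b: float, threshold: float = 0.7) -> Optional[str]:
    """Return 'a' or 'b' when the model is confident, else None (pair stays unlabeled)."""
    # Bradley-Terry probability that candidate a is preferred over candidate b.
    p_a = 1.0 / (1.0 + math.exp(-(reward_a - reward_b)))
    if p_a >= threshold:
        return "a"
    if 1.0 - p_a >= threshold:
        return "b"
    # Assumed gating rule: skip pairs where neither side is clearly preferred.
    return None

# Hypothetical log-likelihoods for two candidate images of one noisy pair.
r_a = implicit_reward(logp_policy=-10.0, logp_ref=-12.0)
r_b = implicit_reward(logp_policy=-11.0, logp_ref=-11.0)
label = pseudo_label(r_a, r_b)
print(label)
```

Pairs that receive a pseudo-label could then be added to the training set for the next refinement round, while unlabeled pairs wait for a later, better-trained model.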

Computer Vision | RLHF & Preference Learning | Training Efficiency & Optimization

Citation Metrics

Citations: 1

Influential citations: 0

References: 47

Year: 2026

Venue: N/A
