CUHK · Westlake · Jun 1, 2026 · arXiv:2606.02521

Drifting Preference Optimization for One-Step Generative Models

AI Summary

This paper introduces Drifting Preference Optimization (DrPO), an online preference-finetuning method for deterministic one-step text-to-image generators. DrPO combines a non-parametric dipole preference field with a reference drift estimated from a frozen base generator to synthesize feature-space updates, so it needs no reward-gradient backpropagation. On SD-Turbo and SDXL-Turbo, it improves alignment over reward-gradient-free one-step preference baselines and reduces HPSv3 training computation by 3.51x under a matched effective-batch setting.

Key Contribution

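DrPO improves the alignment of one-step generators over reward-gradient-free preference baselines while using the target reward only for ranking. This removes reward-model backpropagation and reduces HPSv3 training computation by 3.51x in the matched effective-batch setting.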

Abstract

One-step text-to-image generators are attractive for deployment because they generate an image with a single forward pass, but preference finetuning them remains difficult: standard alignment methods often rely on policy likelihoods, denoising trajectories, differentiable reward gradients, or test-time optimization. We propose Drifting Preference Optimization (DrPO), an online preference-finetuning method for deterministic one-step generators. For each prompt, DrPO samples candidates from the current generator, ranks them with a target reward, and uses high- and low-scoring samples to synthesize a feature-space update direction. The update is a non-parametric dipole preference field plus a reference drift estimated from the frozen base generator, and is optimized through a detached feature-space regression target. The target reward is used only for ranking, so DrPO can train with large, black-box, or non-differentiable rewards while inference remains a single generator call. We evaluate DrPO on SD-Turbo and SDXL-Turbo with multiple target rewards and benchmarks, including HPSv3 and GenEval. DrPO improves alignment over reward-gradient-free one-step preference baselines and reduces HPSv3 training computation by $3.51\times$ under the matched effective-batch setting by removing reward-model backpropagation. Initial offline experiments suggest that sample-based gradient synthesis can also be used beyond online reward ranking.
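The update described in the abstract can be illustrated with a toy sketch. This is not the paper's implementation: the kernel form, the top-k/bottom-k split, the names `kernel_mean_shift`, `drift_target`, `regression_loss`, and the parameters `top_k`, `tau`, and `lam` are all assumptions made for illustration. The sketch assumes the dipole field attracts a sample's feature toward high-reward candidates and repels it from low-reward ones, and that the reference drift pulls it toward features of samples from the frozen base generator.

```python
import math

def kernel_mean_shift(x, points, tau):
    """Kernel-weighted mean of (p - x) over points (assumed softmax kernel)."""
    d2 = [sum((pi - xi) ** 2 for pi, xi in zip(p, x)) for p in points]
    m = min(d2)  # subtract the minimum distance for numerical stability
    w = [math.exp(-(d - m) / tau) for d in d2]
    z = sum(w)
    return [sum(wi * (p[k] - x[k]) for wi, p in zip(w, points)) / z
            for k in range(len(x))]

def drift_target(x, feats, rewards, ref_feats, top_k=2, tau=1.0, lam=0.5):
    """Build a regression target x + V(x) for one candidate feature x.

    The reward is used only to rank candidates; no reward gradient is taken.
    """
    order = sorted(range(len(feats)), key=lambda i: rewards[i], reverse=True)
    pos = [feats[i] for i in order[:top_k]]    # high-scoring candidates
    neg = [feats[i] for i in order[-top_k:]]   # low-scoring candidates
    attract = kernel_mean_shift(x, pos, tau)
    repel = kernel_mean_shift(x, neg, tau)
    ref = kernel_mean_shift(x, ref_feats, tau)  # assumed form of the reference drift
    field = [a - r + lam * f for a, r, f in zip(attract, repel, ref)]
    # In an autodiff framework this target would be detached (held constant).
    return [xi + vi for xi, vi in zip(x, field)]

def regression_loss(x, target):
    """Mean squared error between the current feature and the fixed target."""
    return sum((xi - ti) ** 2 for xi, ti in zip(x, target)) / len(x)

# Toy example: four 2-D candidate features for one prompt.
feats = [[0.0, 0.0], [1.0, 0.0], [-1.0, 0.0], [2.0, 0.0]]
rewards = [0.1, 0.8, -0.5, 0.9]   # black-box scores, used for ranking only
ref_feats = [[0.0, 0.0]]          # features from the frozen base generator
target = drift_target(feats[0], feats, rewards, ref_feats)
loss = regression_loss(feats[0], target)
```

In this toy setup the target for the first candidate moves along the positive first axis, toward the two highest-scoring candidates and away from the lowest-scoring one. In an actual training loop, the loss would be backpropagated through the generator's features only, with the target treated as a constant.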

Multimodal Models · RLHF & Preference Learning
