May 6, 2026

arXiv:2605.04494

Towards General Preference Alignment: Diffusion Models at Nash Equilibrium

Jiaming Hu, Jiamu Bai, Haoyu Wang, Debarghya Mukherjee, I. Paschalidis

AI Summary

This paper introduces Diffusion Nash Preference Optimization (Diff.-NPO), a game-theoretic approach to aligning text-to-image diffusion models with human preferences that does not rely on reward-induced preference signals or the Bradley-Terry model. Diff.-NPO has the diffusion model compete against itself, which drives self-improvement and better alignment with complex human preferences. Experiments show that Diff.-NPO outperforms existing preference-based diffusion alignment methods across various text-to-image generation metrics.

Key Contribution

Ditch the Bradley-Terry model: a game-theoretic approach to diffusion alignment unlocks better text-to-image generation by directly optimizing for Nash equilibrium in human preferences.

Abstract

Reinforcement learning from human feedback (RLHF) is a popular approach for aligning text-to-image (T2I) diffusion models with human preferences. As a mainstream branch of RLHF, Direct Preference Optimization (DPO) offers a computationally efficient alternative that avoids explicit reward modeling and has been widely adopted in diffusion alignment. However, existing preference-based methods for diffusion alignment still rely on reward-induced preference signals and typically assume that human preferences can be adequately modeled by the Bradley-Terry (BT) model, which may fail to capture the full complexity of human preferences. In this paper, we formulate diffusion alignment from a game-theoretic perspective. We propose Diffusion Nash Preference Optimization (Diff.-NPO), an intuitive general preference framework for diffusion alignment. Diff.-NPO encourages the current policy to play against itself, achieving self-improvement and better alignment. Empirically, we demonstrate the effectiveness of Diff.-NPO on the text-to-image generation task across various metrics. Diff.-NPO consistently outperforms existing preference-based diffusion alignment methods.
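The Bradley-Terry limitation mentioned in the abstract can be illustrated with a small sketch. This is not the paper's algorithm: the preference table `PREF` and the helper names (`bt_prob`, `bt_cycle_gap`, `max_advantage`) are hypothetical and chosen only for illustration. It relies on two standard facts: under BT, the log-odds around any preference cycle sum to zero, so a cyclic preference such as A over B, B over C, C over A cannot be fit by scalar rewards; and in the symmetric two-player game defined by win rates, a policy is a Nash equilibrium when no single alternative beats it more than half the time.

```python
import math


def bt_prob(r_a, r_b):
    """Bradley-Terry probability that item a is preferred to item b."""
    return 1.0 / (1.0 + math.exp(-(r_a - r_b)))


def logit(p):
    """Log-odds of a probability p in (0, 1)."""
    return math.log(p / (1.0 - p))


def bt_cycle_gap(p_ab, p_bc, p_ca):
    """Sum of log-odds around the cycle a > b > c > a.

    Under BT each log-odds equals a reward difference, so the sum
    telescopes to exactly 0. A nonzero value means no BT rewards exist.
    """
    return logit(p_ab) + logit(p_bc) + logit(p_ca)


def max_advantage(pref, policy):
    """Largest margin over 0.5 that any single item wins against a mixed policy.

    pref[i][j] is the probability that item i is preferred to item j.
    A value <= 0 means the policy cannot be exploited (Nash condition
    for this symmetric zero-sum game).
    """
    n = len(pref)
    return max(
        sum(policy[j] * (pref[i][j] - 0.5) for j in range(n))
        for i in range(n)
    )


# Hypothetical cyclic preferences over three images A, B, C (assumed data).
PREF = [
    [0.5, 0.7, 0.3],  # A vs (A, B, C)
    [0.3, 0.5, 0.7],  # B vs (A, B, C)
    [0.7, 0.3, 0.5],  # C vs (A, B, C)
]

if __name__ == "__main__":
    print("cycle gap (0 under BT):", bt_cycle_gap(0.7, 0.7, 0.7))
    print("uniform policy exploitability:", max_advantage(PREF, [1 / 3] * 3))
    print("pure policy exploitability:", max_advantage(PREF, [1.0, 0.0, 0.0]))
```

In this toy setting, always picking one image is exploitable, while the uniform mixture is not, which is the kind of solution a Nash-equilibrium formulation targets and a single scalar reward cannot express.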

Computer Vision, Multimodal Models, RLHF & Preference Learning

Citation Metrics

Citations: 0

Influential citations: 0

References: 47

Year: 2026

Venue: N/A
