This paper addresses the challenge of aligning denoising generative models with human preferences using online reinforcement learning. The authors identify that ELBO-based policy gradient methods, previously considered unstable and inefficient, can outperform MDP-based methods when combined with variance reduction and controlled gradient steps. They introduce Variational GRPO (V-GRPO), which integrates ELBO surrogates with the Group Relative Policy Optimization (GRPO) algorithm, achieving state-of-the-art results in text-to-image synthesis with significant speedups.
ELBO-based RL, previously dismissed for generative model alignment, can actually beat MDP-based methods with the right tricks.
Aligning denoising generative models with human preferences or verifiable rewards remains a key challenge. While policy-gradient online reinforcement learning (RL) offers a principled post-training framework, its direct application is hindered by the intractable likelihoods of these models. Prior work therefore either optimizes an induced Markov decision process (MDP) over sampling trajectories, which is stable but inefficient, or uses likelihood surrogates based on the diffusion evidence lower bound (ELBO), which have so far underperformed on visual generation. Our key insight is that the ELBO-based approach can, in fact, be made both stable and efficient. By reducing surrogate variance and controlling gradient steps, we show that this approach can outperform MDP-based methods. To this end, we introduce Variational GRPO (V-GRPO), a method that integrates ELBO-based surrogates with the Group Relative Policy Optimization (GRPO) algorithm, alongside a set of simple yet essential techniques. Our method is easy to implement, aligns with pretraining objectives, and avoids the limitations of MDP-based methods. V-GRPO achieves state-of-the-art performance in text-to-image synthesis, while delivering a 2× speedup over MixGRPO and a 3× speedup over DiffusionNFT.
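To make the idea concrete, here is a minimal sketch (not the authors' released code) of how a GRPO-style update can use an ELBO surrogate in place of the intractable likelihood. The denoiser signature `eps_theta(x_t, t, cond)`, the cosine noise schedule, and the `n_mc` averaging are illustrative assumptions; the abstract only specifies that surrogate variance is reduced and gradient steps are kept controlled.

```python
# Illustrative sketch of an ELBO-surrogate GRPO loss for a diffusion model.
# Assumptions (not from the paper): eps_theta(x_t, t, cond) predicts noise,
# a cosine schedule is used, and variance is reduced by averaging n_mc draws.
import torch

def group_relative_advantages(rewards: torch.Tensor) -> torch.Tensor:
    # GRPO: normalize rewards within a group of samples for the same prompt,
    # so no learned value function is required.
    return (rewards - rewards.mean()) / (rewards.std() + 1e-8)

def elbo_surrogate(eps_theta, x0: torch.Tensor, cond, n_mc: int = 4) -> torch.Tensor:
    # Monte Carlo estimate of the negated denoising loss, a tractable
    # surrogate for log p_theta(x0 | cond) up to weighting and constants.
    # Averaging several (t, eps) draws per sample is one simple way to
    # reduce surrogate variance.
    total = torch.zeros(x0.shape[0], device=x0.device)
    for _ in range(n_mc):
        t = torch.rand(x0.shape[0], device=x0.device)            # t ~ U(0, 1)
        eps = torch.randn_like(x0)                                # eps ~ N(0, I)
        alpha = torch.cos(0.5 * torch.pi * t).view(-1, 1, 1, 1)  # illustrative schedule
        sigma = torch.sin(0.5 * torch.pi * t).view(-1, 1, 1, 1)
        x_t = alpha * x0 + sigma * eps
        err = (eps_theta(x_t, t, cond) - eps) ** 2
        total = total + err.flatten(1).mean(dim=1)
    return -total / n_mc  # higher is better

def v_grpo_style_loss(eps_theta, samples, cond, rewards):
    # Weight each sample's ELBO surrogate by its group-relative advantage.
    # In practice one would also clip ratios / take small, controlled steps,
    # as the abstract emphasizes.
    adv = group_relative_advantages(rewards).detach()
    return -(adv * elbo_surrogate(eps_theta, samples, cond)).mean()
```

The key design point this sketch captures is that the policy-gradient weight multiplies a pretraining-style denoising objective rather than a per-step MDP likelihood, which is why the approach stays aligned with the pretraining objective and avoids optimizing over full sampling trajectories.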