Tsinghua AI | BAAI | BUPT | Apr 20, 2026 | arXiv:2604.18518

UDM-GRPO: Stable and Efficient Group Relative Policy Optimization for Uniform Discrete Diffusion Models

Jiaqi Wang, Haoge Deng, Ting Pan, Yang Liu, Yang Liu, Chengyu Wang, Chengyuan Wang, Fan Zhang, Yonggang Qi, Xinlong Wang

Beijing University of Posts and Telecommunications, Beijing Academy of Artificial Intelligence

AI Summary

This paper introduces UDM-GRPO, a reinforcement learning framework for Uniform Discrete Diffusion Models (UDMs) that addresses the training instability and marginal gains observed when GRPO is applied to UDMs directly. UDM-GRPO rests on two key insights: treating the final clean sample as the action gives more stable optimization, and reconstructing trajectories via the diffusion forward process aligns them better with the pretraining distribution. Experiments on text-to-image (T2I) and OCR tasks show GenEval accuracy increasing from 69% to 96% and OCR accuracy improving from 8% to 57%, which the authors report as state-of-the-art results.

Key Contribution

RL fine-tuning of discrete diffusion models can be made dramatically more stable and effective by treating the final denoised sample as the action and reconstructing trajectories using the forward diffusion process.

Abstract

The Uniform Discrete Diffusion Model (UDM) has recently emerged as a promising paradigm for discrete generative modeling; however, its integration with reinforcement learning (RL) remains largely unexplored. We observe that naively applying GRPO to UDM leads to training instability and marginal performance gains. To address this, we propose UDM-GRPO, the first framework to integrate UDM with RL. Our method is guided by two key insights: (i) treating the final clean sample as the action provides more accurate and stable optimization signals; and (ii) reconstructing trajectories via the diffusion forward process better aligns probability paths with the pretraining distribution. Additionally, we introduce two strategies, Reduced-Step and CFG-Free, to further improve training efficiency. UDM-GRPO significantly improves base model performance across multiple text-to-image (T2I) tasks. Notably, GenEval accuracy improves from $69\%$ to $96\%$ and PickScore increases from $20.46$ to $23.81$, achieving state-of-the-art performance in both continuous and discrete settings. On the OCR benchmark, accuracy rises from $8\%$ to $57\%$, further validating the generalization ability of our method. Code is available at https://github.com/Yovecent/UDM-GRPO.
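
The two insights in the abstract can be illustrated with a minimal sketch. It is not taken from the authors' repository. It assumes the standard GRPO practice of standardizing rewards within a group of samples for the same prompt, and the usual uniform-corruption forward process, in which each token is kept with some probability and otherwise replaced by a token drawn uniformly from the vocabulary. The function names, the `keep_probs` schedule, and the choice to noise each timestep independently from the clean sample are all hypothetical choices made for illustration.

```python
import random
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of samples for the same prompt.

    Assumption: the common GRPO form (r - mean) / (std + eps).
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


def uniform_forward_noise(clean_tokens, keep_prob, vocab_size, rng):
    """Corrupt a clean token sequence with uniform noise.

    Each token is kept with probability keep_prob; otherwise it is replaced
    by a token drawn uniformly from range(vocab_size).
    """
    return [
        tok if rng.random() < keep_prob else rng.randrange(vocab_size)
        for tok in clean_tokens
    ]


def reconstruct_trajectory(clean_tokens, keep_probs, vocab_size, seed=0):
    """Build noisy states from the final clean sample via the forward process.

    Assumption: one independent draw per timestep, ordered as in keep_probs.
    """
    rng = random.Random(seed)
    return [
        uniform_forward_noise(clean_tokens, p, vocab_size, rng)
        for p in keep_probs
    ]


if __name__ == "__main__":
    rewards = [0.2, 0.9, 0.5, 0.4]          # hypothetical rewards for one group
    print(group_relative_advantages(rewards))
    clean = [3, 1, 4, 1, 5, 9, 2, 6]        # hypothetical final clean sample
    for state in reconstruct_trajectory(clean, [0.9, 0.5, 0.1], vocab_size=16):
        print(state)
```

In this sketch the reward is attached to the clean sample, and the noisy states used for the policy update are derived from that sample rather than recorded during reverse-process sampling, which is one way to read insight (ii).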

Architecture Design (Transformers, SSMs, MoE) | Computer Vision | RLHF & Preference Learning | Training Efficiency & Optimization

Citation Metrics

Citations: 0

Influential citations: 0

References: 47

Year: 2026

Venue: N/A
