This paper introduces UDM-GRPO, a reinforcement learning framework for Uniform Discrete Diffusion Models (UDMs) that addresses the training instability and marginal gains observed when GRPO is applied to them directly. UDM-GRPO rests on two insights: treating the final clean sample as the action gives more accurate and stable optimization signals, and reconstructing trajectories via the diffusion forward process better aligns probability paths with the pretraining distribution. In experiments on text-to-image (T2I) and OCR tasks, GenEval accuracy increases from 69% to 96%, which the authors report as state-of-the-art, and OCR accuracy improves from 8% to 57%.
RL fine-tuning of discrete diffusion models can be made dramatically more stable and effective by treating the final denoised sample as the action and reconstructing trajectories using the forward diffusion process.
The Uniform Discrete Diffusion Model (UDM) has recently emerged as a promising paradigm for discrete generative modeling; however, its integration with reinforcement learning remains largely unexplored. We observe that naively applying GRPO to UDM leads to training instability and marginal performance gains. To address this, we propose UDM-GRPO, the first framework to integrate UDM with RL. Our method is guided by two key insights: (i) treating the final clean sample as the action provides more accurate and stable optimization signals; and (ii) reconstructing trajectories via the diffusion forward process better aligns probability paths with the pretraining distribution. Additionally, we introduce two strategies, Reduced-Step and CFG-Free, to further improve training efficiency. UDM-GRPO significantly improves base model performance across multiple T2I tasks. Notably, GenEval accuracy improves from $69\%$ to $96\%$ and PickScore increases from $20.46$ to $23.81$, achieving state-of-the-art performance in both continuous and discrete settings. On the OCR benchmark, accuracy rises from $8\%$ to $57\%$, further validating the generalization ability of our method. Code is available at https://github.com/Yovecent/UDM-GRPO.
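The two insights can be illustrated with a minimal NumPy sketch. It is not the authors' implementation (see the linked repository for that). It assumes the standard GRPO group-standardized advantage and a generic uniform-noise corruption in which each token is kept with some probability and otherwise resampled uniformly from the vocabulary. The function names, the `keep_probs` schedule, and the choice to noise each level independently from the clean sample are all hypothetical choices made for illustration.

```python
import numpy as np


def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: standardize rewards within one prompt's group."""
    rewards = np.asarray(rewards, dtype=np.float64)
    return (rewards - rewards.mean()) / (rewards.std() + eps)


def uniform_forward_noise(x0, keep_prob, vocab_size, rng):
    """Corrupt clean tokens x0 with uniform noise.

    Each token is kept with probability keep_prob; otherwise it is replaced
    by a token drawn uniformly from the vocabulary (which may coincide with
    the original token).
    """
    x0 = np.asarray(x0)
    keep = rng.random(x0.shape) < keep_prob
    noise = rng.integers(0, vocab_size, size=x0.shape)
    return np.where(keep, x0, noise)


def reconstruct_trajectory(x0, keep_probs, vocab_size, rng):
    """Rebuild noisy states from the final clean sample x0.

    Assumption: one independent forward-noised state per noise level,
    rather than the intermediate states visited during sampling.
    """
    return [uniform_forward_noise(x0, p, vocab_size, rng) for p in keep_probs]


rng = np.random.default_rng(0)
x0 = rng.integers(0, 16, size=(4, 8))   # 4 clean samples for one prompt
rewards = [0.2, 0.9, 0.5, 0.4]          # one scalar reward per clean sample
adv = group_relative_advantages(rewards)
traj = reconstruct_trajectory(x0, keep_probs=[0.75, 0.5, 0.25],
                              vocab_size=16, rng=rng)
```

Here the reward, and therefore the advantage, is attached to the clean sample `x0` as a whole, and the noisy states used for the policy update are drawn from the forward process instead of being replayed from the sampler. How the paper weights those states in its loss is not stated in the abstract, so it is left out of the sketch.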