Ant Group, Fudan, HKUST | May 25, 2026 | arXiv:2605.25638

Reinforcement Learning from Denoising Feedback

Qi He, Huan Chen, Ya Guo, Huijia Zhu, Yi R. Fung, Baojian Zhou

AI Summary

This paper introduces Reinforcement Learning from Denoising Feedback (RLDF), a new RL training paradigm for diffusion language models that estimates policy loss by optimizing towards a clipped clean state from intermediate noisy states, using weighted timestep sampling. RLDF balances computational efficiency and estimation effectiveness, leading to improved performance and generalization. Experiments on LLaDA and Dream show consistent improvements on reasoning benchmarks.

Key Contribution

RLDF uses denoising feedback from rollout and training to estimate the policy loss for diffusion language models, optimizing toward a clipped clean state from intermediate noisy states with weighted timestep sampling.

Abstract

Policy loss estimation remains a fundamental and long-standing challenge in reinforcement learning (RL) for diffusion language models (dLLMs). We introduce Reinforcement Learning from Denoising Feedback (RLDF), a novel training paradigm that leverages feedback obtained from rollout and training processes to facilitate accurate and efficient policy loss estimation. To balance the trade-off between computational efficiency and estimation effectiveness, RLDF optimizes the model toward the clipped clean state $\hat{x}_0$ from intermediate noisy states $x_t$, combined with weighted timestep sampling over $t$. Extensive experiments demonstrate that RLDF achieves consistent and substantial improvements in both performance and generalizability across two representative dLLM architectures, LLaDA and Dream, on multiple reasoning benchmarks. Our work lays a principled foundation for scalable reinforcement learning in diffusion language models. We build Drift, a training framework for dLLMs, available at https://github.com/ant-research/Drift.
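To make the phrase "weighted timestep sampling over $t$" from intermediate noisy states $x_t$ more concrete, here is a minimal sketch. It is not the RLDF implementation: the timestep grid, the weights, the `[MASK]` token, and all function names are hypothetical, and the construction of the clipped clean state $\hat{x}_0$ is not shown because the abstract does not specify it. The only modeling assumption is the usual masked-diffusion forward process, in which each token of a clean sequence is masked independently with probability $t$.

```python
import random

MASK = "[MASK]"  # hypothetical mask token, for illustration only


def sample_timestep(timesteps, weights, rng):
    """Draw one timestep t from a discrete grid, with probability proportional to its weight."""
    return rng.choices(timesteps, weights=weights, k=1)[0]


def mask_tokens(tokens, t, rng):
    """Noise a clean sequence x_0 into x_t by masking each token independently with probability t."""
    return [MASK if rng.random() < t else tok for tok in tokens]


def noisy_states(tokens, timesteps, weights, n, seed=0):
    """Return n pairs (t, x_t), where t is drawn by weighted sampling over the grid."""
    rng = random.Random(seed)
    states = []
    for _ in range(n):
        t = sample_timestep(timesteps, weights, rng)
        states.append((t, mask_tokens(tokens, t, rng)))
    return states


if __name__ == "__main__":
    x0 = ["2", "+", "3", "=", "5"]
    # Assumed grid and weights; the weighting used by RLDF is not given in the abstract.
    for t, xt in noisy_states(x0, timesteps=[0.25, 0.5, 0.75], weights=[1, 2, 1], n=3):
        print(t, xt)
```

In a training loop, each sampled pair $(t, x_t)$ would be an input from which the model is optimized toward the target state; the weights control how often each noise level contributes to the loss estimate.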

Natural Language Processing, RLHF & Preference Learning, Training Efficiency & Optimization

