ZJU | Apr 21, 2026 | arXiv:2604.19009

Guiding Distribution Matching Distillation with Gradient-Based Reinforcement Learning

Linwei Dong, Ruoyu Guo, Ge Bai, Zehuan Yuan, Yawei Luo, Changqing Zou

AI Summary

This paper introduces Gradient-guided Distribution Matching Distillation (GDMD), a framework for diffusion distillation that uses reinforcement learning to improve few-step image generation. GDMD addresses the instability of naive RL-based distillation by treating DMD gradients as implicit target tensors that reward models can score, so that distillation updates are evaluated directly rather than through noisy raw samples. Experiments show that GDMD achieves state-of-the-art performance in few-step generation: its 4-step models surpass the quality of their multi-step teacher and exceed previous DMDR results.

Key Contribution

Instead of scoring noisy raw samples, GDMD lets reward models evaluate the *gradients* of diffusion distillation directly, leading to SOTA few-step image generation.

Abstract

Diffusion distillation, exemplified by Distribution Matching Distillation (DMD), has shown great promise in few-step generation but often sacrifices quality for sampling speed. While integrating Reinforcement Learning (RL) into distillation offers potential, a naive fusion of these two objectives relies on suboptimal raw sample evaluation. This sample-based scoring creates inherent conflicts with the distillation trajectory and produces unreliable rewards due to the noisy nature of early-stage generation. To overcome these limitations, we propose GDMD, a novel framework that redefines the reward mechanism by prioritizing distillation gradients over raw pixel outputs as the primary signal for optimization. By reinterpreting the DMD gradients as implicit target tensors, our framework enables existing reward models to directly evaluate the quality of distillation updates. This gradient-level guidance functions as an adaptive weighting that synchronizes the RL policy with the distillation objective, effectively neutralizing optimization divergence. Empirical results show that GDMD sets a new SOTA for few-step generation. Specifically, our 4-step models outperform the quality of their multi-step teacher and substantially exceed previous DMDR results in GenEval and human-preference metrics, exhibiting strong scalability potential.
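
The abstract gives no equations, so the following is only a minimal NumPy sketch of one way the idea could be read, not the authors' implementation. It assumes the DMD gradient is turned into an implicit target by subtracting it from the generated sample, that a reward model scores both the sample and that target, and that the reward gain is squashed by a sigmoid into a per-sample weight on the distillation loss. The names `implicit_target`, `toy_reward`, `gradient_level_weights`, and `weighted_dmd_loss`, the sigmoid weighting, and the `temperature` parameter are all hypothetical choices made for illustration.

```python
import numpy as np


def implicit_target(x, dmd_grad):
    """Assumed reading: the DMD gradient defines a target x - grad.

    In a real training loop this target would be treated as a constant
    (no gradient flows through it).
    """
    return x - dmd_grad


def toy_reward(images):
    """Hypothetical stand-in for a reward model.

    Scores each image in a batch (N, C, H, W); higher is better.
    Here the 'ideal' image is a constant 0.5, purely for illustration.
    """
    return -np.mean((images - 0.5) ** 2, axis=(1, 2, 3))


def gradient_level_weights(x, dmd_grad, reward_fn, temperature=1.0):
    """Assumed weighting: sigmoid of the reward gain of target over sample."""
    target = implicit_target(x, dmd_grad)
    gain = reward_fn(target) - reward_fn(x)
    return 1.0 / (1.0 + np.exp(-gain / temperature))


def weighted_dmd_loss(x, dmd_grad, weights):
    """Per-sample regression toward the implicit target, scaled by weights."""
    target = implicit_target(x, dmd_grad)
    per_sample = 0.5 * np.mean((x - target) ** 2, axis=(1, 2, 3))
    return float(np.mean(weights * per_sample))


# Two toy samples: the first gradient moves toward the 'ideal' image,
# the second moves away from it.
x = np.zeros((2, 1, 2, 2))
dmd_grad = np.stack([
    np.full((1, 2, 2), -0.5),  # target = 0.5, reward improves
    np.full((1, 2, 2), 0.5),   # target = -0.5, reward degrades
])
weights = gradient_level_weights(x, dmd_grad, toy_reward)
loss = weighted_dmd_loss(x, dmd_grad, weights)
```

Under these assumptions, an update whose implicit target the reward model prefers receives a weight above 0.5 and one it disfavors receives a weight below 0.5, which is one plausible form of the "adaptive weighting" the abstract mentions.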

Inference & Quantization | RLHF & Preference Learning | Training Efficiency & Optimization

Citation Metrics

Citations: 0

Influential citations: 0

References: 59

Year: 2026

Venue: N/A
