HKUST, Tencent AI, Westlake | May 25, 2026 | arXiv:2605.26108

Reinforcing Few-step Generators via Reward-Tilted Distribution Matching

Yushi Huang, Xiangxin Zhou, Ruoyu Wang, Chi Zhang, Jun Zhang, Tianyu Pang

AI Summary

The paper introduces Reward-Tilted Distribution Matching Distillation (RTDMD), a two-stage framework for aligning few-step diffusion models with human preferences by unifying distribution matching distillation with reward-guided reinforcement learning. RTDMD minimizes the KL divergence to a reward-tilted teacher distribution, decomposing it into distribution matching (via Ambient-Consistent Distribution Matching Distillation) and reward maximization (using a hybrid policy gradient with step-subset GRPO). Experiments show RTDMD achieves state-of-the-art results on SD3, SD3.5, and FLUX.2 with only 4 inference steps across preference, aesthetic, and compositional metrics.
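The decomposition mentioned above can be sketched with the standard exponential-tilting identity. The notation here is an assumption for illustration and is not taken from the paper: p_θ denotes the few-step generator, p_tea the teacher, r a reward model, and β > 0 a temperature. Text conditioning is omitted, and the paper's own parameterization may differ.

```latex
% Assumed notation (not from the paper): p_\theta generator, p_{\mathrm{tea}} teacher,
% r reward model, \beta > 0 temperature, Z normalizing constant.
\begin{align*}
p^{\star}(x) &= \frac{1}{Z}\, p_{\mathrm{tea}}(x)\, \exp\!\big(r(x)/\beta\big),
\qquad Z = \mathbb{E}_{x \sim p_{\mathrm{tea}}}\!\big[\exp\!\big(r(x)/\beta\big)\big], \\
\mathrm{KL}\big(p_\theta \,\|\, p^{\star}\big)
&= \mathbb{E}_{x \sim p_\theta}\!\big[\log p_\theta(x) - \log p_{\mathrm{tea}}(x) - r(x)/\beta\big] + \log Z \\
&= \underbrace{\mathrm{KL}\big(p_\theta \,\|\, p_{\mathrm{tea}}\big)}_{\text{distribution matching}}
\;-\; \frac{1}{\beta}\,\underbrace{\mathbb{E}_{x \sim p_\theta}\big[r(x)\big]}_{\text{reward maximization}}
\;+\; \log Z .
\end{align*}
```

Since log Z does not depend on θ, minimizing the left-hand side under these assumptions amounts to matching the teacher while maximizing expected reward, with β setting the trade-off.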

Key Contribution

RTDMD distills diffusion models into 4-step generators and reports state-of-the-art results among few-step text-to-image methods on preference, aesthetic, and compositional metrics.

Abstract

Recent advances in few-step diffusion distillation have enabled efficient image generation, yet aligning these models with human preferences remains challenging. We propose Reward-Tilted Distribution Matching Distillation (RTDMD), a two-stage framework that unifies distribution matching distillation with reward-guided reinforcement learning for few-step flow generators. We show that minimizing the KL divergence to a reward-tilted teacher distribution naturally decomposes into a distribution matching term and a reward maximization term. In the first stage, we introduce Ambient-Consistent Distribution Matching Distillation (AC-DMD), which performs subinterval-wise distribution matching and augments the fake score objective with a consistency regularizer to help the fake score model track the shifting generator distribution under limited updates. In the second stage, we jointly optimize both terms: for the reward maximization term, we derive a hybrid policy gradient that combines a GRPO-style estimator for the stochastic intermediate transitions with direct reward backpropagation through the deterministic final step, and further introduce step-subset GRPO (SubGRPO) to reduce variance. Experiments on SD3, SD3.5, and FLUX.2 demonstrate that RTDMD establishes new state-of-the-art results across preference, aesthetic, and compositional metrics with only 4 inference steps, outperforming previous few-step text-to-image generation methods. Code and models are available at https://github.com/Harahan/RTDMD.
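As a rough illustration of the GRPO-style ingredient named in the abstract, the sketch below computes group-relative advantages for a set of samples generated from the same prompt and picks a random subset of stochastic steps to train on. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names, the uniform subset sampling, and the count of 3 stochastic transitions for a 4-step generator are all hypothetical, since the abstract does not specify how SubGRPO selects steps.

```python
import random
from statistics import fmean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of samples from the same prompt.

    This is the usual GRPO-style advantage: subtract the group mean and
    divide by the group (population) standard deviation.
    """
    mean = fmean(rewards)
    std = pstdev(rewards)
    # eps guards against division by zero when all rewards are equal.
    return [(r - mean) / (std + eps) for r in rewards]


def sample_step_subset(num_stochastic_steps, subset_size, rng):
    """Hypothetical step selection: pick a uniform random subset of the
    stochastic transitions to receive policy-gradient updates.
    The paper's actual selection rule is not given in the abstract."""
    return sorted(rng.sample(range(num_stochastic_steps), subset_size))


if __name__ == "__main__":
    rng = random.Random(0)
    # Assumed setup: 4 inference steps, the last one deterministic,
    # leaving 3 stochastic intermediate transitions.
    steps = sample_step_subset(num_stochastic_steps=3, subset_size=2, rng=rng)
    advantages = group_relative_advantages([0.2, 0.5, 0.9, 0.4])
    print("steps used for the policy gradient:", steps)
    print("advantages:", [round(a, 3) for a in advantages])
```

In this sketch each sample's advantage would weight the log-probability gradient of only the selected transitions, while the deterministic final step would instead receive the reward gradient directly, as the abstract describes.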

Computer Vision | Inference & Quantization | RLHF & Preference Learning

Citation Metrics

Citations: 0
Influential citations: 0
References: 0
Year: 2026
Venue: N/A
