DAMOCUHKHKUSTMay 6, 2026arXiv:2605.05204

D-OPSD: On-Policy Self-Distillation for Continuously Tuning Step-Distilled Diffusion Models

Dengyang Jiang, Xin Jin, Dongyang Liu, Zanyi Wang, Mingzhe Zheng, Ruoyi Du, Xiangpeng Yang, Qilong Wu, Zhen Li, Peng Gao, Harry Yang, Steven Hoi

AI Summary

This paper introduces D-OPSD, a novel on-policy self-distillation training paradigm for continuously fine-tuning step-distilled diffusion models without compromising their few-step inference capabilities. The method leverages the in-context capabilities inherited from LLM/VLM encoders to enable the model to act as both teacher and student, conditioned on different contexts (text vs. multimodal features). By minimizing the distributional difference between the student's and teacher's rollouts, D-OPSD allows for learning new concepts and styles while preserving the efficiency of few-step generation.

Key Contribution

Fine-tuning efficient few-step diffusion models no longer requires sacrificing their speed, thanks to a self-distillation approach that preserves inference capabilities.

Abstract

The landscape of high-performance image generation models is currently shifting from the inefficient multi-step ones to the efficient few-step counterparts (e.g, Z-Image-Turbo and FLUX.2-klein). However, these models present significant challenges for directly continuous supervised fine-tuning. For example, applying the commonly used fine-tuning technique would compromises their inherent few-step inference capability. To address this, we propose D-OPSD, a novel training paradigm for step-distilled diffusion models that enables on-policy learning during supervised fine-tuning. We first find that the modern diffusion model where the LLM/VLM serves as the encoder can inherit its encoder's in-context capabilities. This enables us to make the training as an on-policy self-distillation process. Specifically, during training, we make the model acts as both the teacher and the student with different contexts, where the student is conditioned only on the text feature, while the teacher is conditioned on the multimodal feature of both the text prompt and the target image. Training minimizes the two predicted distributions over the student's own roll-outs. By optimized on the model's own trajectory and under it's own supervision, D-OPSD enables the model to learn new concept, style, etc. without sacrificing the original few-step capacity.

Computer Vision Inference & Quantization Training Efficiency & Optimization

Citation Metrics

Citations0

Influential citations0

References0

Year2026

VenueN/A

Related Papers

Finding related papers...

Search

D-OPSD: On-Policy Self-Distillation for Continuously Tuning Step-Distilled Diffusion Models

Related Papers