Tsinghua AI, ELLIS, Max Planck, NTU, PKU, TU Munich, Tubingen AI Center, UBC, UT Austin
Jun 16, 2026
arXiv:2606.18195

Learning from the Self-future: On-policy Self-distillation for dLLMs

Yifu Luo, Zeyu Chen, Haoyu Wang, Xinhao Hu, Yuxuan Zhang, Zhizhou Sha, Shiwei Liu

AI Summary

This paper introduces d-OPSD, an on-policy self-distillation framework specifically designed for diffusion large language models (dLLMs), addressing the limitations of existing autoregressive-centric OPSD methods. By utilizing self-generated answers for suffix conditioning and shifting supervision from token-level to step-level, d-OPSD allows models to learn from their "self future-experience," aligning training with the iterative denoising process. Experimental results demonstrate that d-OPSD significantly enhances sample efficiency, outperforming traditional RLVR and SFT baselines while requiring only about 10% of the optimization steps of RLVR.

Key Contribution

d-OPSD enables dLLMs to learn from their own future outputs, drastically improving sample efficiency and performance in reasoning tasks.

Abstract

On-policy self-distillation (OPSD) has proven effective for post-training large language models (LLMs), yet its application to diffusion LLMs (dLLMs) remains unexplored. Existing OPSD methods are inherently autoregressive-centric. They inject privileged information via left-to-right prefix conditioning with token-level divergence supervision, a design that fundamentally conflicts with the arbitrary-order generation of dLLMs. We introduce d-OPSD, the first OPSD framework tailored for dLLMs. Our approach makes two core contributions. First, we reframe self-teacher construction by using self-generated answers as suffix conditioning, enabling the student model to learn from "self future-experience" rather than privileged prefixes. Second, we shift supervision from token-level to step-level, aligning training with the iterative denoising process of dLLMs. Experiments across four reasoning benchmarks show that d-OPSD consistently outperforms RLVR and SFT baselines with superior sample efficiency, requiring only around 10% of the optimization steps that RLVR needs and opening a promising pathway for dLLM post-training. The code is available at https://github.com/xingzhejun/d-OPSD.
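The two ideas in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the abstract does not specify the objective, so the helper names (`build_views`, `step_level_loss`), the token layout, the KL direction (teacher to student), and the averaging over masked positions are all assumptions made for illustration. The teacher and student are taken to be the same model, with the teacher additionally seeing a self-generated answer appended as a suffix. The divergence is aggregated over every position still masked at one denoising step, not token by token from left to right.

```python
import math

MASK = "<mask>"  # hypothetical mask token for the toy example


def build_views(prompt, partial_response, self_answer):
    """Build the inputs for one denoising step (assumed layout).

    The student sees the prompt and the partially denoised response.
    The teacher sees the same tokens plus the model's own answer as a suffix.
    """
    student_view = prompt + partial_response
    teacher_view = prompt + partial_response + self_answer
    return student_view, teacher_view


def kl(p, q, eps=1e-12):
    """KL(p || q) between two discrete distributions given as lists."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0)


def step_level_loss(partial_response, teacher_probs, student_probs):
    """Average KL(teacher || student) over all positions masked at this step.

    teacher_probs and student_probs map a response position to a distribution
    over a (toy) vocabulary. Averaging over masked positions is an assumption.
    """
    masked = [i for i, tok in enumerate(partial_response) if tok == MASK]
    if not masked:
        return 0.0
    return sum(kl(teacher_probs[i], student_probs[i]) for i in masked) / len(masked)


# Toy usage: two masked positions and a two-word vocabulary.
prompt = ["What", "is", "2+3", "?"]
partial = ["The", MASK, "is", MASK]
answer = ["Answer:", "5"]

student_view, teacher_view = build_views(prompt, partial, answer)
teacher_probs = {1: [0.9, 0.1], 3: [0.8, 0.2]}
student_probs = {1: [0.5, 0.5], 3: [0.8, 0.2]}
loss = step_level_loss(partial, teacher_probs, student_probs)
```

In this toy case the loss is positive only because the student disagrees with the suffix-conditioned teacher at position 1. In a real setup the distributions would come from forward passes of the dLLM on the two views, with gradients flowing only through the student.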

Inference & Quantization, Natural Language Processing

