Tsinghua AI | Apr 14, 2026 | arXiv:2604.13016

Rethinking On-Policy Distillation of Large Language Models: Phenomenology, Mechanism, and Recipe

Yuxin Zuo, Bingxiang He, Jinqian Zhang, Chaojun Xiao, Cheng Qian, Tianyu Yu, Huan-ang Gao, Wenkai Yang, Ning Ding

AI Summary

This paper investigates the training dynamics of on-policy distillation (OPD) for LLMs and identifies two conditions for success: the student and teacher must share compatible thinking patterns, and the teacher must offer genuinely new capabilities. Through weak-to-strong reverse distillation, the authors show that same-family 1.5B and 7B teachers can be distributionally indistinguishable from the student's perspective, which underlines the importance of novelty. They further propose off-policy cold start and teacher-aligned prompt selection to recover failing OPD, and note potential limitations of OPD for long-horizon distillation.

Key Contribution

OPD's "free lunch" of dense token-level reward may be an illusion, as teacher novelty, not just higher scores, drives successful distillation.

Abstract

On-policy distillation (OPD) has become a core technique in the post-training of large language models, yet its training dynamics remain poorly understood. This paper provides a systematic investigation of OPD dynamics and mechanisms. We first identify that two conditions govern whether OPD succeeds or fails: (i) the student and teacher should share compatible thinking patterns; and (ii) even with consistent thinking patterns and higher scores, the teacher must offer genuinely new capabilities beyond what the student has seen during training. We validate these findings through weak-to-strong reverse distillation, showing that same-family 1.5B and 7B teachers are distributionally indistinguishable from the student's perspective. Probing into the token-level mechanism, we show that successful OPD is characterized by progressive alignment on high-probability tokens at student-visited states, a small shared token set that concentrates most of the probability mass (97%-99%). We further propose two practical strategies to recover failing OPD: off-policy cold start and teacher-aligned prompt selection. Finally, we show that OPD's apparent free lunch of dense token-level reward comes at a cost, raising the question of whether OPD can scale to long-horizon distillation.
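The token-level picture in the abstract can be illustrated with a minimal sketch. This is not the paper's implementation: the function names, the toy logits, and the choice of `k` are hypothetical, and the per-token reward `log p_teacher - log p_student` on a student-sampled token is assumed here as a common OPD-style formulation that the abstract itself does not specify. The sketch measures how much probability mass each model places on the set of top-k tokens the two models share at a single student-visited state.

```python
import math

def softmax(logits):
    """Convert raw logits into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_ids(probs, k):
    """Return the ids of the k most probable tokens as a set."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return set(order[:k])

def shared_mass(student_probs, teacher_probs, k):
    """Mass that each model assigns to the tokens in both top-k sets."""
    shared = top_k_ids(student_probs, k) & top_k_ids(teacher_probs, k)
    s_mass = sum(student_probs[i] for i in shared)
    t_mass = sum(teacher_probs[i] for i in shared)
    return shared, s_mass, t_mass

def token_reward(student_probs, teacher_probs, token_id):
    """Assumed dense reward for one student-sampled token:
    log p_teacher(token) - log p_student(token)."""
    return math.log(teacher_probs[token_id]) - math.log(student_probs[token_id])

# Toy next-token logits over a 5-token vocabulary (illustrative values only).
student = softmax([4.0, 3.0, 0.5, -1.0, -2.0])
teacher = softmax([3.5, 3.8, -0.5, 0.2, -2.0])

shared, s_mass, t_mass = shared_mass(student, teacher, k=2)
print("shared top-k ids:", sorted(shared))
print("student mass on shared set:", round(s_mass, 3))
print("teacher mass on shared set:", round(t_mass, 3))
print("reward for token 1:", round(token_reward(student, teacher, 1), 3))
```

In this toy setting the reward is positive for a token the teacher prefers more than the student does, and negative for one the student over-weights. In a real run these quantities would be computed from model logits along student-generated trajectories.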

Inference & Quantization, Natural Language Processing, Training Efficiency & Optimization

Citation Metrics

Citations: 0

Influential citations: 0

References: 37

Year: 2026

Venue: N/A
