This paper investigates the training dynamics of on-policy distillation (OPD) for LLMs and identifies two conditions for success: the student and teacher must share compatible thinking patterns, and the teacher must offer genuinely new capabilities. Through weak-to-strong reverse distillation, the authors show that same-family 1.5B and 7B teachers can be distributionally indistinguishable from the student's perspective, which underlines the importance of novelty. They further propose off-policy cold start and teacher-aligned prompt selection to recover failing OPD, and note potential limitations of OPD for long-horizon distillation.
OPD's "free lunch" of dense token-level reward may be an illusion, as teacher novelty, not just higher scores, drives successful distillation.
On-policy distillation (OPD) has become a core technique in the post-training of large language models, yet its training dynamics remain poorly understood. This paper provides a systematic investigation of OPD dynamics and mechanisms. We first identify that two conditions govern whether OPD succeeds or fails: (i) the student and teacher should share compatible thinking patterns; and (ii) even with consistent thinking patterns and higher scores, the teacher must offer genuinely new capabilities beyond what the student has seen during training. We validate these findings through weak-to-strong reverse distillation, showing that same-family 1.5B and 7B teachers are distributionally indistinguishable from the student's perspective. Probing into the token-level mechanism, we show that successful OPD is characterized by progressive alignment on the high-probability tokens at student-visited states, which form a small shared token set that concentrates most of the probability mass (97%-99%). We further propose two practical strategies to recover failing OPD: off-policy cold start and teacher-aligned prompt selection. Finally, we show that OPD's apparent free lunch of dense token-level reward comes at a cost, raising the question of whether OPD can scale to long-horizon distillation.
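The two token-level quantities mentioned in the abstract can be made concrete with a small sketch. It is illustrative only: the abstract does not give the paper's exact metric definitions, so the top-k overlap, the function names (`shared_token_stats`, `token_reward`), and the toy distributions below are all assumptions. The reward shown is the log-ratio form commonly used for OPD (teacher log-probability minus student log-probability on the token the student sampled), which may differ from the authors' formulation.

```python
import math

def top_k_tokens(probs, k):
    """Return the set of the k token ids with the highest probability."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return set(ranked[:k])

def shared_token_stats(student_probs, teacher_probs, k):
    """Overlap of the two top-k sets and the mass each model puts on it.

    Hypothetical metric: the paper's definition of the "shared token set"
    is not stated in the abstract.
    """
    shared = top_k_tokens(student_probs, k) & top_k_tokens(teacher_probs, k)
    overlap_ratio = len(shared) / k
    student_mass = sum(student_probs[i] for i in shared)
    teacher_mass = sum(teacher_probs[i] for i in shared)
    return shared, overlap_ratio, student_mass, teacher_mass

def token_reward(student_probs, teacher_probs, sampled_token):
    """Dense per-token reward at a student-visited state (assumed form):
    log p_teacher(token) - log p_student(token)."""
    return math.log(teacher_probs[sampled_token]) - math.log(student_probs[sampled_token])

# Toy next-token distributions over a 5-token vocabulary at one state.
student = [0.70, 0.20, 0.05, 0.03, 0.02]
teacher = [0.60, 0.02, 0.30, 0.05, 0.03]

shared, overlap, s_mass, t_mass = shared_token_stats(student, teacher, k=3)
print("shared top-3 tokens:", sorted(shared))
print("overlap ratio:", round(overlap, 3))
print("student mass on shared set:", round(s_mass, 3))
print("teacher mass on shared set:", round(t_mass, 3))

# The reward is positive where the teacher prefers the sampled token more
# than the student does, and negative where it prefers it less.
print("reward for token 2:", round(token_reward(student, teacher, 2), 3))
print("reward for token 0:", round(token_reward(student, teacher, 0), 3))
```

In this toy case the shared set covers only part of each model's mass. Under the abstract's description, successful OPD would push both masses on the shared set toward the reported 97%-99% range as training progresses.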