Ant Group, CUHK, Eastern Institute of Technology, ZJU | Jun 4, 2026 | arXiv:2606.06021

OPRD: On-Policy Representation Distillation

Shenzhi Yang, Guangcheng Zhu, Bowen Song, Haobo Wang, Mingxuan Xia, Xing Zheng, Yingfan Ma, Zhongqi Chen, Weiqiang Wang, Gang Chen

AI Summary

This paper introduces On-Policy Representation Distillation (OPRD), which extends on-policy distillation (OPD) by aligning student and teacher representations in hidden-state space across selected layers, rather than matching next-token probabilities in output space. Doing so avoids the sampling variance of Monte Carlo KL estimates over large vocabularies and exploits per-layer structural information from the teacher. Empirically, OPRD closes the student-teacher gap on AIME 2024/2025 and AIMO, and it trains 1.44x faster with 54% lower memory usage than top-k OPD.

Key Contribution

OPRD closes the student-teacher performance gap on AIME 2024/2025 and AIMO while training 1.44x faster and using 54% less memory than top-k OPD.

Abstract

On-policy distillation (OPD) supervises the student only in output space by matching next-token probabilities. This output-only paradigm has two limits: (1) sampling variance from Monte Carlo KL estimates over large vocabularies (e.g., Qwen's ~150k tokens) persists throughout training, and (2) it treats the teacher as a black box, using only the LM head's output and discarding all intermediate hidden states. We propose On-Policy Representation Distillation (OPRD), which lifts distillation into hidden-state space by aligning student and teacher representations across selected layers on the same rollouts, bypassing the LM head entirely. Theoretically, OPRD eliminates sampling variance and provides richer per-layer structural information. Empirically, OPRD closes the student-teacher gap on AIME 2024/2025 and AIMO, while output-space OPD baselines plateau below the teacher. OPRD also trains 1.44x faster and uses 54% less memory than top-k OPD. Code: https://github.com/ShenzhiYang2000/OPRD.
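
The contrast drawn in the abstract can be illustrated with a small, self-contained sketch. It is not the authors' implementation: it assumes a single-sample reverse-KL estimator (the log-ratio at a token sampled from the student) for the output-space side, and a plain mean-squared error between same-width hidden states for the representation side. The function names, the toy vocabulary size, the layer count, and the hidden width are all hypothetical, and the paper's actual objective, layer selection, and any projection between student and teacher widths may differ.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def exact_reverse_kl(p_student, p_teacher):
    """Full-vocabulary KL(student || teacher); no sampling involved."""
    return sum(ps * math.log(ps / pt) for ps, pt in zip(p_student, p_teacher))


def mc_reverse_kl(p_student, p_teacher, token):
    """Single-sample estimate: the log-ratio at one token drawn from the student."""
    return math.log(p_student[token] / p_teacher[token])


def hidden_state_loss(student_layers, teacher_layers):
    """Mean squared error per layer, averaged over the selected layers.

    Assumption: student and teacher hidden states share the same width.
    """
    per_layer = [
        sum((s - t) ** 2 for s, t in zip(hs, ht)) / len(hs)
        for hs, ht in zip(student_layers, teacher_layers)
    ]
    return sum(per_layer) / len(per_layer)


rng = random.Random(0)
vocab_size = 2000  # toy size; the abstract cites ~150k tokens for Qwen

student_logits = [rng.gauss(0.0, 1.0) for _ in range(vocab_size)]
teacher_logits = [x + rng.gauss(0.0, 0.5) for x in student_logits]
p_s, p_t = softmax(student_logits), softmax(teacher_logits)

exact_kl = exact_reverse_kl(p_s, p_t)
tokens = rng.choices(range(vocab_size), weights=p_s, k=2000)  # on-policy samples
estimates = [mc_reverse_kl(p_s, p_t, tok) for tok in tokens]
mc_mean = sum(estimates) / len(estimates)
mc_var = sum((e - mc_mean) ** 2 for e in estimates) / (len(estimates) - 1)

# Toy hidden states for 3 selected layers of width 16 (hypothetical sizes).
teacher_h = [[rng.gauss(0.0, 1.0) for _ in range(16)] for _ in range(3)]
student_h = [[t + rng.gauss(0.0, 0.1) for t in layer] for layer in teacher_h]
rep_loss = hidden_state_loss(student_h, teacher_h)

print(f"exact KL          : {exact_kl:.4f}")
print(f"MC mean / variance: {mc_mean:.4f} / {mc_var:.4f}")
print(f"hidden-state loss : {rep_loss:.4f}")
```

In this toy setting the single-sample estimates scatter around the exact KL with nonzero variance, whereas the hidden-state loss is a deterministic function of the hidden states for a fixed rollout and never touches the vocabulary dimension.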

Inference & Quantization, Natural Language Processing

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
