This paper introduces SCOPE, a dual-path on-policy distillation method for large language model alignment that adaptively weights KL and MLE losses based on trajectory correctness and perplexity. SCOPE uses teacher-perplexity-weighted KL distillation for incorrect trajectories to emphasize reliable teacher guidance, and student-perplexity-weighted MLE for correct trajectories to focus on low-confidence samples. Experiments across six reasoning benchmarks demonstrate that SCOPE significantly outperforms existing on-policy distillation techniques, achieving an average relative improvement of 11.42% in Avg@32 and 7.30% in Pass@32.
Stop uniformly distilling your LLMs: SCOPE selectively amplifies teacher guidance on incorrect trajectories and reinforces student uncertainty on correct ones, leading to significant gains in reasoning performance.
On-policy reinforcement learning has become the dominant paradigm for reasoning alignment in large language models, yet its sparse, outcome-level rewards make token-level credit assignment notoriously difficult. On-Policy Distillation (OPD) alleviates this by introducing dense, token-level KL supervision from a teacher model, but typically applies this supervision uniformly across all rollouts, ignoring fundamental differences in signal quality. We propose Signal-Calibrated On-Policy Distillation Enhancement (SCOPE), a dual-path adaptive training framework that routes on-policy rollouts by correctness into two complementary supervision paths. For incorrect trajectories, SCOPE performs teacher-perplexity-weighted KL distillation to prioritize instances where the teacher demonstrates genuine corrective capability, while down-weighting unreliable guidance. For correct trajectories, it applies student-perplexity-weighted MLE to concentrate reinforcement on low-confidence samples at the capability boundary rather than over-reinforcing already mastered ones. Both paths employ a group-level normalization to adaptively calibrate weight distributions, accounting for the intrinsic difficulty variance across prompts. Extensive experiments on six reasoning benchmarks show that SCOPE achieves an average relative improvement of 11.42% in Avg@32 and 7.30% in Pass@32 over competitive baselines, demonstrating its consistent effectiveness.
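The dual-path routing described in the abstract can be sketched in code. The following is a minimal NumPy illustration of the core idea, not the paper's implementation: the function name, the inverse-perplexity weighting, the use of forward KL, and the softmax-based group normalization are all assumptions made for clarity, and the actual loss definitions in SCOPE may differ.

```python
import numpy as np

def log_softmax(logits):
    # numerically stable log-softmax over the vocabulary axis
    z = logits - logits.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

def scope_loss(student_logits, teacher_logits, tokens, correct):
    """Illustrative SCOPE-style update over a group of rollouts for one prompt.

    student_logits, teacher_logits: [G, T, V] per-token logits on the rollouts
    tokens:  [G, T] sampled token ids
    correct: [G] bool, outcome-level correctness of each rollout
    """
    G, T, V = student_logits.shape
    s_logp = log_softmax(student_logits)                      # [G, T, V]
    t_logp = log_softmax(teacher_logits)
    rows = np.arange(G)[:, None]
    s_tok = s_logp[rows, np.arange(T), tokens]                # [G, T] log p_s(token)
    t_tok = t_logp[rows, np.arange(T), tokens]                # [G, T] log p_t(token)

    # per-trajectory perplexities of the sampled tokens (sequence-level signals)
    t_ppl = np.exp(-t_tok.mean(axis=1))                       # teacher perplexity
    s_ppl = np.exp(-s_tok.mean(axis=1))                       # student perplexity

    # token-level forward KL(teacher || student), averaged over the sequence
    kl = (np.exp(t_logp) * (t_logp - s_logp)).sum(axis=-1).mean(axis=1)  # [G]
    nll = -s_tok.mean(axis=1)                                 # [G] MLE loss

    def group_weights(raw):
        # group-level normalization: calibrate weights within this prompt's
        # rollout group (a softmax here; the paper's scheme may differ)
        e = np.exp(raw - raw.max())
        return e / e.sum()

    loss = 0.0
    wrong, right = ~correct, correct
    if wrong.any():
        # low teacher perplexity => reliable corrective guidance => larger weight
        w = group_weights(-np.log(t_ppl[wrong]))
        loss += (w * kl[wrong]).sum()
    if right.any():
        # high student perplexity => capability-boundary sample => larger weight
        w = group_weights(np.log(s_ppl[right]))
        loss += (w * nll[right]).sum()
    return loss
```

Routing by `correct` keeps already-mastered correct rollouts from dominating the gradient, while the per-group softmax means weights are always relative to the other rollouts of the same prompt, absorbing prompt-to-prompt difficulty variance.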