Hunan, Jilin, SJTU, TJU, ZJU
Jun 8, 2026
arXiv:2606.09304

SG-OPD: Sign-Gated On-Policy Distillation via Sign-Consistency Gating and Phased Teacher Sampling

Haoran Xu, Hongyu Wang, Yifei Gao, Jiaze Li, Xiaofeng Zhang, Xiaosong Yuan

AI Summary

This paper introduces Sign-Gated On-Policy Distillation (SG-OPD), which uses a binary verifier as a trust signal for the teacher in on-policy distillation. Phased teacher sampling mixes verifier-endorsed teacher rollouts into training at cold-start, and a sign-consistency gate strengthens the distillation update on tokens where the teacher agrees with the verifier-correct direction and dampens it where the teacher disagrees. On competition-level mathematical reasoning benchmarks, SG-OPD outperforms standard OPD by an average of 1.98 and 7.50 points at the per-sample and per-question levels, respectively.

Key Contribution

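SG-OPD uses a binary verifier to decide how far to trust the teacher during on-policy distillation, at both the trajectory level and the token level, and reports average gains over standard OPD of 1.98 (per-sample) and 7.50 (per-question) on mathematical reasoning benchmarks.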

Abstract

On-policy distillation (OPD) trains a student on its own trajectories with dense per-token supervision from a stronger teacher, and often outperforms off-policy distillation and standard reinforcement learning. However, we find that its effectiveness implicitly relies on two assumptions that frequently break in practice: trajectory-level alignment between the student and the teacher, and uniform token-level reliability of the teacher's preferences. We therefore propose Sign-Gated On-Policy Distillation (SG-OPD), which uses a binary verifier as a trust signal for the teacher at two complementary granularities: phased teacher sampling mixes in verifier-endorsed teacher rollouts at cold-start, and a sign-consistency gate extrapolates the distillation update on tokens where the teacher agrees with the verifier-correct direction and interpolates it where it disagrees. Experiments on competition-level mathematical reasoning benchmarks show that SG-OPD consistently outperforms standard OPD, with average gains of 1.98 and 7.50 at the per-sample and per-question levels, respectively.
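The abstract does not give formulas for the two mechanisms, so the following is only a minimal sketch of one possible reading, not the authors' implementation. It assumes the per-token distillation signal is the log-probability gap between teacher and student, that the "verifier-correct direction" is positive for a verified-correct trajectory and negative otherwise, and that the share of teacher rollouts decays linearly after cold-start. The names `teacher_mix_ratio`, `sign_gated_signal`, `warmup_steps`, `start_ratio`, `agree_scale`, and `disagree_scale` are all hypothetical.

```python
import numpy as np


def teacher_mix_ratio(step, warmup_steps=100, start_ratio=0.5):
    """Hypothetical phased-sampling schedule.

    Assumption: the fraction of verifier-endorsed teacher rollouts mixed into
    a batch decays linearly from start_ratio to 0 over warmup_steps.
    """
    if step >= warmup_steps:
        return 0.0
    return start_ratio * (1.0 - step / warmup_steps)


def sign_gated_signal(teacher_logp, student_logp, verifier_correct,
                      agree_scale=1.5, disagree_scale=0.5):
    """Hypothetical sign-consistency gate over one trajectory.

    Assumption: the raw per-token signal is teacher_logp - student_logp on the
    sampled tokens. Tokens whose signal sign matches the verifier direction are
    scaled up (extrapolated, scale > 1); the rest are scaled down
    (interpolated, scale < 1).
    """
    delta = np.asarray(teacher_logp, dtype=float) - np.asarray(student_logp, dtype=float)
    direction = 1.0 if verifier_correct else -1.0
    agree = np.sign(delta) == direction
    scale = np.where(agree, agree_scale, disagree_scale)
    return scale * delta


if __name__ == "__main__":
    # Two tokens: the teacher prefers the first more than the student does,
    # and the second less.
    t_logp = [-1.0, -2.0]
    s_logp = [-2.0, -1.0]
    print(sign_gated_signal(t_logp, s_logp, verifier_correct=True))
    print(sign_gated_signal(t_logp, s_logp, verifier_correct=False))
    print([teacher_mix_ratio(s) for s in (0, 50, 100)])
```

In this sketch the gate only rescales the teacher's signal and never flips its sign, so a teacher that disagrees with the verifier is down-weighted rather than overridden. Whether the paper makes the same choice is not stated in the abstract.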

RLHF & Preference Learning

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
