Eastern Institute of Technology; HKUST; Ningbo Institute of Digital Twin; Ningbo Key Laboratory of Spatial; PolyU

Jun 8, 2026

arXiv:2606.09471

Escaping the KL Agreement Trap in On-Policy Distillation

Haoran Xin, Anhao Zhao, Ying Sun, Jin Li, Xiaoyu Shen, Hui Xiong

AI Summary

This paper addresses a failure mode of on-policy distillation (OPD) called the low-KL agreement trap: once a student-generated rollout drifts into an unrecoverable prefix, the teacher scoring it may locally agree with the degraded state, so reverse KL is low but the supervision carries little corrective signal. The authors introduce KAT (KL Agreement Trap Termination), an online termination rule that detects persistent low-KL agreement with a dynamic, training-adaptive threshold and filters out the resulting weak supervision. Across four mathematical benchmarks, KAT improves avg@k accuracy by 2.66% and pass@k by 3.43% while reducing average rollout length by 59.73%.

Key Contribution

Persistent low-KL agreement between teacher and student provides little corrective signal in OPD. KAT detects this regime online and terminates the rollout, improving accuracy while shortening rollouts.

Abstract

On-policy distillation (OPD) provides dense token-level supervision by asking a teacher to score student-generated rollouts. However, when the student drifts into an unrecoverable prefix, the teacher may locally agree with the degraded state, producing low reverse KL but little corrective training signal. We identify this persistent regime as a low-KL agreement trap. Further analyses show that tokens during and after such traps produce less useful supervision signals. We propose KAT (KL Agreement Trap Termination), an online OPD termination rule that detects persistent low-KL agreement with a dynamic training-adaptive threshold. By filtering weak supervision from degenerate agreement, KAT improves avg@k accuracy by 2.66% and pass@k by 3.43% across four mathematical benchmarks, while reducing average rollout length by 59.73%.
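The abstract does not spell out the termination rule, so the following is only a minimal sketch of the general idea under stated assumptions, not the authors' implementation. It assumes that "persistent" means a run of `patience` consecutive tokens whose reverse KL falls below a threshold, and that the training-adaptive threshold is a fixed fraction of an exponential moving average of the mean per-token KL. The names `find_trap_cutoff`, `AdaptiveThreshold`, `patience`, `scale`, and `decay` are hypothetical.

```python
import math
from typing import Optional, Sequence


def reverse_kl(student_probs: Sequence[float], teacher_probs: Sequence[float]) -> float:
    """Per-token reverse KL, KL(student || teacher), over one vocabulary distribution."""
    eps = 1e-12  # guards against log(0)
    return sum(
        p * (math.log(p + eps) - math.log(q + eps))
        for p, q in zip(student_probs, teacher_probs)
        if p > 0.0
    )


def find_trap_cutoff(
    token_kls: Sequence[float], threshold: float, patience: int
) -> Optional[int]:
    """Return the index where the first run of `patience` consecutive
    low-KL tokens begins, or None if no such run exists.

    Assumption: tokens from this index onward are dropped, since the
    abstract reports that tokens during and after a trap give weaker signal.
    """
    run = 0
    for i, kl in enumerate(token_kls):
        run = run + 1 if kl < threshold else 0
        if run >= patience:
            return i - patience + 1
    return None


class AdaptiveThreshold:
    """Hypothetical adaptive threshold: scale * EMA of mean per-token KL."""

    def __init__(self, scale: float = 0.1, decay: float = 0.9) -> None:
        self.scale = scale
        self.decay = decay
        self.ema: Optional[float] = None

    def update(self, token_kls: Sequence[float]) -> None:
        """Fold one rollout's mean KL into the moving average."""
        mean_kl = sum(token_kls) / len(token_kls)
        if self.ema is None:
            self.ema = mean_kl
        else:
            self.ema = self.decay * self.ema + (1.0 - self.decay) * mean_kl

    @property
    def value(self) -> float:
        """Current threshold; 0.0 before any update, so nothing is cut."""
        return 0.0 if self.ema is None else self.scale * self.ema
```

In a training loop, one would compute per-token reverse KL for each student rollout, call `find_trap_cutoff` with the current `AdaptiveThreshold.value`, and keep only the tokens before the returned index when forming the distillation loss. Tying the threshold to a running KL statistic is one way to keep the rule meaningful as the student approaches the teacher and typical KL values shrink.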

Inference & Quantization; RLHF & Preference Learning

Year: 2026

Venue: N/A
