This paper addresses a failure mode of on-policy distillation (OPD) that the authors call the low-KL agreement trap: once the student drifts into an unrecoverable prefix, the teacher scoring its rollouts locally agrees with the degraded state, so reverse KL stays low while the supervision signal carries little corrective value. The authors introduce KAT (KL Agreement Trap Termination), an online termination rule that detects persistent low-KL agreement with a dynamic, training-adaptive threshold and filters out the weak supervision that follows. Across four mathematical benchmarks, KAT improves avg@k accuracy by 2.66% and pass@k by 3.43% while reducing average rollout length by 59.73%.
Low-KL agreement can trap models in ineffective training regimes, but KAT offers a dynamic solution that boosts accuracy while slashing rollout lengths.
On-policy distillation (OPD) provides dense token-level supervision by asking a teacher to score student-generated rollouts. However, when the student drifts into an unrecoverable prefix, the teacher may locally agree with the degraded state, producing low reverse KL but little corrective training signal. We identify this persistent regime as a low-KL agreement trap. Further analyses show that tokens during and after such traps produce less useful supervision signals. We propose KAT (KL Agreement Trap Termination), an online OPD termination rule that detects persistent low-KL agreement with a dynamic training-adaptive threshold. By filtering weak supervision from degenerate agreement, KAT improves avg@k accuracy by 2.66% and pass@k by 3.43% across four mathematical benchmarks, while reducing average rollout length by 59.73%.
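To make the idea of terminating on persistent low-KL agreement concrete, the sketch below shows one way such a rule could look. It is not the authors' implementation: the abstract does not specify how persistence is measured or how the threshold adapts, so the `patience` window, the quantile-based exponential moving average in `AdaptiveThreshold`, and all function and parameter names are assumptions made for illustration only.

```python
import math
from typing import Optional, Sequence


def reverse_kl(student_probs: Sequence[float], teacher_probs: Sequence[float]) -> float:
    """Reverse KL, KL(student || teacher), at a single token position."""
    eps = 1e-12  # guards against log(0) when the teacher assigns zero mass
    return sum(
        p * (math.log(p + eps) - math.log(q + eps))
        for p, q in zip(student_probs, teacher_probs)
        if p > 0.0
    )


def find_trap_cutoff(
    token_kls: Sequence[float], threshold: float, patience: int
) -> Optional[int]:
    """Return the start index of the first run of `patience` consecutive
    tokens whose KL is below `threshold`, or None if no such run exists.

    ASSUMPTION: "persistent" is modeled as a fixed-length consecutive run.
    """
    run = 0
    for i, kl in enumerate(token_kls):
        run = run + 1 if kl < threshold else 0
        if run >= patience:
            return i - patience + 1
    return None


class AdaptiveThreshold:
    """Hypothetical training-adaptive threshold.

    ASSUMPTION: tracks an exponential moving average of a low quantile of
    the per-token KL values seen in each training batch.
    """

    def __init__(self, init: float, quantile: float = 0.1, momentum: float = 0.9) -> None:
        self.value = init
        self.quantile = quantile
        self.momentum = momentum

    def update(self, batch_kls: Sequence[float]) -> float:
        if not batch_kls:
            return self.value  # nothing observed, keep the current threshold
        ordered = sorted(batch_kls)
        idx = min(len(ordered) - 1, int(self.quantile * len(ordered)))
        self.value = self.momentum * self.value + (1.0 - self.momentum) * ordered[idx]
        return self.value


def truncate_rollout(tokens: list, token_kls: Sequence[float],
                     threshold: float, patience: int) -> list:
    """Drop the tokens from the detected trap onward; keep the rollout if none is found."""
    cutoff = find_trap_cutoff(token_kls, threshold, patience)
    return tokens if cutoff is None else tokens[:cutoff]
```

In a training loop, a rule of this kind would be applied to each student rollout before the distillation loss is computed, so that tokens inside and after a detected trap contribute no gradient. Whether the cutoff should fall at the start or the end of the low-KL run is a design choice this sketch makes arbitrarily.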