This paper introduces Skill-Conditioned Gated Self-Distillation (SGSD), an on-policy self-distillation method that uses an experience-derived skill bank as privileged information to improve LLM reasoning. SGSD retrieves skill-mistake pairs to build a pool of skill-conditioned teachers, uses the verifier to validate each teacher's stance on a student rollout, and applies a gated objective that distills informative teacher-student disagreements while suppressing uncertain or extreme signals. Experiments on mathematical reasoning benchmarks show that SGSD improves over GRPO and remains competitive with answer-conditioned OPSD under a weaker privileged-information assumption.
LLMs can learn to reason better by validating "skill-based teachers" derived from past experiences, even when those teachers are sometimes wrong.
On-policy self-distillation (SD) improves LLM reasoning by using teacher-side privileged information (PI) to turn sparse verifier outcomes into dense token-level supervision. Existing methods usually assume trusted PI, such as reference answers or successful traces. We ask whether PI can instead come from an experience-derived skill bank, where retrieved skills are compact and reusable but may also be irrelevant or misleading. We propose Skill-Conditioned Gated Self-Distillation (SGSD), which formulates skill-based SD as teacher hypothesis validation rather than unconditional imitation. SGSD retrieves skill-mistake pairs, constructs a multi-teacher pool, and lets all skill-conditioned teachers score the same plain-prompt student rollout. The verifier validates each teacher's polarity: supporting a success or suppressing a failure gives positive supervision, while the opposite stance is reversed. A robust gated objective then distills informative teacher-student disagreements while suppressing uncertain or extreme signals. Experiments on multiple mathematical reasoning benchmarks show that SGSD consistently improves over GRPO and remains competitive with answer-conditioned OPSD under a weaker PI assumption. For example, on Qwen3-1.7B, SGSD outperforms GRPO by 6.2% and OPSD by 1.7% on average on AIME24, AIME25, and HMMT25. Our code is available at https://github.com/walawalagoose/SGSD.
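The polarity validation and gating described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the function names, the use of a mean log-probability gap to read a teacher's stance, the hard band gate, and the thresholds `low` and `high` are all assumptions made for illustration. The sketch takes per-token log-probabilities of one student rollout under the plain-prompt student and under each skill-conditioned teacher, plus the verifier outcome, and returns a signed per-token signal.

```python
from typing import List


def teacher_polarity(student_lp: List[float], teacher_lp: List[float],
                     success: bool) -> int:
    """Return +1 if the verifier outcome agrees with the teacher's stance, else -1.

    Assumption: the teacher "supports" the rollout when its mean per-token
    log-probability exceeds the student's, and "suppresses" it otherwise.
    """
    mean_gap = sum(t - s for s, t in zip(student_lp, teacher_lp)) / len(student_lp)
    supports = mean_gap > 0
    # Supporting a success or suppressing a failure is validated (+1);
    # the opposite stance is reversed (-1).
    return 1 if supports == success else -1


def gated_signal(student_lp: List[float], teacher_lp: List[float], success: bool,
                 low: float = 0.1, high: float = 2.0) -> List[float]:
    """Signed per-token signal from one teacher.

    Assumption: a hard gate keeps a token only when the absolute
    teacher-student gap lies in [low, high], dropping near-zero (uncertain)
    and very large (extreme) disagreements.
    """
    polarity = teacher_polarity(student_lp, teacher_lp, success)
    signal = []
    for s, t in zip(student_lp, teacher_lp):
        gap = t - s
        gate = 1.0 if low <= abs(gap) <= high else 0.0
        signal.append(polarity * gate * gap)
    return signal


def pooled_signal(student_lp: List[float], teacher_pool: List[List[float]],
                  success: bool) -> List[float]:
    """Average the gated signal over all teachers in the pool (an assumed
    aggregation rule; every teacher scores the same student rollout)."""
    per_teacher = [gated_signal(student_lp, t, success) for t in teacher_pool]
    return [sum(col) / len(per_teacher) for col in zip(*per_teacher)]
```

Under these assumptions, a positive value at a token suggests raising the student's probability for that token and a negative value suggests lowering it. A practical objective would more likely use a smooth gate and a divergence over full token distributions; the hard threshold is used here only to keep the example short.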