Mar 5, 2026 · arXiv:2603.05433

On-Policy Self-Distillation for Reasoning Compression

Hejian Sang, Yuan Xu, Yuanda Xu, Zhengze Zhou, Ran He, Zhipeng Wang, Jiachen Sun, Jiacheng Sun

AI Summary

On-Policy Self-Distillation for Reasoning Compression (OPSDC) trains a model to reason more concisely by distilling its own concise behavior back into itself. The same model, conditioned on a "be concise" instruction, supplies teacher logits, and training minimizes per-token reverse KL divergence on the student's own rollouts. On Qwen3-8B and Qwen3-14B, OPSDC reduces tokens by 57-59% on MATH-500 while *improving* accuracy by 9-16 points absolute; on AIME 2024, the 14B model gains 10 points with 41% compression.

Key Contribution

Reasoning models aren't just verbose; they're actively *harmed* by their own verbosity. A simple self-distillation trick can compress their outputs by up to 59% while boosting accuracy by up to 16 points.

Abstract

Reasoning models think out loud, but much of what they say is noise. We introduce OPSDC (On-Policy Self-Distillation for Reasoning Compression), a method that teaches models to reason more concisely by distilling their own concise behavior back into themselves. The entire approach reduces to one idea: condition the same model on a "be concise" instruction to obtain teacher logits, and minimize per-token reverse KL on the student's own rollouts. No ground-truth answers, no token budgets, no difficulty estimators. Just self-distillation. Yet this simplicity belies surprising sophistication: OPSDC automatically compresses easy problems aggressively while preserving the deliberation needed for hard ones. On Qwen3-8B and Qwen3-14B, we achieve 57-59% token reduction on MATH-500 while improving accuracy by 9-16 points absolute. On AIME 2024, the 14B model gains 10 points with 41% compression. The secret? Much of what reasoning models produce is not just redundant; it is actively harmful, compounding errors with every unnecessary token.
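The objective named in the abstract, per-token reverse KL between the student's and the teacher's next-token distributions, can be sketched numerically. The following is a minimal illustration and not the authors' implementation: the function names (`log_softmax`, `reverse_kl_per_token`), the array shapes, and the random logits standing in for model outputs are all assumptions made for this example. In a real training loop the teacher logits would come from the same model prompted with the "be concise" instruction and scored on the student's sampled tokens, and would presumably be treated as a constant target.

```python
import numpy as np

def log_softmax(logits):
    """Numerically stable log-softmax over the last (vocabulary) axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

def reverse_kl_per_token(student_logits, teacher_logits):
    """Reverse KL, KL(student || teacher), at each token position.

    Both inputs have shape (num_tokens, vocab_size) and are assumed to be
    scored on the same sequence of tokens (the student's own rollout).
    Returns an array of shape (num_tokens,).
    """
    log_p_student = log_softmax(student_logits)
    log_p_teacher = log_softmax(teacher_logits)
    p_student = np.exp(log_p_student)
    # The expectation is taken under the student distribution, which is
    # what makes this the "reverse" direction of the KL divergence.
    return (p_student * (log_p_student - log_p_teacher)).sum(axis=-1)

# Hypothetical stand-ins for model outputs: 5 token positions, vocabulary of 8.
rng = np.random.default_rng(0)
student_logits = rng.normal(size=(5, 8))
teacher_logits = student_logits + 0.5 * rng.normal(size=(5, 8))

per_token = reverse_kl_per_token(student_logits, teacher_logits)
loss = per_token.mean()  # averaged over the rollout's token positions
```

Averaging over positions is one plausible reduction; the abstract does not say how per-token terms are aggregated, so that choice is also an assumption here.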

Inference & Quantization · Reasoning & Chain-of-Thought · Training Efficiency & Optimization

Citation Metrics

Citations: 0

Influential citations: 0

References: 33

Year: 2026

Venue: N/A
