This paper introduces REOPOLD (Relaxed On-Policy Distillation), a novel framework for stabilizing on-policy distillation by interpreting it as policy optimization with the teacher-student log-likelihood ratio as a token reward. REOPOLD mitigates instability through mixture-based reward clipping, entropy-based token-level dynamic sampling, and an exploration-to-refinement training strategy. Experiments across diverse reasoning tasks demonstrate that REOPOLD achieves superior sample efficiency and enhanced test-time scaling, enabling smaller student models to match larger teacher models with significant inference speedups.
Achieve up to 12x greater sample efficiency in reasoning tasks by relaxing strict imitation constraints in on-policy distillation, enabling smaller models to match the performance of much larger ones.
On-policy distillation is pivotal for transferring reasoning capabilities to capacity-constrained models, yet it remains prone to instability and negative transfer. We show that on-policy distillation can be interpreted, both theoretically and empirically, as a form of policy optimization in which the teacher-student log-likelihood ratio acts as a token reward. From this insight, we introduce REOPOLD (Relaxed On-Policy Distillation), a framework that stabilizes optimization by relaxing the strict imitation constraints of standard on-policy distillation. Specifically, REOPOLD leverages rewards from the teacher moderately and selectively through mixture-based reward clipping, entropy-based token-level dynamic sampling, and a unified exploration-to-refinement training strategy. Empirically, REOPOLD surpasses its baselines with superior sample efficiency during training and enhanced test-time scaling at inference, across mathematical, visual, and agentic tool-use reasoning tasks. In particular, REOPOLD outperforms recent RL approaches, achieving 6.7-12x greater sample efficiency, and enables a 7B student to match a 32B teacher in visual reasoning with a ~3.32x inference speedup.
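The core reinterpretation can be sketched numerically. In the following minimal sketch, the per-token reward is the teacher-student log-likelihood ratio, bounded by a simple clip (a stand-in for the paper's mixture-based reward clipping), and tokens are selected for training where the student's predictive entropy is high. The function names, the clip bounds, and the entropy threshold are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def token_rewards(teacher_logp, student_logp, clip_low=-2.0, clip_high=2.0):
    """Per-token reward as the teacher-student log-likelihood ratio.

    Clipping bounds the reward magnitude; the paper's mixture-based
    clipping is more elaborate, so a hard clip is used here as a
    simplified stand-in.
    """
    r = np.asarray(teacher_logp) - np.asarray(student_logp)
    return np.clip(r, clip_low, clip_high)

def select_tokens_by_entropy(token_entropies, threshold=0.5):
    """Entropy-based token-level dynamic sampling (sketch).

    Keep only tokens where the student's predictive entropy exceeds a
    threshold, i.e. where the student is uncertain and the teacher's
    signal is most informative. The threshold value is an assumption.
    """
    return np.asarray(token_entropies) > threshold
```

Under this view, tokens where the teacher assigns much higher likelihood than the student receive positive reward, while the clip prevents a few extreme ratios from dominating the update.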