Apr 29, 2026
arXiv:2604.27083

Co-Evolving Policy Distillation

Naibin Gu, Chenxu Yang, Qingyi Si, Chuanyu Qin, Dingyu Yao, Peng Fu, Zheng Lin, Weiping Wang, Nan Duan, Jiaqi Wang

AI Summary

This paper analyzes capability loss in Reinforcement Learning with Verifiable Rewards (RLVR) and On-Policy Distillation (OPD) when consolidating multiple expert capabilities into a single model. It identifies an inter-capability divergence cost in mixed RLVR, and large teacher-student behavioral pattern gaps in the pipeline that first trains experts and then applies OPD. To address these issues, the authors propose Co-Evolving Policy Distillation (CoPD), which trains experts in parallel and applies bidirectional OPD during their ongoing RLVR training, yielding more consistent behavioral patterns while preserving complementary knowledge.

Key Contribution

Co-Evolving Policy Distillation (CoPD) trains experts in parallel as mutual teachers, rather than distilling from them only after their training is complete. It achieves all-in-one integration of text, image, and video reasoning capabilities, outperforming mixed RLVR and MOPD baselines and even surpassing domain-specific experts.

Abstract

RLVR and OPD have become standard paradigms for post-training. We provide a unified analysis of these two paradigms in consolidating multiple expert capabilities into a single model, identifying capability loss in different ways: mixed RLVR suffers from inter-capability divergence cost, while the pipeline of first training experts and then performing OPD, though avoiding divergence, fails to fully absorb teacher capabilities due to large behavioral pattern gaps between teacher and student. We propose Co-Evolving Policy Distillation (CoPD), which encourages parallel training of experts and introduces OPD during each expert's ongoing RLVR training rather than after complete expert training, with experts serving as mutual teachers (making OPD bidirectional) to co-evolve. This enables more consistent behavioral patterns among experts while maintaining sufficient complementary knowledge throughout. Experiments validate that CoPD achieves all-in-one integration of text, image, and video reasoning capabilities, significantly outperforming strong baselines such as mixed RLVR and MOPD, and even surpassing domain-specific experts. The model parallel training pattern offered by CoPD may inspire a novel training scaling paradigm.
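The co-evolving loop described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: it assumes tabular softmax policies over a few actions, one reward vector per domain, exact (expectation-based) gradients in place of sampled rollouts, and a reverse-KL distillation term. The names `co_evolve`, `rlvr_grad`, `opd_grad`, `step_size`, and `distill_weight` are hypothetical. Each expert takes an RLVR-style step on its own domain and, in the same step, is distilled from the peer that specializes in every other domain, so teaching is mutual and happens during training rather than after it.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - lse for z in logits]


def softmax(logits):
    """Softmax probabilities derived from log_softmax."""
    return [math.exp(lp) for lp in log_softmax(logits)]


def rlvr_grad(logits, rewards):
    """Exact gradient of expected reward w.r.t. softmax logits (ascent direction)."""
    p = softmax(logits)
    baseline = sum(pk * rk for pk, rk in zip(p, rewards))
    return [pk * (rk - baseline) for pk, rk in zip(p, rewards)]


def opd_grad(student_logits, teacher_logits):
    """Exact gradient of KL(student || teacher) w.r.t. student logits.

    The expectation is taken under the student's own distribution,
    which stands in for on-policy sampling in this toy setting.
    """
    log_p = log_softmax(student_logits)
    log_q = log_softmax(teacher_logits)
    p = [math.exp(lp) for lp in log_p]
    log_ratio = [lp - lq for lp, lq in zip(log_p, log_q)]
    kl = sum(pk * r for pk, r in zip(p, log_ratio))
    return [pk * (r - kl) for pk, r in zip(p, log_ratio)]


def co_evolve(domain_rewards, steps=2000, step_size=0.5, distill_weight=1.0):
    """Train one expert per domain in parallel with mutual distillation.

    Assumption: expert i specializes in domain i and acts as the teacher
    for that domain; all hyperparameters are illustrative.
    """
    n = len(domain_rewards)
    k = len(domain_rewards[0])
    # policies[i][d] holds the logits of expert i on domain d (uniform start).
    policies = [[[0.0] * k for _ in range(n)] for _ in range(n)]
    for _ in range(steps):
        updated = []
        for i in range(n):
            row = []
            for d in range(n):
                logits = policies[i][d]
                if d == i:
                    # RLVR-style step on the expert's own domain.
                    g = rlvr_grad(logits, domain_rewards[d])
                    row.append([z + step_size * gk for z, gk in zip(logits, g)])
                else:
                    # Distillation step from the peer that is still training.
                    g = opd_grad(logits, policies[d][d])
                    row.append([z - step_size * distill_weight * gk
                                for z, gk in zip(logits, g)])
            updated.append(row)
        # Simultaneous update: every expert saw its peers' previous policies.
        policies = updated
    return policies


if __name__ == "__main__":
    experts = co_evolve([[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]])
    for i, expert in enumerate(experts):
        print(i, [[round(p, 3) for p in softmax(l)] for l in expert])
```

In this sketch the teacher for each domain is itself still improving, so the student tracks a moving target that starts close to its own policy. That is only meant to mirror the abstract's point about keeping behavioral patterns consistent; the paper's actual objective and schedule may differ.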

Inference & Quantization
Training Efficiency & Optimization

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
