This paper introduces Globally Normalized Distillation Policy Optimization (GNDPO), a method designed to stabilize on-policy distillation (OPD) for multimodal large language model (MLLM) reasoning. Naive token-level distillation can suffer from gradient instability caused by magnitude misalignment in outlier states; GNDPO mitigates this by transforming raw KL scores into batch-level relative advantages. Experimental results demonstrate that GNDPO significantly enhances training robustness and downstream performance compared to traditional reinforcement learning with verifiable rewards (RLVR).
Gradient explosions in token-level distillation are tamed by a novel normalization technique, leading to robust improvements in multimodal reasoning tasks.
On-policy distillation (OPD) has recently emerged as an important post-training paradigm. By using a stronger teacher model to provide dense, fine-grained supervision for sampled trajectories, OPD offers a clear advantage over reinforcement learning with verifiable rewards (RLVR), which typically depends on sparse binary or outcome-based environmental feedback. However, naive token-level distillation can suffer from gradient instability caused by magnitude misalignment in outlier states. To address this issue, we propose Globally Normalized Distillation Policy Optimization (GNDPO), a practical method that stabilizes optimization by transforming raw KL scores into batch-level relative advantages. This normalization effectively mitigates gradient explosions while retaining the benefits of token-level guidance. Experimental results show that GNDPO substantially improves training robustness and downstream performance across multimodal reasoning tasks. The code is released at https://github.com/OPPO-Mente-Lab/GNDPO.
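To make the idea of "batch-level relative advantages" concrete, the following is a minimal sketch and not the authors' implementation. It assumes the raw per-token score is the teacher-minus-student log-probability of each sampled token (the negative of a single-sample reverse-KL estimate) and that normalization is a plain z-score over every token in the batch. The function names `per_token_scores` and `globally_normalize`, the `eps` constant, and the toy numbers are all hypothetical; the exact formulation used by GNDPO is defined in the paper and the released code.

```python
import math

def per_token_scores(student_logps, teacher_logps):
    """Assumed raw score per sampled token: log p_teacher - log p_student.

    Both inputs are lists of sequences (lists of floats) with matching shapes.
    """
    return [
        [t - s for s, t in zip(s_seq, t_seq)]
        for s_seq, t_seq in zip(student_logps, teacher_logps)
    ]

def globally_normalize(scores, eps=1e-6):
    """Z-score every token against statistics pooled over the whole batch."""
    flat = [x for seq in scores for x in seq]
    mean = sum(flat) / len(flat)
    var = sum((x - mean) ** 2 for x in flat) / len(flat)
    std = math.sqrt(var)
    # Each advantage is relative to the batch, so one outlier token cannot
    # dominate the update through its raw magnitude alone.
    return [[(x - mean) / (std + eps) for x in seq] for seq in scores]

# Toy batch of two sampled trajectories; the second token of the second
# sequence is an outlier where student and teacher disagree strongly.
student_logps = [[-1.0, -2.0], [-0.5, -9.0, -1.5]]
teacher_logps = [[-0.8, -1.0], [-0.6, -0.1, -1.4]]

raw = per_token_scores(student_logps, teacher_logps)
advantages = globally_normalize(raw)
```

In this sketch the raw outlier score is many times larger than the other scores, while its normalized advantage is bounded by the batch statistics (for n tokens, a z-score cannot exceed the square root of n - 1 in absolute value). The relative ordering of tokens is preserved, which is one way to read the abstract's claim that token-level guidance is retained.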