OPPO | Jun 8, 2026 | arXiv:2606.09091

Stabilizing On-Policy Distillation for MLLM Reasoning with Global Normalization

AI Summary

This paper introduces Globally Normalized Distillation Policy Optimization (GNDPO), a method for stabilizing on-policy distillation (OPD) in multimodal large language model (MLLM) reasoning. Naive token-level distillation can suffer from gradient instability caused by magnitude misalignment in outlier states; GNDPO addresses this by transforming raw KL scores into batch-level relative advantages. Experiments show that GNDPO substantially improves training robustness and downstream performance across multimodal reasoning tasks.

Key Contribution

Normalizing raw KL scores into batch-level relative advantages mitigates gradient explosions in token-level distillation while retaining dense token-level guidance, improving robustness on multimodal reasoning tasks.

Abstract

On-policy distillation (OPD) has recently emerged as an important post-training paradigm. By using a stronger teacher model to provide dense, fine-grained supervision for sampled trajectories, OPD offers a clear advantage over reinforcement learning with verifiable rewards (RLVR), which typically depends on sparse binary or outcome-based environmental feedback. However, naive token-level distillation can suffer from gradient instability, due to magnitude misalignment in outlier states. To address this issue, we propose Globally Normalized Distillation Policy Optimization (GNDPO), a practical method that stabilizes optimization by transforming raw KL scores into batch-level relative advantages. This normalization effectively mitigates gradient explosions while retaining the benefits of token-level guidance. Experimental results show that GNDPO substantially improves training robustness and downstream performance across multimodal reasoning tasks. The code is released at https://github.com/OPPO-Mente-Lab/GNDPO.
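The abstract does not give the exact normalization formula, so the following is only a minimal sketch of what "transforming raw KL scores into batch-level relative advantages" could look like. It assumes the raw per-token score is the teacher-minus-student log-probability of the sampled token and that normalization is a z-score over every valid token in the batch; the helper name `global_normalized_advantages`, the `eps` parameter, and the toy numbers are hypothetical and not taken from the paper or its repository.

```python
import numpy as np


def global_normalized_advantages(token_scores, mask, eps=1e-8):
    """Hypothetical helper: z-score per-token scores over all valid tokens in the batch.

    token_scores: array of shape (batch, seq_len) with raw per-token scores.
    mask:         array of shape (batch, seq_len); nonzero marks real (non-padding) tokens.
    """
    scores = np.asarray(token_scores, dtype=np.float64)
    valid = np.asarray(mask, dtype=bool)
    # Statistics are pooled across the whole batch, not computed per sequence.
    mean = scores[valid].mean()
    std = scores[valid].std()
    advantages = np.zeros_like(scores)
    advantages[valid] = (scores[valid] - mean) / (std + eps)
    return advantages


# Toy log-probabilities of the sampled tokens under teacher and student.
# The second sequence contains one outlier token the teacher finds very unlikely.
teacher_logp = np.array([[-0.5, -1.0, -0.2, 0.0],
                         [-0.3, -20.0, -0.4, 0.0]])
student_logp = np.array([[-0.6, -0.9, -0.3, 0.0],
                         [-0.2, -0.5, -0.5, 0.0]])
mask = np.array([[1, 1, 1, 0],
                 [1, 1, 1, 0]], dtype=bool)

# Assumed raw score: a per-token estimate of the negative reverse KL.
raw = teacher_logp - student_logp
adv = global_normalized_advantages(raw, mask)

print("largest raw magnitude:       ", np.abs(raw[mask]).max())
print("largest normalized magnitude:", np.abs(adv[mask]).max())
```

Under these assumptions, the outlier token still receives the strongest (negative) signal, but its magnitude is expressed relative to the batch rather than in raw log-probability units, which is the kind of bounded gradient scale the abstract describes.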

Inference & Quantization | Reasoning & Chain-of-Thought | Training Efficiency & Optimization

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
