CAS · Corresponding Author · Tencent AI · May 27, 2026 · arXiv:2605.28184

Joint Training of Multi-Token Prediction in Reinforcement Learning via Optimal Coefficient Calibration

Zili Wang, Jiajun Chai, Xiaohan Wang, Shiming Xiang, Guojun Yin

AI Summary

This paper analyzes why joint training of Multi-Token Prediction (MTP) with Reinforcement Learning from Verifiable Rewards (RLVR) degrades performance, decomposing the per-step MTP effect into a first-order correlation term and a second-order perturbation penalty. The authors show that existing MTP training regimes fail because they either neglect the correlation or suffer from a persistent quadratic penalty. To address this, they introduce Optimal Coefficient Calibration (OCC), an adaptive scheme that adjusts the MTP coefficient online. Experiments on six competition-level mathematical reasoning benchmarks show that OCC matches or exceeds the detached-MTP baseline.

Key Contribution

Jointly training MTP and RL doesn't have to hurt: an adaptive coefficient calibration scheme matches or exceeds the detach baseline on mathematical reasoning benchmarks.

Abstract

Reinforcement Learning from Verifiable Rewards (RLVR) has emerged as the standard paradigm for improving the reasoning capability of large language models, while Multi-Token Prediction (MTP) has been a widely adopted module in pretraining. Combining them is a natural approach, yet current RL practices detach MTP gradients because joint training degrades performance. We revisit this failure from an optimization perspective. We show that the per-step effect of MTP on the RL objective can be decomposed into two terms: a first-order correlation and a second-order perturbation penalty. This decomposition unifies three MTP training regimes (Detach, Cross-Entropy loss, and Policy loss) and explains why each succeeds or fails. Further analysis of the policy loss reveals that, although it aligns with intuition, performance still degrades: the correlation term decays while the quadratic penalty persists. Guided by this analysis, we propose Optimal Coefficient Calibration (OCC), an adaptive scheme that tracks the optimal coefficient online via a log-probability proxy at negligible cost. Across six competition-level mathematical reasoning benchmarks, OCC consistently matches or exceeds the detach baseline, delivering improved joint MTP-RL training performance.
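The trade-off described in the abstract can be illustrated with a toy quadratic model. This is a minimal sketch under stated assumptions, not the paper's formulation: it assumes the per-step effect of an MTP coefficient `lam` takes the form `lam * corr - 0.5 * penalty * lam**2`. The names `mtp_gain`, `optimal_coefficient`, `corr`, and `penalty` are hypothetical. The sketch does not model the log-probability proxy that OCC uses to estimate the optimum online.

```python
def mtp_gain(lam: float, corr: float, penalty: float) -> float:
    """Toy per-step effect of MTP weight `lam` on the RL objective.

    Assumed form (illustrative only): a first-order correlation term
    minus a second-order perturbation penalty.
    """
    return lam * corr - 0.5 * penalty * lam ** 2


def optimal_coefficient(corr: float, penalty: float) -> float:
    """Maximizer of the toy quadratic over lam >= 0 (requires penalty > 0)."""
    if penalty <= 0:
        raise ValueError("penalty must be positive")
    # Setting d/d(lam) = corr - penalty * lam to zero gives lam = corr / penalty.
    # Clip at zero so a negative correlation falls back to the detach case.
    return max(0.0, corr / penalty)


if __name__ == "__main__":
    fixed_lam = 0.5   # hypothetical fixed coefficient
    penalty = 1.0     # hypothetical penalty that stays constant over training
    # Hypothetical correlation values that decay as training proceeds.
    for corr in (1.0, 0.5, 0.1):
        lam_star = optimal_coefficient(corr, penalty)
        print(
            f"corr={corr:.1f}  "
            f"fixed gain={mtp_gain(fixed_lam, corr, penalty):+.3f}  "
            f"calibrated lam={lam_star:.2f}  "
            f"calibrated gain={mtp_gain(lam_star, corr, penalty):+.3f}"
        )
```

In this toy model a fixed coefficient eventually yields a negative effect once the correlation has decayed far enough, whereas a coefficient that tracks `corr / penalty` never drops below the detach case (`lam = 0`, effect zero).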

Reasoning & Chain-of-Thought · RLHF & Preference Learning · Training Efficiency & Optimization

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
