SJTU | USTC | May 24, 2026 | arXiv:2605.24956

NITP: Next Implicit Token Prediction for LLM Pre-training

Xiangdong Zhang, Debing Zhang, Shaofeng Zhang, Xiaohan Qin, Yu Cheng, Junchi Yan

AI Summary

This paper introduces Next Implicit Token Prediction (NITP), which augments standard next-token prediction with dense continuous supervision in the representation space. The model is trained to predict the implicit semantic content of the next token, using shallow-layer representations from the same model as stable self-supervised targets. This is intended to mitigate the limitations of sparse one-hot supervision and improve generalization. Across dense and MoE models from 0.5B to 9B parameters, NITP improves downstream performance, including a 5.7% absolute improvement on MMLU-Pro for a 9B MoE model, at a cost of approximately 2% additional training FLOPs and no additional inference cost.

Key Contribution

On a 9B MoE model, NITP achieves a 5.7% absolute improvement on MMLU-Pro by augmenting sparse one-hot supervision with dense continuous supervision in the representation space.

Abstract

Standard next-token prediction (NTP) supervises language models solely through discrete labels in the output logit space. We argue that this sparse one-hot supervision leaves the latent representation space under-constrained, allowing hidden states to drift into degenerate and anisotropic configurations that can limit generalization. To address this issue, we propose Next Implicit Token Prediction (NITP), which augments discrete prediction with dense continuous supervision directly in the representation space. NITP trains the model to predict the implicit semantic content of the next token, using shallow-layer representations from the same model as stable self-supervised targets. We provide theoretical analysis showing that NITP regularizes the optimization landscape by mitigating under-constrained degrees of freedom and encouraging a compact, structured representation geometry. Empirically, across dense and MoE models ranging from 0.5B to 9B parameters, NITP consistently improves downstream performance with negligible computational overhead. On a 9B MoE model, NITP achieves a 5.7% absolute improvement on MMLU-Pro, along with gains of 6.4% on C3 and 4.3% on CommonsenseQA, with approximately 2% additional training FLOPs and no additional inference cost. Our implementation is available at https://github.com/aHapBean/NITP.
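The combined objective described in the abstract can be illustrated with a minimal sketch. The abstract does not specify the loss form, so several details here are assumptions made for illustration only: the cosine-distance regression loss, the mixing coefficient `weight`, and all function and argument names (`nitp_style_loss`, `predicted`, `shallow`) are hypothetical and are not taken from the authors' implementation. The sketch uses plain Python lists, so it only shows how the two terms are aligned and combined. In a real training loop the shallow-layer targets would presumably be held fixed (no gradient), which this sketch cannot express.

```python
import math


def cross_entropy(logits, target):
    """Standard NTP term: negative log-softmax probability of the target id."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum_exp - logits[target]


def cosine_distance(u, v, eps=1e-8):
    """1 - cosine similarity; an ASSUMED choice of representation-space loss."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v + eps)


def nitp_style_loss(logits, token_ids, predicted, shallow, weight=0.1):
    """Combine discrete NTP loss with a dense representation-space term.

    logits[t]    : vocabulary scores produced at position t (for token t+1)
    token_ids[t] : the token at position t
    predicted[t] : vector produced at position t as a guess for the implicit
                   content of token t+1 (hypothetical prediction head output)
    shallow[t]   : shallow-layer representation of token t, used as the target
    weight       : hypothetical mixing coefficient, not from the paper
    """
    steps = len(token_ids) - 1  # the last position has no next token
    ntp = sum(
        cross_entropy(logits[t], token_ids[t + 1]) for t in range(steps)
    ) / steps
    implicit = sum(
        cosine_distance(predicted[t], shallow[t + 1]) for t in range(steps)
    ) / steps
    return ntp + weight * implicit, ntp, implicit
```

The key alignment is that the prediction made at position `t` is compared with the shallow-layer representation at position `t + 1`, mirroring how the logits at position `t` are scored against token `t + 1`. Because the extra term is only a training loss, it is consistent with the abstract's statement that there is no additional inference cost.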

Natural Language Processing | Scaling Laws & Emergent Abilities

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
