This paper introduces Token-level Policy Optimization (TEPO), a novel framework addressing the challenge of token-level sparse rewards in chain-of-thought reasoning for LLMs. TEPO links group-level rewards to individual tokens using sequence-level likelihood and employs a token-level KL-Divergence mask constraint to stabilize training. Experiments show TEPO achieves state-of-the-art mathematical reasoning performance and reduces convergence time by 50% compared to GRPO/DAPO.
LLMs can now learn mathematical reasoning 2x faster and with greater stability, thanks to a new token-level policy optimization method.
Group Relative Policy Optimization (GRPO) has significantly advanced the reasoning ability of large language models (LLMs), particularly in their mathematical reasoning performance. However, GRPO and related entropy regularization methods still struggle with token-level sparse rewards, which is an inherent challenge in chain-of-thought (CoT) reasoning. These approaches often rely on undifferentiated token-level entropy regularization, which easily leads to entropy collapse or model degradation under sparse token rewards. In this work, we propose TEPO, a novel token-level framework that (1) leverages sequence-level likelihood to link group-level rewards with individual tokens via token-level aggregation, and (2) introduces a token-level KL-Divergence mask constraint that targets tokens with positive advantages and decreasing entropy to mitigate abrupt policy updates. Experiments demonstrate that TEPO not only achieves state-of-the-art performance on mathematical reasoning benchmarks but also markedly enhances training stability, reducing convergence time by 50% compared with GRPO/DAPO.
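The abstract gives no formulas, so the following is only an illustrative sketch of the two ingredients it names: a GRPO-style group-relative advantage, and a KL penalty applied only to tokens whose response has a positive advantage and whose entropy decreased. All function names are hypothetical, reading "decreasing entropy" as "current-policy entropy below old-policy entropy" is an assumption, and the k3 KL estimator is a common choice rather than something the abstract confirms. It should not be taken as the paper's implementation.

```python
import math

def group_relative_advantages(rewards):
    """GRPO-style advantage: normalize each response's reward within its group."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var)
    return [(r - mean) / (std + 1e-8) for r in rewards]

def kl_mask(advantage, entropy_old, entropy_new):
    """Per-token mask (assumed rule): 1 where the response advantage is
    positive AND the token's entropy dropped under the current policy."""
    return [
        1 if advantage > 0 and h_new < h_old else 0
        for h_old, h_new in zip(entropy_old, entropy_new)
    ]

def masked_kl_penalty(logp_new, logp_ref, mask):
    """Mean k3 KL estimate over masked tokens; 0.0 if no token is selected."""
    total, count = 0.0, 0
    for lp_new, lp_ref, m in zip(logp_new, logp_ref, mask):
        if m:
            log_ratio = lp_ref - lp_new
            # k3 estimator: exp(r) - r - 1, which is always non-negative
            total += math.exp(log_ratio) - log_ratio - 1.0
            count += 1
    return total / count if count else 0.0

# Usage: a group of 4 sampled responses with binary correctness rewards.
advantages = group_relative_advantages([1.0, 0.0, 1.0, 0.0])
mask = kl_mask(advantages[0], [2.0, 1.5, 1.0], [1.8, 1.6, 0.5])
penalty = masked_kl_penalty([-0.5, -1.0, -0.2], [-0.7, -1.0, -0.4], mask)
```

Under these assumptions, the mask leaves tokens from negatively rewarded responses, and tokens whose entropy is rising, unconstrained, which matches the abstract's stated aim of damping abrupt updates only where the policy is being sharpened.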