Tsinghua AIHIT | Feb 16, 2026 | arXiv:2602.14386

Beyond Token-Level Policy Gradients for Complex Reasoning with Large Language Models

Mufan Xu, Kehai Chen, Xuefeng Bai, Zhengyu Niu, Muyun Yang, Tiejun Zhao

AI Summary

The paper introduces Multi-token Policy Gradient Optimization (MPO), a policy gradient method that treats sequences of K consecutive tokens as unified semantic actions to better capture the compositional structure of complex reasoning tasks. MPO addresses the mismatch between token-level optimization and the block-level nature of reasoning, where a single semantic decision spans multiple tokens. Experiments on mathematical reasoning and coding benchmarks demonstrate that MPO outperforms standard token-level policy gradient baselines, highlighting the benefits of optimizing at a higher level of granularity.

Key Contribution

Token-level policy gradients can fall short on complex reasoning tasks; treating sequences of K consecutive tokens as unified actions improves on token-level baselines on mathematical reasoning and coding benchmarks.

Abstract

Existing policy-gradient methods for auto-regressive language models typically treat each next token as a single action in the policy. While effective for many generation tasks, such an approach may not fully capture the structure of complex reasoning tasks, where a single semantic decision is often realized across multiple tokens (for example, when defining variables or composing equations). This introduces a potential mismatch between token-level optimization and the inherently block-level nature of reasoning in these settings. To bridge this gap, we propose Multi-token Policy Gradient Optimization (MPO), a framework that treats sequences of K consecutive tokens as unified semantic actions. This block-level perspective enables our method to capture the compositional structure of reasoning trajectories and supports optimization over coherent, higher-level objectives. Experiments on mathematical reasoning and coding benchmarks show that MPO outperforms standard token-level policy gradient baselines. These results highlight the limitations of token-level policy gradients for complex reasoning and motivate future research to look beyond token-level granularity for reasoning-intensive language tasks.
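The block-level idea in the abstract can be illustrated with a minimal sketch. By the chain rule, the log-probability of a block of K consecutive tokens is the sum of its per-token log-probabilities, so a block can be scored as one action. Everything else below is an assumption for illustration: the function names are hypothetical, and the PPO-style clipped surrogate is a stand-in, since the abstract does not state MPO's actual objective.

```python
import math
from typing import List


def block_log_probs(token_log_probs: List[float], k: int) -> List[float]:
    """Sum per-token log-probs over consecutive blocks of k tokens.

    The final block may be shorter than k if the sequence length
    is not a multiple of k.
    """
    if k < 1:
        raise ValueError("k must be >= 1")
    return [
        sum(token_log_probs[i:i + k])
        for i in range(0, len(token_log_probs), k)
    ]


def block_importance_ratios(
    new_lp: List[float], old_lp: List[float], k: int
) -> List[float]:
    """Per-block ratio pi_new(block) / pi_old(block), computed in log space."""
    if len(new_lp) != len(old_lp) or not new_lp:
        raise ValueError("log-prob lists must be non-empty and equal length")
    new_b = block_log_probs(new_lp, k)
    old_b = block_log_probs(old_lp, k)
    return [math.exp(n - o) for n, o in zip(new_b, old_b)]


def clipped_block_surrogate(
    new_lp: List[float],
    old_lp: List[float],
    advantage: float,
    k: int,
    eps: float = 0.2,
) -> float:
    """Assumed PPO-style clipped surrogate, averaged over blocks.

    This is NOT taken from the paper; it only shows where block-level
    ratios would replace token-level ratios in a familiar objective.
    """
    ratios = block_importance_ratios(new_lp, old_lp, k)
    terms = []
    for r in ratios:
        clipped = max(min(r, 1.0 + eps), 1.0 - eps)
        terms.append(min(r * advantage, clipped * advantage))
    return sum(terms) / len(terms)
```

With k = 1 the grouping is the identity, so the sketch reduces to the usual token-level surrogate; larger k makes each clipping decision apply to a whole block rather than to individual tokens.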

Reasoning & Chain-of-Thought | RLHF & Preference Learning

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
