Dartmouth | Mar 20, 2026 | arXiv:2603.19835

FIPO: Eliciting Deep Reasoning with Future-KL Influenced Policy Optimization

Chiyu Ma, Shuo Yang, Kexin Huang, Jinda Lu, Haoming Meng, Shangshang Wang, Bolin Ding, S. Vosoughi, Guoyin Wang, Jingren Zhou

AI Summary

The paper introduces Future-KL Influenced Policy Optimization (FIPO), a reinforcement learning algorithm that improves reasoning in large language models by addressing the coarse credit assignment of outcome-based rewards, which spread a single global advantage uniformly over every token in a trajectory. FIPO incorporates discounted future-KL divergence into the policy update, creating a dense advantage formulation that re-weights tokens based on their influence on subsequent trajectory behavior. In experiments on Qwen2.5-32B, FIPO extends the average chain-of-thought length from roughly 4,000 to over 10,000 tokens and raises AIME 2024 Pass@1 accuracy from 50.0% to a peak of 58.0%, compared with around 47.0% for DeepSeek-R1-Zero-Math-32B and approximately 56.0% for o1-mini.

Key Contribution

LLMs can reason through chains of thought 2.5x longer and solve more complex math problems by optimizing for the influence of each token on future reasoning steps.

Abstract

We present Future-KL Influenced Policy Optimization (FIPO), a reinforcement learning algorithm designed to overcome reasoning bottlenecks in large language models. While GRPO-style training scales effectively, it typically relies on outcome-based rewards (ORM) that distribute a global advantage uniformly across every token in a trajectory. We argue that this coarse-grained credit assignment imposes a performance ceiling by failing to distinguish critical logical pivots from trivial tokens. FIPO addresses this by incorporating discounted future-KL divergence into the policy update, creating a dense advantage formulation that re-weights tokens based on their influence on subsequent trajectory behavior. Empirically, FIPO enables models to break through the length stagnation seen in standard baselines. Evaluated on Qwen2.5-32B, FIPO extends the average chain-of-thought length from roughly 4,000 to over 10,000 tokens and increases AIME 2024 Pass@1 accuracy from 50.0% to a peak of 58.0% (converging at approximately 56.0%). This outperforms both DeepSeek-R1-Zero-Math-32B (around 47.0%) and o1-mini (approximately 56.0%). Our results suggest that establishing dense advantage formulations is a vital path for evolving ORM-based algorithms to unlock the full reasoning potential of base models. We open-source our training system, built on the verl framework.
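The contrast between a uniform, outcome-based advantage and a dense, per-token one can be illustrated with a small sketch. This is not the paper's formulation, which this page does not give. It assumes that the per-token log-ratio between the updated and the old policy serves as a sample-based KL signal, that the signal is summed over current and later tokens with a hypothetical discount `gamma`, and that the result is mapped to a clipped multiplicative weight. The function names, the exponential mapping, and the clipping bounds are all illustrative assumptions.

```python
import math
from typing import List


def group_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """GRPO-style outcome advantage: normalize each reward within its group."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    return [(r - mean) / (math.sqrt(var) + eps) for r in rewards]


def discounted_future_sum(values: List[float], gamma: float) -> List[float]:
    """out[t] = sum over k >= t of gamma**(k - t) * values[k], right to left."""
    out = [0.0] * len(values)
    running = 0.0
    for t in range(len(values) - 1, -1, -1):
        running = values[t] + gamma * running
        out[t] = running
    return out


def dense_advantages(
    advantage: float,
    logp_new: List[float],
    logp_old: List[float],
    gamma: float = 0.9,      # assumed discount, not from the paper
    clip_lo: float = 0.5,    # assumed clipping bounds, not from the paper
    clip_hi: float = 2.0,
) -> List[float]:
    """Re-weight one trajectory-level advantage into per-token advantages.

    Assumption: the weight of token t is exp(discounted future log-ratio),
    clipped to [clip_lo, clip_hi]. The paper may define it differently.
    """
    log_ratio = [n - o for n, o in zip(logp_new, logp_old)]
    future = discounted_future_sum(log_ratio, gamma)
    weights = [min(max(math.exp(f), clip_lo), clip_hi) for f in future]
    return [advantage * w for w in weights]


if __name__ == "__main__":
    # One trajectory of four tokens with a positive outcome advantage.
    adv = group_advantages([1.0, 0.0, 0.0, 1.0])[0]
    logp_old = [-1.2, -0.8, -2.0, -0.5]
    logp_new = [-1.0, -0.8, -1.5, -0.5]
    print(dense_advantages(adv, logp_new, logp_old))
```

When the new and old policies agree on every token, all weights equal 1 and the sketch reduces to the uniform advantage that the abstract attributes to GRPO-style training.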

Reasoning & Chain-of-Thought | RLHF & Preference Learning | Training Efficiency & Optimization

Citation Metrics

Citations: 0

Influential citations: 0

References: 36

Year: 2026

Venue: N/A
