Ant Group, ECNU, Illinois Institute of Technology, PolyU, Shanghai Innovation, Xiamen University

Apr 15, 2026

arXiv:2604.13902

DiPO: Disentangled Perplexity Policy Optimization for Fine-grained Exploration-Exploitation Trade-Off

Xiaofan Li, Ming Yang, Zhiyuan Ma, Shichao Ma, Jintao Du, Xin Tan, Yanyun Qu, Lizhuang Ma, Yuan Xie

AI Summary

This paper addresses the critical challenge of managing the exploration and exploitation trade-off in Reinforcement Learning with Verifiable Rewards (RLVR) for Large Language Models (LLMs). By introducing a perplexity space disentangling strategy, the authors effectively categorize samples into high and low perplexity subspaces, facilitating a more nuanced exploration-exploitation balance. Experimental results on mathematical reasoning and function calling tasks reveal that their approach significantly enhances LLM performance, underscoring its practical utility in fine-grained policy optimization.

Key Contribution

Fine-grained control of the exploration-exploitation balance can improve LLM reasoning capabilities, as shown by the proposed perplexity-guided strategy.

Abstract

Reinforcement Learning with Verifiable Rewards (RLVR) has catalyzed significant advances in the reasoning capabilities of Large Language Models (LLMs). However, effectively managing the exploration-exploitation trade-off remains a critical challenge. In this paper, we analyze the exploration-exploitation dilemma posed by extremely hard and extremely easy samples during training and propose a new fine-grained trade-off mechanism. Concretely, we introduce a perplexity space disentangling strategy that divides the sample space into distinct exploration (high-perplexity) and exploitation (low-perplexity) subspaces, thereby mining the fine-grained samples that require an exploration-exploitation trade-off. We then propose a bidirectional reward allocation mechanism that has minimal impact on verification rewards to implement perplexity-guided exploration and exploitation, enabling more stable policy optimization. Finally, we evaluate our method on two mainstream tasks, mathematical reasoning and function calling; the experimental results demonstrate the superiority of the proposed method and confirm that a fine-grained exploration-exploitation trade-off enhances LLM performance.
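The abstract gives no formulas, so the following is only a minimal sketch of the two ideas it names: splitting sampled responses into high- and low-perplexity subspaces, and adding a small perplexity-guided term to the verification reward. Everything specific here is an assumption made for illustration and is not taken from the paper: the function names, the median threshold, the constant `eps`, and the caller-supplied `mode` that decides which subspace is favoured. Perplexity itself is the standard definition, the exponential of the mean negative token log-probability.

```python
import math
from statistics import median


def perplexity(token_logprobs):
    """Standard perplexity of one response: exp(mean negative log-probability)."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def split_by_perplexity(ppls, threshold=None):
    """Partition sample indices into (high, low) perplexity subspaces.

    The median threshold is an assumption; the abstract does not say
    how the boundary between the two subspaces is chosen.
    """
    if threshold is None:
        threshold = median(ppls)
    high = [i for i, p in enumerate(ppls) if p > threshold]
    low = [i for i, p in enumerate(ppls) if p <= threshold]
    return high, low


def shaped_rewards(verify_rewards, ppls, mode, eps=0.1, threshold=None):
    """Add a bounded, perplexity-dependent term to each verification reward.

    mode="explore" favours the high-perplexity subspace, mode="exploit"
    favours the low-perplexity one. Both the modes and eps are hypothetical.
    With 0/1 verification rewards and eps < 0.5, a verified-correct response
    always keeps a higher shaped reward than an incorrect one.
    """
    if mode not in ("explore", "exploit"):
        raise ValueError("mode must be 'explore' or 'exploit'")
    high, _ = split_by_perplexity(ppls, threshold)
    high = set(high)
    shaped = []
    for i, reward in enumerate(verify_rewards):
        in_high = i in high
        favoured = in_high if mode == "explore" else not in_high
        shaped.append(reward + (eps if favoured else -eps))
    return shaped


if __name__ == "__main__":
    ppls = [1.0, 4.0, 2.0, 3.0]
    rewards = [1, 1, 0, 0]  # binary verification outcomes
    print(shaped_rewards(rewards, ppls, mode="explore"))
```

Keeping `eps` well below the gap between verification outcomes is one simple way to read "minimal impact on verification rewards": the perplexity term only reorders responses that the verifier scored identically.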

RLHF & Preference Learning

Training Efficiency & Optimization

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
