This paper addresses the exploration-exploitation trade-off in Reinforcement Learning with Verifiable Rewards (RLVR) for Large Language Models (LLMs). The authors introduce a perplexity space disentangling strategy that splits samples into a high-perplexity (exploration) subspace and a low-perplexity (exploitation) subspace, so that the trade-off can be handled at a finer granularity. Experiments on mathematical reasoning and function calling indicate that the approach improves LLM performance, which supports its use for fine-grained policy optimization.
Balancing exploration and exploitation at a finer granularity, guided by sample perplexity, can improve LLM reasoning capabilities.
Reinforcement Learning with Verifiable Rewards (RLVR) has catalyzed significant advances in the reasoning capabilities of Large Language Models (LLMs). However, effectively managing the exploration-exploitation trade-off remains a critical challenge. In this paper, we analyze the exploration-exploitation dilemma posed by extremely hard and extremely easy samples during training and propose a new fine-grained trade-off mechanism. Concretely, we introduce a perplexity space disentangling strategy that divides the sample space into distinct exploration (high-perplexity) and exploitation (low-perplexity) subspaces, thereby identifying the fine-grained samples that require an exploration-exploitation trade-off. We then propose a bidirectional reward allocation mechanism that has minimal impact on the verification rewards and implements perplexity-guided exploration and exploitation, enabling more stable policy optimization. Finally, we evaluate our method on two mainstream tasks, mathematical reasoning and function calling. The experimental results demonstrate the superiority of the proposed method and confirm that a fine-grained exploration-exploitation trade-off enhances LLM performance.
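The abstract does not give the formulas, so the following is only a minimal sketch of how a perplexity split and a small, direction-dependent reward adjustment could look. Everything specific here is an assumption for illustration: the function names, the use of the group median as the split threshold, the binary 0/1 verification reward, the bonus size `eps`, and the particular choice of which samples receive an adjustment. It is not the authors' mechanism. Perplexity is computed in the standard way, as the exponential of the mean negative token log-probability.

```python
import math
from statistics import median


def perplexity(token_logprobs):
    """Standard perplexity: exp of the mean negative log-probability per token."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def split_by_perplexity(ppls):
    """Label each sample 'explore' (high perplexity) or 'exploit' (low perplexity).

    Assumption: the threshold is the median perplexity of the sampled group.
    """
    threshold = median(ppls)
    return ["explore" if p > threshold else "exploit" for p in ppls]


def shaped_rewards(verifiable_rewards, ppls, eps=0.05):
    """Hypothetical bidirectional adjustment on top of 0/1 verification rewards.

    Assumed rule, chosen only for illustration:
      - correct sample in the 'explore' subspace   -> reward + eps
      - incorrect sample in the 'exploit' subspace -> reward - eps
      - every other sample keeps its verification reward unchanged
    With eps < 0.5, every correct sample still outranks every incorrect one,
    so the verification signal stays dominant.
    """
    labels = split_by_perplexity(ppls)
    shaped = []
    for reward, label in zip(verifiable_rewards, labels):
        if label == "explore" and reward == 1:
            shaped.append(reward + eps)
        elif label == "exploit" and reward == 0:
            shaped.append(reward - eps)
        else:
            shaped.append(float(reward))
    return shaped


if __name__ == "__main__":
    ppls = [1.2, 3.0, 1.5, 4.0]   # made-up perplexities for four sampled responses
    rewards = [1, 0, 0, 1]        # made-up verification outcomes
    print(split_by_perplexity(ppls))
    print(shaped_rewards(rewards, ppls))
```

Keeping `eps` well below the gap between a correct and an incorrect reward is one simple way to read "minimal impact on verification rewards": the adjustment only reorders samples that share the same verification outcome.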