Beihang, BIT, ZJU | Apr 13, 2026 | arXiv:2604.11510

Policy Split: Incentivizing Dual-Mode Exploration in LLM Reinforcement with Dual-Mode Entropy Regularization

Jiashu Yao, Heyan Huang, Chuwei Luo, Daiqing Wu, Zeming Liu, Yuhang Guo, Yangyang Kang

AI Summary

This paper introduces Policy Split, a reinforcement learning paradigm for LLMs that uses a bifurcated policy with normal and high-entropy modes, each optimized with distinct objectives and collaborative dual-mode entropy regularization. The normal mode focuses on task correctness, while the high-entropy mode prioritizes exploration via a high-entropy prompt. Experiments show Policy Split outperforms entropy-guided RL baselines across model sizes and tasks by facilitating dual-mode exploration with unique learning signals.

Key Contribution

Forget monolithic policies – splitting your LLM's RL policy into accuracy-focused and exploration-driven modes unlocks better performance and diversity.

Abstract

To encourage diverse exploration in reinforcement learning (RL) for large language models (LLMs) without compromising accuracy, we propose Policy Split, a novel paradigm that uses a high-entropy prompt to bifurcate the policy into a normal mode and a high-entropy mode. While sharing model parameters, the two modes undergo collaborative dual-mode entropy regularization tailored to distinct objectives. Specifically, the normal mode optimizes for task correctness, while the high-entropy mode incorporates a preference for exploration, and the two modes learn collaboratively. Extensive experiments demonstrate that our approach consistently outperforms established entropy-guided RL baselines across various model sizes on general and creative tasks. Further analysis reveals that Policy Split facilitates dual-mode exploration, in which the high-entropy mode generates behavioral patterns distinct from those of the normal mode, providing unique learning signals.
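The abstract does not give the form of the regularizer, so the following is only a minimal sketch of the general idea, not the authors' method. It assumes that the mode is selected by prepending a prompt prefix and that each mode adds a plain entropy bonus with its own coefficient. The prompt text, the function names, and the coefficients `beta_normal` and `beta_high` are all hypothetical.

```python
import math

# Hypothetical prefix; the paper's actual high-entropy prompt is not given here.
HIGH_ENTROPY_PROMPT = "Explore unusual approaches.\n"


def build_prompt(task, mode):
    """Select a mode of the shared-parameter policy through the prompt alone."""
    if mode == "high_entropy":
        return HIGH_ENTROPY_PROMPT + task
    if mode == "normal":
        return task
    raise ValueError(f"unknown mode: {mode}")


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def token_entropy(logits):
    """Shannon entropy (in nats) of one next-token distribution."""
    probs = softmax(logits)
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def mean_entropy(logits_per_token):
    """Average token-level entropy over a sampled response."""
    return sum(token_entropy(l) for l in logits_per_token) / len(logits_per_token)


def regularized_objective(task_reward, logits_per_token, mode,
                          beta_normal=0.0, beta_high=0.01):
    """Task reward plus a mode-specific entropy bonus (assumed form).

    The normal mode is driven by correctness alone (beta_normal = 0 by
    default), while the high-entropy mode is additionally rewarded for
    keeping its token distributions spread out.
    """
    beta = beta_high if mode == "high_entropy" else beta_normal
    return task_reward + beta * mean_entropy(logits_per_token)
```

With the default coefficients, the same response scores slightly higher in the high-entropy mode than in the normal mode. That difference is the only mechanism this sketch uses to give the two modes distinct objectives; it does not model how the modes learn collaboratively.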

Natural Language Processing | RLHF & Preference Learning | Training Efficiency & Optimization

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
