The paper introduces Dual-Scale Diversity Regularization (DSDR), a reinforcement learning framework designed to improve exploration in LLM reasoning by decomposing diversity into global (trajectory-level) and local (token-level) components. DSDR encourages diversity among correct reasoning trajectories to explore different solution modes while applying length-invariant token-level entropy regularization within each mode to prevent collapse. Theoretical analysis demonstrates that DSDR preserves optimal correctness under bounded regularization and sustains informative learning signals, and empirical results on reasoning benchmarks show improved accuracy and pass@k.
LLMs can reason better if you force them to explore *different* ways of being right, not just be more random.
Reinforcement learning with verifiers (RLVR) is a central paradigm for improving large language model (LLM) reasoning, yet existing methods often suffer from limited exploration. Policies tend to collapse onto a few reasoning patterns and prematurely stop deep exploration, while conventional entropy regularization introduces only local stochasticity and fails to induce meaningful path-level diversity, leading to weak and unstable learning signals in group-based policy optimization. We propose DSDR, a Dual-Scale Diversity Regularization reinforcement learning framework that decomposes diversity in LLM reasoning into global and local components. Globally, DSDR promotes diversity among correct reasoning trajectories to explore distinct solution modes. Locally, it applies a length-invariant, token-level entropy regularization restricted to correct trajectories, preventing entropy collapse within each mode while preserving correctness. The two scales are coupled through a global-to-local allocation mechanism that emphasizes local regularization for more distinctive correct trajectories. We provide theoretical support showing that DSDR preserves optimal correctness under bounded regularization, sustains informative learning signals in group-based optimization, and yields a principled global-to-local coupling rule. Experiments on multiple reasoning benchmarks demonstrate consistent improvements in accuracy and pass@k, highlighting the importance of dual-scale diversity for deep exploration in RLVR. Code is available at https://github.com/SUSTechBruce/DSDR.
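The abstract describes the two scales and their coupling only in words, so a toy sketch may help make the structure concrete. The code below is not the authors' implementation and is not taken from the linked repository. It assumes, purely for illustration, that each sampled trajectory comes with an embedding vector, that global diversity is measured as mean cosine distance between correct trajectories, and that the global-to-local allocation is a softmax over those distances. All function names (`mean_token_entropy`, `distinctiveness`, `allocation_weights`, `dual_scale_bonus`) and the coefficients `alpha` and `beta` are hypothetical.

```python
import math


def mean_token_entropy(token_dists):
    """Average per-token Shannon entropy of one trajectory.

    Dividing by the number of tokens makes the value length-invariant.
    """
    total = 0.0
    for dist in token_dists:
        total += -sum(p * math.log(p) for p in dist if p > 0.0)
    return total / len(token_dists)


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0.0 else 0.0


def distinctiveness(embeddings, correct):
    """Assumed global score: mean cosine distance from each correct
    trajectory to the other correct trajectories in the group."""
    idx = [i for i, ok in enumerate(correct) if ok]
    scores = {}
    for i in idx:
        others = [j for j in idx if j != i]
        if not others:
            scores[i] = 0.0
            continue
        dists = [1.0 - cosine(embeddings[i], embeddings[j]) for j in others]
        scores[i] = sum(dists) / len(dists)
    return scores


def allocation_weights(scores, temperature=1.0):
    """Assumed coupling rule: softmax over distinctiveness, so more
    distinctive correct trajectories receive more local regularization."""
    if not scores:
        return {}
    top = max(scores.values())
    exps = {i: math.exp((s - top) / temperature) for i, s in scores.items()}
    z = sum(exps.values())
    return {i: e / z for i, e in exps.items()}


def dual_scale_bonus(embeddings, token_dists, correct, alpha=0.1, beta=0.01):
    """Combine a global diversity term and a weighted local entropy term.

    Only correct trajectories contribute; incorrect ones are ignored.
    """
    scores = distinctiveness(embeddings, correct)
    if not scores:
        return 0.0
    weights = allocation_weights(scores)
    global_term = sum(scores.values()) / len(scores)
    local_term = sum(
        weights[i] * mean_token_entropy(token_dists[i]) for i in scores
    )
    return alpha * global_term + beta * local_term
```

In this sketch, restricting both terms to correct trajectories mirrors the abstract's statement that regularization is applied only where the verifier accepts the answer. The actual diversity measure, coupling rule, and how the bonus enters the policy objective are defined in the paper and its repository.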