The paper addresses the "Shallow Exploration Trap" in in-context learning, where autoregressive models struggle to generate the long reasoning trajectories needed for effective state coverage. The authors introduce Length-Incentivized Exploration (LIE), a reinforcement learning approach that rewards longer reasoning trajectories while penalizing redundancy. Experiments on Qwen3 and Llama models show that LIE improves in-context exploration, yielding average gains of 4.4% on in-domain tasks and 2.7% on out-of-domain tasks.
LLMs can be taught to "think longer" and explore more diverse reasoning paths in-context via a simple length-incentivized reward, leading to improved generalization.
Achieving effective test-time scaling requires models to engage in In-Context Exploration, the intrinsic ability to generate, verify, and refine multiple reasoning hypotheses within a single continuous context. Grounded in State Coverage theory, our analysis identifies a critical bottleneck to enabling this capability: while broader state coverage requires longer reasoning trajectories, the probability of sampling such sequences decays exponentially during autoregressive generation, a phenomenon we term the "Shallow Exploration Trap". To bridge this gap, we propose Length-Incentivized Exploration (LIE), a simple yet effective recipe that explicitly encourages models to explore more via a length-based reward coupled with a redundancy penalty, thereby maximizing state coverage in a two-step manner. Comprehensive experiments across different models (Qwen3, Llama) demonstrate that LIE effectively incentivizes in-context exploration. As a result, our method achieves an average improvement of 4.4% on in-domain tasks and a 2.7% gain on out-of-domain benchmarks.
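As a rough illustration of the reward shaping the abstract describes, the sketch below combines a base task reward with a length bonus and an n-gram redundancy penalty. The abstract does not give the paper's actual formulation, so everything here is an assumption: the function name `lie_reward`, the logarithmic form of the bonus, the coefficients `alpha` and `beta`, and the n-gram-based redundancy measure are all illustrative stand-ins.

```python
import math


def lie_reward(trajectory_tokens: list[str],
               base_reward: float,
               alpha: float = 0.01,
               beta: float = 0.5,
               ngram: int = 4) -> float:
    """Hypothetical length-incentivized reward: base task reward plus a
    length bonus, minus a penalty for repeated n-grams in the trajectory.

    Coefficients and functional forms are illustrative assumptions, not
    the paper's actual reward.
    """
    length = len(trajectory_tokens)

    # Length incentive: grows with trajectory length. A logarithmic form
    # is used here to keep the bonus bounded; the paper may use another.
    length_bonus = alpha * math.log(1 + length)

    # Redundancy penalty: fraction of n-grams that are exact repeats,
    # so copy-pasted reasoning does not earn the length bonus for free.
    ngrams = [tuple(trajectory_tokens[i:i + ngram])
              for i in range(max(0, length - ngram + 1))]
    redundancy = 1.0 - len(set(ngrams)) / len(ngrams) if ngrams else 0.0

    return base_reward + length_bonus - beta * redundancy
```

A reward of this shape would slot into a standard policy-gradient loop in place of the plain task reward: a longer, non-repetitive trajectory scores higher than a short one with the same task outcome, while a trajectory padded with repeated n-grams loses most of its length bonus.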