CAS | Apr 15, 2026 | arXiv:2604.14142

From $P(y|x)$ to $P(y)$: Investigating Reinforcement Learning in Pre-train Space

Yuqiao Tan, Minzheng Wang, Bo Liu, Zichen Liu, Tian Liang

AI Summary

The paper introduces PreRL, a method that applies reinforcement learning directly to the marginal distribution P(y) in the pre-train space of LLMs. It targets a limitation of RL with verifiable rewards (RLVR), whose gains are bounded by the base model's existing output distribution. The authors show, theoretically and empirically, strong gradient alignment between log P(y) and log P(y|x), which supports optimizing P(y) as a surrogate for standard RL. They further propose Dual Space RL (DSRL), which first uses Negative Sample Reinforcement (NSR) within PreRL to prune incorrect reasoning spaces and then transitions to standard RL. The abstract reports that DSRL consistently outperforms strong baselines.

Key Contribution

LLM reasoning improves when reward-driven RL is first applied to the marginal distribution P(y) in the pre-train space, using negative samples to prune incorrect reasoning, and standard RL on the conditional distribution P(y|x) follows.

Abstract

While reinforcement learning with verifiable rewards (RLVR) significantly enhances LLM reasoning by optimizing the conditional distribution P(y|x), its potential is fundamentally bounded by the base model's existing output distribution. Optimizing the marginal distribution P(y) in the Pre-train Space addresses this bottleneck by encoding reasoning ability and preserving broad exploration capacity. Yet, conventional pre-training relies on static corpora for passive learning, leading to a distribution shift that hinders targeted reasoning enhancement. In this paper, we introduce PreRL (Pre-train Space RL), which applies reward-driven online updates directly to P(y). We theoretically and empirically validate the strong gradient alignment between log P(y) and log P(y|x), establishing PreRL as a viable surrogate for standard RL. Furthermore, we uncover a critical mechanism: Negative Sample Reinforcement (NSR) within PreRL serves as an exceptionally effective driver for reasoning. NSR-PreRL rapidly prunes incorrect reasoning spaces while stimulating endogenous reflective behaviors, increasing transition and reflection thoughts by 14.89x and 6.54x, respectively. Leveraging these insights, we propose Dual Space RL (DSRL), a Policy Reincarnation strategy that initializes models with NSR-PreRL to expand the reasoning horizon before transitioning to standard RL for fine-grained optimization. Extensive experiments demonstrate that DSRL consistently outperforms strong baselines, proving that pre-train space pruning effectively steers the policy toward a refined correct reasoning subspace.
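The gradient-alignment claim and the NSR update can be illustrated on a toy model. The sketch below is not the paper's implementation. It assumes a hypothetical bigram language model, assumes that log P(y) is scored by feeding the response y after a bare start token (no prompt), and assumes that NSR amounts to a gradient step that lowers the log-likelihood of a response judged incorrect. All names (`seq_logprob_and_grad`, `BOS`, `W`, the token ids) are illustrative.

```python
import numpy as np

def log_softmax(z):
    # Numerically stable log-softmax over the last axis.
    z = z - z.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

def seq_logprob_and_grad(W, start_token, y):
    """Toy bigram model (an assumption): P(next | prev) = softmax(W[prev]).

    Returns the log-probability of the sequence y when generation starts
    from start_token, together with its gradient with respect to W.
    """
    grad = np.zeros_like(W)
    total = 0.0
    prev = start_token
    for tok in y:
        lp = log_softmax(W[prev])
        total += lp[tok]
        # d/dz log softmax(z)[tok] = onehot(tok) - softmax(z)
        g = -np.exp(lp)
        g[tok] += 1.0
        grad[prev] += g
        prev = tok
    return total, grad

def cosine(a, b):
    return float((a * b).sum() / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(0)
V, BOS = 8, 0                      # vocabulary size and start token (hypothetical)
W = rng.normal(size=(V, V))        # logits table of the toy model
x = [3, 5]                         # prompt tokens
y = [2, 4, 1, 6, 2, 7]             # a response, treated here as incorrect

# Conditional score log P(y|x): in a bigram model only the last prompt token matters.
lp_cond, g_cond = seq_logprob_and_grad(W, x[-1], y)
# Marginal-style score log P(y): same response, no prompt, only the start token.
lp_marg, g_marg = seq_logprob_and_grad(W, BOS, y)

alignment = cosine(g_cond, g_marg)

# NSR-style step in the pre-train space (assumed form): descend on log P(y).
lr = 0.1
W_new = W - lr * g_marg
lp_marg_new, _ = seq_logprob_and_grad(W_new, BOS, y)
lp_cond_new, _ = seq_logprob_and_grad(W_new, x[-1], y)

print(f"cosine(grad log P(y|x), grad log P(y)) = {alignment:.3f}")
print(f"log P(y):   {lp_marg:.3f} -> {lp_marg_new:.3f}")
print(f"log P(y|x): {lp_cond:.3f} -> {lp_cond_new:.3f}")
```

In this toy setting the two gradients differ only in the contribution of the first response token, so they are positively aligned, and a step that lowers log P(y) for the incorrect response also lowers log P(y|x). Whether and how strongly this carries over to a full LLM is what the paper investigates; the toy only shows why such a surrogate is plausible.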

Reasoning & Chain-of-Thought | RLHF & Preference Learning | Training Efficiency & Optimization

Citation Metrics

Citations: 0

Influential citations: 0

References: 71

Year: 2026

Venue: N/A
