Tsinghua AI | Mar 9, 2026 | arXiv:2603.08660

How Far Can Unsupervised RLVR Scale LLM Training?

Bingxiang He, Yuxin Zuo, Zeyuan Liu, Shangziqi Zhao, Zixuan Fu, Junlin Yang, Cheng Qian, Kaiyan Zhang, Yuchen Fan, Ganqu Cui, Xiusi Chen, Youbang Sun, Xingtai Lv, Xue-Juan Zhu, Xuekai Zhu, Li Sheng, Ran Li, Huan-ang Gao, Huanlin Gao, Yuchen Zhang, Bowen Zhou, Bo Zhou, Zhiyuan Liu, Ning Ding

AI Summary

This paper analyzes unsupervised reinforcement learning with verifiable rewards (URLVR) for scaling LLM training, categorizing methods by reward source as intrinsic or external. It establishes a theoretical framework showing that intrinsic methods converge toward sharpening the model's initial distribution, succeeding only when initial confidence aligns with correctness. Empirical results show a consistent rise-then-fall performance pattern for intrinsic rewards, with collapse timing determined by the model prior, while external reward methods show preliminary promise in escaping these limitations.

Key Contribution

Intrinsic reward signals in unsupervised RL for LLMs inevitably collapse due to sharpening of the model's prior, but external rewards grounded in computational asymmetries offer a path to sustained scaling.

Abstract

Unsupervised reinforcement learning with verifiable rewards (URLVR) offers a pathway to scale LLM training beyond the supervision bottleneck by deriving rewards without ground truth labels. Recent works leverage model intrinsic signals, showing promising early gains, yet their potential and limitations remain unclear. In this work, we revisit URLVR and provide a comprehensive analysis spanning taxonomy, theory and extensive experiments. We first classify URLVR methods into intrinsic versus external based on reward sources, then establish a unified theoretical framework revealing that all intrinsic methods converge toward sharpening the model's initial distribution. This sharpening mechanism succeeds when initial confidence aligns with correctness but fails catastrophically when misaligned. Through systematic experiments, we show intrinsic rewards consistently follow a rise-then-fall pattern across methods, with collapse timing determined by model prior rather than engineering choices. Despite these scaling limits, we find intrinsic rewards remain valuable in test-time training on small datasets, and propose Model Collapse Step to measure model prior, serving as a practical indicator for RL trainability. Finally, we explore external reward methods that ground verification in computational asymmetries, showing preliminary evidence they may escape the confidence-correctness ceiling. Our findings chart boundaries for intrinsic URLVR while motivating paths toward scalable alternatives.
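The sharpening claim in the abstract can be illustrated with a toy sketch. This is not the paper's framework or code: it assumes a three-answer softmax policy, a majority-vote style intrinsic reward (1 for the policy's current most likely answer, 0 otherwise), and exact policy-gradient ascent. The names `sharpen` and `CORRECT`, the logits, the step count, and the learning rate are all hypothetical choices made for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sharpen(logits, steps=200, lr=1.0):
    """Exact policy-gradient ascent on a toy majority-vote intrinsic reward.

    Assumption: the reward is 1 for the policy's current mode and 0 for
    every other answer, so no ground-truth label is ever consulted.
    """
    logits = list(logits)
    for _ in range(steps):
        probs = softmax(logits)
        mode = max(range(len(probs)), key=probs.__getitem__)
        rewards = [1.0 if i == mode else 0.0 for i in range(len(probs))]
        # Expected reward under the current policy, used as the baseline.
        baseline = sum(p * r for p, r in zip(probs, rewards))
        # d/d logit_i of E[reward] for a softmax policy is p_i * (r_i - baseline).
        logits = [
            l + lr * p * (r - baseline)
            for l, p, r in zip(logits, probs, rewards)
        ]
    return softmax(logits)


CORRECT = 0  # hypothetical index of the correct answer

aligned = sharpen([1.0, 0.5, 0.0])     # initial mode is the correct answer
misaligned = sharpen([0.5, 1.0, 0.0])  # initial mode is a wrong answer

print(f"aligned:    P(correct) = {aligned[CORRECT]:.3f}")
print(f"misaligned: P(correct) = {misaligned[CORRECT]:.3f}")
```

In this toy setting the update only ever increases the probability of the initial mode, so the outcome depends entirely on whether that mode was correct to begin with. The sketch does not reproduce the rise-then-fall dynamics reported in the experiments, which arise in full LLM training.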

RLHF & Preference Learning | Scalable Oversight & Alignment Theory | Training Efficiency & Optimization

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
