London School of Economics and Political; University of Oxford; University of Science and Technology
May 26, 2026
arXiv:2605.27293

BASIS: Batchwise Advantage Estimation from Single-Rollout Information Sharing for LLM Reasoning

Shijin Gong, Erhan Xu, Kai Ye, Francesco Quinzan, Giulia Livieri, Chengchun Shi

AI Summary

The paper introduces BASIS, a critic-free reinforcement learning algorithm for LLMs. It samples a single rollout per prompt and improves value function estimation by sharing information across the prompts in a batch. BASIS reduces MSE in value function estimation by 69% compared to REINFORCE++, and with one rollout it achieves lower MSE than group mean estimators that use 8 rollouts. This leads to improved policy optimization: BASIS achieves performance close to multi-rollout GRPO-type baselines with substantially less training time.

Key Contribution

Single-rollout RL can rival multi-rollout performance for LLM reasoning, thanks to a new batchwise advantage estimation technique that dramatically improves value function accuracy.

Abstract

Reinforcement learning with verifiable rewards has become a standard recipe for improving the reasoning abilities of large language models. Existing algorithms face a tradeoff between computational efficiency and sample efficiency in value estimation and policy learning. We introduce BASIS, a critic-free post-training algorithm designed to address this tradeoff. At each online training step, BASIS samples only one rollout per prompt, but leverages rich information across prompts in the entire batch to improve value function estimation. Our experiments demonstrate that BASIS reduces MSE in value function estimation by 69% compared to REINFORCE++, a representative single-rollout baseline, and achieves lower MSE with one rollout than group mean estimators with 8 rollouts. This improvement in value estimation translates to better policy optimization: using substantially less training time, BASIS achieves performance close to multi-rollout GRPO-type baselines and often outperforms single-rollout REINFORCE-type baselines.
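
To make the tradeoff in the abstract concrete, the sketch below compares the two kinds of reference estimators it mentions: a group mean over several rollouts of the same prompt (GRPO-type) and a single batch-wide mean over one rollout per prompt (used here as a simplified stand-in for a REINFORCE-type baseline). It is not the BASIS estimator, which this page does not describe. The binary rewards, the uniformly drawn "true" pass rates, and all function names are assumptions made for illustration only.

```python
import numpy as np


def group_mean_baseline(rewards):
    """Per-prompt value estimate from k rollouts.

    rewards: array of shape (num_prompts, k).
    Returns an array of shape (num_prompts,).
    """
    return rewards.mean(axis=1)


def batch_mean_baseline(rewards):
    """Value estimate from one rollout per prompt.

    rewards: array of shape (num_prompts,).
    Every prompt receives the same batch-wide mean reward.
    """
    return np.full(rewards.shape, rewards.mean())


def value_mse(estimate, true_value):
    """Mean squared error between estimated and true per-prompt values."""
    return float(np.mean((estimate - true_value) ** 2))


def simulate(num_prompts=256, k=8, seed=0):
    """Toy comparison on synthetic binary rewards (hypothetical setup)."""
    rng = np.random.default_rng(seed)
    # Assumed ground truth: each prompt has its own pass rate in [0, 1].
    p = rng.uniform(0.0, 1.0, size=num_prompts)
    # k independent verifiable (0/1) rewards per prompt.
    multi = rng.binomial(1, p[:, None], size=(num_prompts, k)).astype(float)
    # The single-rollout setting only sees the first rollout of each prompt.
    single = multi[:, 0]
    return {
        "group_mean_k": value_mse(group_mean_baseline(multi), p),
        "batch_mean_1": value_mse(batch_mean_baseline(single), p),
    }


if __name__ == "__main__":
    print(simulate())
```

In this toy setup the group mean pays for its accuracy with k times as many rollouts, while the batch-wide mean is cheap but assigns every prompt the same value, so its error grows with how much the true values differ across prompts. The abstract positions BASIS between these two: one rollout per prompt, with information shared across prompts to sharpen the per-prompt estimate.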

Reasoning & Chain-of-Thought; RLHF & Preference Learning; Training Efficiency & Optimization

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
