CAS
Mar 19, 2026
arXiv:2603.18953

Context Bootstrapped Reinforcement Learning

Saaket Agashe, Jayanth Srinivasa, Gaowen Liu, Ramana Kompella, R. Kompella, Xin Wang, Xin Eric Wang

AI Summary

Context Bootstrapped Reinforcement Learning (CBRL) addresses the exploration challenges in Reinforcement Learning from Verifiable Rewards (RLVR) by stochastically prepending few-shot demonstrations to training prompts, following an annealing curriculum. This method forces the policy to internalize reasoning patterns from demonstrations, rather than relying on them at test time. Experiments across Reasoning Gym tasks and the Q programming language show that CBRL improves success rate and exploration efficiency across different model families and algorithms.

Key Contribution

Injecting demonstrations with a carefully annealed probability can drastically improve exploration in RLVR, even for tasks requiring novel reasoning or domain-specific knowledge.

Abstract

Reinforcement Learning from Verifiable Rewards (RLVR) suffers from exploration inefficiency, where models struggle to generate successful rollouts, resulting in minimal learning signal. This challenge is particularly severe for tasks that require the acquisition of novel reasoning patterns or domain-specific knowledge. To address this, we propose Context Bootstrapped Reinforcement Learning (CBRL), which augments RLVR training by stochastically prepending few-shot demonstrations to training prompts. The injection probability follows a curriculum that starts high to bootstrap early exploration, then anneals to zero so the model must ultimately succeed without assistance. This forces the policy to internalize reasoning patterns from the demonstrations rather than relying on them at test time. We validate CBRL across two model families and five Reasoning Gym tasks. Our results demonstrate that CBRL consistently improves success rate, provides better exploration efficiency, and is algorithm-agnostic. We further demonstrate CBRL's practical applicability on Q, a domain-specific programming language that diverges significantly from mainstream language conventions.
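The injection curriculum described in the abstract could be sketched as follows. This is a minimal illustration, not the authors' implementation: the linear schedule shape, the starting probability `p_start`, the number of demonstrations `k`, and the function names `injection_probability` and `build_prompt` are all assumptions made for the example. The abstract states only that the probability starts high and anneals to zero.

```python
import random


def injection_probability(step, anneal_steps, p_start=0.5):
    """Hypothetical linear schedule: p_start at step 0, reaching 0 at anneal_steps.

    The abstract does not give the schedule shape or starting value; both are assumed.
    """
    remaining = max(0.0, 1.0 - step / max(1, anneal_steps))
    return p_start * remaining


def build_prompt(task_prompt, demonstrations, step, anneal_steps, rng, p_start=0.5, k=2):
    """With the current injection probability, prepend up to k sampled demonstrations."""
    p = injection_probability(step, anneal_steps, p_start)
    if demonstrations and rng.random() < p:
        shots = rng.sample(demonstrations, min(k, len(demonstrations)))
        return "\n\n".join(shots + [task_prompt])
    # No injection: the policy must solve the task without assistance.
    return task_prompt


if __name__ == "__main__":
    rng = random.Random(0)
    demos = ["Q: 1+1\nA: 2", "Q: 2+3\nA: 5", "Q: 4+4\nA: 8"]
    for step in (0, 40, 80):
        print(step, injection_probability(step, anneal_steps=80))
    print(build_prompt("Q: 6+7\nA:", demos, step=80, anneal_steps=80, rng=rng))
```

Once the schedule reaches zero, every training prompt is the bare task, matching the abstract's requirement that the model ultimately succeed without demonstrations in context.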

Reasoning & Chain-of-Thought
RLHF & Preference Learning
Tool Use & Agents
Training Efficiency & Optimization

Citation Metrics

Citations: 0

Influential citations: 0

References: 28

Year: 2026

Venue: N/A
