Context Bootstrapped Reinforcement Learning (CBRL) addresses the exploration challenges in Reinforcement Learning from Verifiable Rewards (RLVR) by stochastically prepending few-shot demonstrations to training prompts, following an annealing curriculum. This method forces the policy to internalize reasoning patterns from demonstrations, rather than relying on them at test time. Experiments across Reasoning Gym tasks and the Q programming language show that CBRL improves success rate and exploration efficiency across different model families and algorithms.
Injecting few-shot demonstrations with a probability that anneals from high to zero improves exploration in RLVR, including on tasks that require novel reasoning patterns or domain-specific knowledge.
Reinforcement Learning from Verifiable Rewards (RLVR) suffers from exploration inefficiency, where models struggle to generate successful rollouts, resulting in minimal learning signal. This challenge is particularly severe for tasks that require the acquisition of novel reasoning patterns or domain-specific knowledge. To address this, we propose Context Bootstrapped Reinforcement Learning (CBRL), which augments RLVR training by stochastically prepending few-shot demonstrations to training prompts. The injection probability follows a curriculum that starts high to bootstrap early exploration, then anneals to zero so the model must ultimately succeed without assistance. This forces the policy to internalize reasoning patterns from the demonstrations rather than relying on them at test time. We validate CBRL across two model families and five Reasoning Gym tasks. Our results demonstrate that CBRL consistently improves success rate, provides better exploration efficiency, and is algorithm-agnostic. We further demonstrate CBRL's practical applicability on Q, a domain-specific programming language that diverges significantly from mainstream language conventions.
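The injection curriculum described above can be illustrated with a minimal sketch. The abstract states only that the probability starts high and anneals to zero; the linear schedule, the starting value of 0.9, the annealing fraction, the number of demonstrations per prompt, and the names `injection_probability` and `build_prompt` are all assumptions made for illustration, not the authors' implementation.

```python
import random


def injection_probability(step, total_steps, p_start=0.9, anneal_fraction=0.5):
    """Hypothetical linear schedule: p_start at step 0, zero once
    anneal_fraction of training has elapsed (shape is an assumption)."""
    anneal_steps = max(1, int(total_steps * anneal_fraction))
    remaining = max(0.0, 1.0 - step / anneal_steps)
    return p_start * remaining


def build_prompt(task_prompt, demonstrations, step, total_steps, rng,
                 k=2, p_start=0.9):
    """With the current injection probability, prepend up to k sampled
    demonstrations to the task prompt; otherwise return it unchanged."""
    p = injection_probability(step, total_steps, p_start=p_start)
    if demonstrations and rng.random() < p:
        shots = rng.sample(demonstrations, min(k, len(demonstrations)))
        return "\n\n".join(shots + [task_prompt])
    return task_prompt


if __name__ == "__main__":
    rng = random.Random(0)
    demos = ["Q: 2+2\nA: 4", "Q: 3+5\nA: 8", "Q: 7-1\nA: 6"]
    for step in (0, 25, 50, 100):
        p = injection_probability(step, total_steps=100)
        prompt = build_prompt("Q: 6+9\nA:", demos, step, 100, rng)
        print(step, round(p, 3), prompt.count("Q:") - 1, "demonstrations")
```

In this sketch, prompts late in training never receive demonstrations, so any reward the policy earns at that point comes from unassisted rollouts, which matches the requirement that the model ultimately succeed without help at test time.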