Snowflake AI Research | UCSD | UIUC | Apr 8, 2026 | arXiv:2605.02913

Generate, Filter, Control, Replay: A Comprehensive Survey of Rollout Strategies for LLM Reinforcement Learning

Rohan Surana, Gagan Mundada, Xunyi Jiang, Chuhan Wang, Zhenwei Tang, Difan Jiao, Zihan Huang, Yuxin Xiong, Junda Wu, Sheldon Yu, Xintong Li, Raghav Jain, N. Kuang, Sizhe Zhou, Bowen Jin, Zhendong Chu, Tong Yu, Ryan A. Rossi, Kuan-Hao Huang, Jingbo Shang, Jiawei Han, Julian McAuley

AI Summary

This survey formalizes rollout strategies for RL-based post-training of reasoning LLMs and introduces Generate-Filter-Control-Replay (GFCR), a taxonomy that decomposes rollout pipelines into four modular stages: generation, filtering, control, and replay. The taxonomy gives a structured view of methods such as RL with verifiable rewards, process supervision, and adaptive compute allocation. The paper grounds GFCR in case studies spanning math, code/SQL, multimodal reasoning, and tool-using agents, and provides a diagnostic index that maps rollout pathologies to GFCR modules, along with open challenges for reproducible and efficient rollout pipelines.

Key Contribution

Rollout design in LLM reinforcement learning is more than just sampling trajectories – it's a modular pipeline you can optimize for reliability, coverage, and cost.

Abstract

Reinforcement learning (RL) has become a central post-training tool for improving the reasoning abilities of large language models (LLMs). In these systems, the rollout, the trajectory sampled from a prompt to termination, including intermediate reasoning steps and optional tool or environment interactions, determines the data the optimizer learns from, yet rollout design is often underreported. This survey provides an optimizer-agnostic view of rollout strategies for RL-based post-training of reasoning LLMs. We formalize rollout pipelines with unified notation and introduce Generate-Filter-Control-Replay (GFCR), a lifecycle taxonomy that decomposes rollout pipelines into four modular stages: Generate proposes candidate trajectories and topologies; Filter constructs intermediate signals via verifiers, judges, critics; Control allocates compute and makes continuation/branching/stopping decisions under budgets; and Replay retains and reuses artifacts across rollouts without weight updates, including self-evolving curricula that autonomously generate new training tasks. We complement GFCR with a criterion taxonomy of reliability, coverage, and cost sensitivity that characterizes rollout trade-offs. Using this framework, we synthesize methods spanning RL with verifiable rewards, process supervision, judge-based gating, guided and tree/segment rollouts, adaptive compute allocation, early-exit and partial rollouts, throughput optimization, and replay/recomposition for self-improvement. We ground the framework with case studies in math, code/SQL, multimodal reasoning, tool-using agents, and agentic skill benchmarks that evaluate skill induction, reuse, and cross-task transfer. Finally, we provide a diagnostic index that maps common rollout pathologies to GFCR modules and mitigation levers, alongside open challenges for building reproducible, compute-efficient, and trustworthy rollout pipelines.
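To make the four GFCR stages concrete, the following is a minimal toy sketch in Python, not the survey's formalism or any published implementation. Everything in it is a hypothetical illustration: the names `Rollout`, `ReplayBuffer`, `generate`, `filter_rollouts`, `control`, and `run_pipeline`, the random-integer "policy", the exact-match verifier, and the fixed sampling budget are all assumptions chosen only to show how the stages could compose.

```python
import random
from dataclasses import dataclass, field


@dataclass
class Rollout:
    prompt: str
    answer: int
    reward: float = 0.0


@dataclass
class ReplayBuffer:
    items: list = field(default_factory=list)

    def add(self, rollout):
        self.items.append(rollout)

    def lookup(self, prompt):
        # Return a previously verified rollout for this prompt, if any.
        for r in self.items:
            if r.prompt == prompt:
                return r
        return None


def generate(prompt, n, rng):
    # Generate: propose n candidate trajectories (toy: random integer answers).
    return [Rollout(prompt, rng.randint(0, 9)) for _ in range(n)]


def filter_rollouts(rollouts, target):
    # Filter: attach a signal from a verifier (toy: exact-match reward).
    for r in rollouts:
        r.reward = 1.0 if r.answer == target else 0.0
    return rollouts


def control(rollouts, spent, budget):
    # Control: stop once an answer is verified or the budget is exhausted.
    solved = any(r.reward > 0 for r in rollouts)
    return solved or spent >= budget


def run_pipeline(prompt, target, buffer=None, budget=32, batch=4, seed=0):
    buffer = buffer if buffer is not None else ReplayBuffer()
    # Replay (reuse): skip sampling when a verified artifact already exists.
    if buffer.lookup(prompt) is not None:
        return buffer, 0
    rng = random.Random(seed)
    spent = 0
    while True:
        n = min(batch, budget - spent)
        candidates = filter_rollouts(generate(prompt, n, rng), target)
        spent += n
        for r in candidates:
            if r.reward > 0:
                buffer.add(r)  # Replay (retain): keep verified rollouts.
        if control(candidates, spent, budget):
            return buffer, spent
```

Calling `run_pipeline("q", 3)` samples in batches until the toy verifier accepts an answer or the budget runs out; passing the returned buffer back in for the same prompt returns immediately with zero samples spent. No weights are updated anywhere, which mirrors the abstract's description of Replay as reuse across rollouts without weight updates. A real pipeline would replace each toy stage with a policy model, a verifier or judge, and a budget-aware scheduler.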

RLHF & Preference Learning | Tool Use & Agents | Training Efficiency & Optimization

Citation Metrics

Citations: 0
Influential citations: 0
References: 177
Year: 2026
Venue: N/A
