The paper introduces PipelineRL, a reinforcement learning approach for training LLMs that maintains high accelerator utilization without generating stale, off-policy data. PipelineRL runs data generation and model training concurrently and asynchronously, using in-flight weight updates that let the LLM generation engine receive fresh model weights with minimal interruption. Experiments on long-form reasoning tasks using 128 H100 GPUs show that PipelineRL learns roughly 2x faster than conventional RL baselines while keeping training data highly on-policy.
Double your RL fine-tuning speed for LLMs with PipelineRL's in-flight weight updates that keep training data fresh.
Reinforcement Learning (RL) is increasingly used to enhance the reasoning capabilities of Large Language Models (LLMs). However, scaling these RL methods effectively presents significant challenges, chief among them the difficulty of maintaining high AI accelerator utilization without generating stale, off-policy data that harms common RL algorithms. This paper introduces PipelineRL, an approach designed to achieve a superior trade-off between hardware efficiency and data on-policyness for LLM training. PipelineRL employs concurrent asynchronous data generation and model training, distinguished by a novel in-flight weight update mechanism. This mechanism allows the LLM generation engine to receive updated model weights with minimal interruption while a token sequence is still being generated, thereby maximizing both accelerator utilization and the freshness of training data. Experiments conducted on long-form reasoning tasks using 128 H100 GPUs demonstrate that PipelineRL achieves approximately 2x faster learning than conventional RL baselines while maintaining highly on-policy training data. A scalable and modular open-source implementation of PipelineRL is also released as a key contribution.
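The core scheduling idea behind in-flight weight updates can be sketched in a few lines. The toy asyncio script below is illustrative only: every name in it (ToyGenerator, weight_box, load_weights, and so on) is hypothetical and not taken from the PipelineRL codebase. It shows only the control flow the abstract describes, where the trainer publishes updated weights while the generator is mid-sequence, and the generator swaps them in between decode steps instead of pausing or restarting generation.

```python
"""Toy sketch of PipelineRL-style in-flight weight updates (not the real implementation)."""
import asyncio


class ToyGenerator:
    def __init__(self):
        self.version = 0  # policy version that will produce the *next* token

    def load_weights(self, version: int):
        self.version = version  # stand-in for an in-place weight swap

    async def generate(self, max_tokens: int, weight_box: dict):
        tokens = []
        for _ in range(max_tokens):
            # In-flight update: pick up fresher weights between decode steps,
            # without discarding the partially generated sequence.
            if weight_box["version"] > self.version:
                self.load_weights(weight_box["version"])
            await asyncio.sleep(0.01)  # stand-in for one decode step
            tokens.append((f"tok{len(tokens)}", self.version))
        return tokens


async def trainer(weight_box: dict, steps: int):
    for step in range(1, steps + 1):
        await asyncio.sleep(0.03)      # stand-in for one optimizer step
        weight_box["version"] = step   # publish updated weights


async def main():
    weight_box = {"version": 0}        # shared "parameter store"
    gen = ToyGenerator()
    seq, _ = await asyncio.gather(
        gen.generate(max_tokens=12, weight_box=weight_box),
        trainer(weight_box, steps=4),
    )
    # Each token is tagged with the policy version that generated it:
    # later tokens come from fresher weights, keeping data near on-policy.
    print(seq)


asyncio.run(main())
```

In a real system the weight swap would move tensors into the inference engine (for example over a collective broadcast or shared memory) rather than bumping an integer, but the scheduling pattern, generation never stopping for more than a brief weight load, is the same.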