This paper introduces DORA, an asynchronous reinforcement learning system designed to address the rollout bottleneck in LLM post-training caused by long-tailed trajectories and Mixture-of-Experts (MoE) imbalance. DORA employs multi-version streaming rollout, maintaining multiple policy versions concurrently to eliminate pipeline bubbles without violating algorithmic constraints: intra-trajectory policy consistency, data integrity, and bounded staleness. Experiments show DORA achieves a 2-4x speedup over synchronous training and 2-3x higher throughput than existing asynchronous systems, and it was used to train competitive LLMs such as LongCat-Flash-Thinking.
Asynchronous RL for LLMs doesn't have to sacrifice convergence for speed: DORA achieves 2-4x faster training by cleverly managing multiple policy versions during rollout.
Reinforcement learning (RL) has become a critical paradigm for LLM post-training, yet the rollout phase -- accounting for 50--80% of total step time -- is bottlenecked by skewed generation: long-tailed trajectories indispensable for model performance block the entire training pipeline. Asynchronous training offers a natural remedy by overlapping generation with training, but introduces a fundamental tension between efficiency and algorithmic correctness. We identify three constraints that asynchronous training must satisfy to preserve convergence: intra-trajectory policy consistency, data integrity, and bounded staleness. Existing approaches either fail to intrinsically address the long-tailed trajectory problem, which is further exacerbated by the imbalance characteristic of Mixture-of-Experts models, or deviate from the standard RL training formulation, thereby hindering model convergence. Therefore, we propose DORA (Dynamic ORchestration for Asynchronous Rollout), which addresses this challenge through algorithm-system co-design. DORA introduces multi-version streaming rollout, a novel asynchronous paradigm that maintains multiple policy versions concurrently, achieving full bubble elimination without compromising the algorithmic constraints. Experimental results demonstrate that DORA achieves substantial improvements in throughput -- up to 2--3 times higher than state-of-the-art systems on open-source benchmarks -- without compromising convergence. Furthermore, in large-scale industrial applications with tens of thousands of accelerators, DORA accelerates RL training by 2--4 times compared to synchronous training across various scenarios. The resultant open-source models, LongCat-Flash-Thinking, exhibit competitive performance on complex reasoning benchmarks, matching the capabilities of the most advanced LLMs.
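To make the three constraints concrete, here is a minimal toy sketch (not the paper's implementation) of a multi-version rollout queue: each trajectory is pinned to the policy version it started with (intra-trajectory policy consistency), only completed trajectories enter the training queue (data integrity), and trajectories whose version lags too far behind the current policy are discarded (bounded staleness). The class name `MultiVersionRollout` and the `max_staleness` parameter are illustrative assumptions, not identifiers from the paper.

```python
from collections import deque


class MultiVersionRollout:
    """Toy scheduler illustrating the three asynchronous-RL constraints."""

    def __init__(self, max_staleness=2):
        self.current_version = 0
        self.max_staleness = max_staleness
        self.ready = deque()  # completed trajectories awaiting training

    def start_trajectory(self, prompt):
        # Intra-trajectory policy consistency: pin the policy version at the
        # start; every token of this trajectory is generated by that version.
        return {"prompt": prompt, "version": self.current_version, "tokens": []}

    def step(self, traj, token):
        traj["tokens"].append(token)

    def finish_trajectory(self, traj):
        # Data integrity: only complete trajectories enter the training queue.
        self.ready.append(traj)

    def publish_new_policy(self):
        # A training step produced new weights; bump the version and enforce
        # bounded staleness: drop trajectories generated by a policy more than
        # max_staleness versions behind the current one.
        self.current_version += 1
        self.ready = deque(
            t for t in self.ready
            if self.current_version - t["version"] <= self.max_staleness
        )

    def next_batch(self, size):
        return [self.ready.popleft() for _ in range(min(size, len(self.ready)))]
```

Because stale trajectories are filtered only at version boundaries, generation never blocks on the long tail: workers keep streaming tokens under their pinned version while the trainer consumes whatever complete, sufficiently fresh trajectories are available.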