May 1, 2026 · arXiv:2605.00416

Learning while Deploying: Fleet-Scale Reinforcement Learning for Generalist Robot Policies

Yi Wang, Xincheng Li, Pengwei Xie, Pu Yang, Buqing Nie, Yunuo Cai, Qingling Zhang, Qinglin Zhang, Chendi Qu, Jeffrey Wu, Jianheng Song, Jia-Yi Song, Xinlin Ren, Jingshun Huang, Mingjie Pan, Siyuan Feng, Zhi Chen, Jianlan Luo

AI Summary

This paper introduces Learning While Deploying (LWD), a framework for continual post-training of generalist Vision-Language-Action (VLA) policies using fleet-scale offline-to-online reinforcement learning. LWD leverages autonomous rollouts and human interventions collected across a robot fleet to address distribution shifts and long-tail failures. The method combines Distributional Implicit Value Learning (DIVL) with Q-learning via Adjoint Matching (QAM) to stabilize learning from heterogeneous, sparse-reward fleet data, achieving a 95% average success rate across eight real-world manipulation tasks.

Key Contribution

A single generalist robot policy reaches a 95% average success rate across eight real-world manipulation tasks by continually learning from autonomous rollouts and human interventions pooled across a robot fleet, despite distribution shifts and long-tail failures.

Abstract

Generalist robot policies increasingly benefit from large-scale pretraining, but offline data alone is insufficient for robust real-world deployment. Deployed robots encounter distribution shifts, long-tail failures, task variations, and human correction opportunities that fixed demonstration datasets cannot fully capture. We present Learning While Deploying (LWD), a fleet-scale offline-to-online reinforcement learning framework for continual post-training of generalist Vision-Language-Action (VLA) policies. Starting from a pretrained VLA policy, LWD closes the loop between deployment, shared physical experience, policy improvement, and redeployment by using autonomous rollouts and human interventions collected across a robot fleet. To stabilize learning from heterogeneous, sparse-reward fleet data, LWD combines Distributional Implicit Value Learning (DIVL) for robust value estimation with Q-learning via Adjoint Matching (QAM) for policy extraction in flow-based VLA action generators. We validate LWD on a fleet of 16 dual-arm robots across eight real-world manipulation tasks, including semantic grocery restocking and 3-5 minute long-horizon tasks. A single generalist policy improves as fleet experience accumulates, reaching an average success rate of 95%, with the largest gains on long-horizon tasks.
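The deploy, share, improve, redeploy loop described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: every name here (`Transition`, `SharedBuffer`, `rollout`, `improve`, `run_loop`) is hypothetical, the success and intervention probabilities are arbitrary placeholders, and the `improve` step merely stands in for the DIVL value update and QAM policy extraction, whose details the abstract does not give.

```python
import random
from dataclasses import dataclass, field


@dataclass
class Transition:
    robot_id: int   # -1 marks offline (pre-collected) data in this sketch
    source: str     # "offline", "autonomous", or "intervention"
    reward: float   # sparse reward: 1.0 on task success, else 0.0


@dataclass
class SharedBuffer:
    """Fleet-wide experience pool shared by all robots (hypothetical)."""
    items: list = field(default_factory=list)

    def add(self, transitions):
        self.items.extend(transitions)

    def sample(self, n, rng):
        # Uniform sampling with replacement; a real system would likely
        # balance the data sources, which is not modeled here.
        return [rng.choice(self.items) for _ in range(n)]


def rollout(robot_id, rng, p_success=0.5, p_intervene=0.3):
    """Placeholder episode. Both probabilities are arbitrary assumptions."""
    intervened = rng.random() < p_intervene
    success = rng.random() < p_success
    source = "intervention" if intervened else "autonomous"
    return [Transition(robot_id, source, 1.0 if success else 0.0)]


def improve(policy_version, batch):
    """Stand-in for value learning plus policy extraction.

    Only bumps a version counter; no learning happens in this sketch.
    """
    return policy_version + 1


def run_loop(num_robots=16, rounds=3, seed=0):
    rng = random.Random(seed)
    # Seed the pool with offline data (offline-to-online setting).
    buffer = SharedBuffer([Transition(-1, "offline", 1.0) for _ in range(32)])
    policy_version = 0
    for _ in range(rounds):
        # Deploy the current policy on every robot and pool the experience.
        for robot_id in range(num_robots):
            buffer.add(rollout(robot_id, rng))
        # Improve the policy on a mixed batch, then redeploy next round.
        batch = buffer.sample(64, rng)
        policy_version = improve(policy_version, batch)
    return policy_version, buffer
```

Calling `run_loop()` returns the final policy version and the pooled buffer. The point of the sketch is only the data flow: one shared buffer receives offline data, autonomous rollouts, and human interventions from all robots, and each round ends with a single policy update that every robot then uses.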

Multimodal Models · RLHF & Preference Learning · Robotics & Embodied AI

Citation Metrics

Citations: 0

Influential citations: 0

References: 63

Year: 2026

Venue: N/A
