The authors introduce GigaBrain-0.5M*, a vision-language-action (VLA) model that leverages world model-based reinforcement learning to overcome limitations in scene understanding and future anticipation. They employ RAMP (Reinforcement leArning via world Model-conditioned Policy) to fine-tune GigaBrain-0.5, a model pre-trained on 10,000 hours of robotic manipulation data. The resulting GigaBrain-0.5M* demonstrates a 30% performance improvement over the RECAP baseline on complex manipulation tasks and exhibits reliable long-horizon execution in real-world deployments.
Forget end-to-end VLAs: GigaBrain-0.5M* leverages world models and reinforcement learning to achieve a 30% performance boost on complex robotic manipulation tasks, showcasing reliable long-horizon execution.
Vision-language-action (VLA) models that directly predict multi-step action chunks from current observations face inherent limitations due to constrained scene understanding and weak future anticipation capabilities. In contrast, video world models pre-trained on web-scale video corpora exhibit robust spatiotemporal reasoning and accurate future prediction, making them a natural foundation for enhancing VLA learning. We therefore propose \textit{GigaBrain-0.5M$^*$}, a VLA model trained via world model-based reinforcement learning. It is built upon \textit{GigaBrain-0.5}, which is pre-trained on over 10,000 hours of robotic manipulation data and whose intermediate version currently ranks first on the international RoboChallenge benchmark. \textit{GigaBrain-0.5M$^*$} further integrates world model-based reinforcement learning via \textit{RAMP} (Reinforcement leArning via world Model-conditioned Policy) to enable robust cross-task adaptation. Empirical results demonstrate that \textit{RAMP} achieves substantial performance gains over the RECAP baseline, yielding improvements of approximately 30\% on challenging tasks including \texttt{Laundry Folding}, \texttt{Box Packing}, and \texttt{Espresso Preparation}. Critically, \textit{GigaBrain-0.5M$^*$} exhibits reliable long-horizon execution, consistently accomplishing complex manipulation tasks without failure, as validated by real-world deployment videos on our \href{https://gigabrain05m.github.io}{project page}.
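The abstract does not spell out how RAMP conditions the policy on the world model, so the following is only a toy sketch of the general idea: a frozen "world model" rolls the current observation forward, the policy acts on the observation concatenated with those predicted future latents, and a simple score-based (REINFORCE/ES-style) update fine-tunes the policy head against a task reward. All dynamics, dimensions, and the reward here are invented for illustration and are not GigaBrain's actual method.

```python
import numpy as np

rng = np.random.default_rng(0)
OBS_DIM, HORIZON = 3, 4

def world_model(obs):
    # Hypothetical frozen video world model: rolls the observation
    # forward HORIZON steps (a toy linear dynamics stand-in for a
    # pretrained video predictor).
    latents, z = [], obs.copy()
    for _ in range(HORIZON):
        z = 0.9 * z
        latents.append(z)
    return np.concatenate(latents)

def policy_action(obs, theta):
    # World-model-conditioned policy: the action is computed from the
    # current observation concatenated with predicted future latents.
    ctx = np.concatenate([obs, world_model(obs)])
    return ctx @ theta  # linear head for illustration

def reward(obs, action):
    # Toy task reward: act toward a state-dependent target; higher is better.
    return -(action - obs.sum()) ** 2

ctx_dim = OBS_DIM * (HORIZON + 1)
theta = np.zeros(ctx_dim)          # trainable policy head
sigma, lr = 0.1, 0.05              # perturbation scale, step size
obs_batch = rng.normal(size=(32, OBS_DIM))

def avg_reward(theta):
    return np.mean([reward(o, policy_action(o, theta)) for o in obs_batch])

# Antithetic score-based updates: estimate the reward gradient from
# paired perturbations and ascend it (a stand-in for RL fine-tuning).
before = avg_reward(theta)
for _ in range(200):
    eps = rng.normal(size=ctx_dim)
    r_plus = avg_reward(theta + sigma * eps)
    r_minus = avg_reward(theta - sigma * eps)
    theta += lr * (r_plus - r_minus) / (2 * sigma) * eps
after = avg_reward(theta)
print(f"avg reward: {before:.3f} -> {after:.3f}")
```

The design point this sketch tries to capture is that the policy never has to learn dynamics itself: future anticipation comes from the (frozen) world model, and RL only shapes how the policy uses those predictions.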