The paper introduces $\pi_{0.5}$, a vision-language-action (VLA) model designed for improved generalization in real-world robotic manipulation tasks. The model builds upon $\pi_{0}$ and employs co-training on heterogeneous data sources, including multi-robot data, web data, and high-level semantic predictions, to enhance its ability to generalize to unseen environments. Experiments demonstrate that $\pi_{0.5}$ can perform long-horizon, dexterous manipulation skills like cleaning a kitchen or bedroom in novel homes, showcasing the effectiveness of knowledge transfer for real-world robotic systems.
An end-to-end learned robotic system can now clean your kitchen in a completely new house, thanks to a novel co-training approach on diverse data.
In order for robots to be useful, they must perform practically relevant tasks in the real world, outside of the lab. While vision-language-action (VLA) models have demonstrated impressive results for end-to-end robot control, it remains an open question how far such models can generalize in the wild. We describe $\pi_{0.5}$, a new model based on $\pi_{0}$ that uses co-training on heterogeneous tasks to enable broad generalization. $\pi_{0.5}$ uses data from multiple robots, high-level semantic prediction, web data, and other sources to enable broadly generalizable real-world robotic manipulation. Our system uses a combination of co-training and hybrid multi-modal examples that combine image observations, language commands, object detections, semantic subtask prediction, and low-level actions. Our experiments show that this kind of knowledge transfer is essential for effective generalization, and we demonstrate for the first time that an end-to-end learning-enabled robotic system can perform long-horizon and dexterous manipulation skills, such as cleaning a kitchen or bedroom, in entirely new homes.
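To make the co-training recipe concrete, here is a minimal sketch (not the authors' code) of one way to represent the hybrid multi-modal examples the abstract describes, where each training example carries some subset of image observations, a language command, object detections, a semantic subtask label, and low-level actions, and a sampler mixes heterogeneous sources into one training stream. All field names, the `HybridExample` structure, and the mixture weights below are illustrative assumptions, not details from the paper.

```python
# Illustrative sketch of co-training on hybrid multi-modal examples.
# Every name and weight here is hypothetical; the paper does not
# specify this data layout or these mixture proportions.
from dataclasses import dataclass
import random


@dataclass
class HybridExample:
    images: list | None = None      # camera observations (if present)
    command: str | None = None      # natural-language task command
    detections: list | None = None  # object detection labels/boxes
    subtask: str | None = None      # high-level semantic subtask label
    actions: list | None = None     # low-level action chunk (robot data only)
    source: str = "unknown"         # which data source the example came from


def make_cotraining_sampler(datasets: dict[str, list], weights: dict[str, float]):
    """Draw examples from heterogeneous sources in fixed proportions, so
    robot data, web data, and semantic-prediction data all contribute to
    training. The proportions are passed in; values below are made up."""
    names = list(datasets)
    probs = [weights[n] for n in names]

    def sample() -> HybridExample:
        name = random.choices(names, probs)[0]
        return random.choice(datasets[name])

    return sample


# Hypothetical usage: mix three source types into one training stream.
datasets = {
    "multi_robot": [HybridExample(images=[...], command="clean the counter",
                                  actions=[[0.1, -0.2]], source="multi_robot")],
    "web": [HybridExample(images=[...], command="a photo of a sponge",
                          source="web")],
    "semantic_pred": [HybridExample(images=[...], command="tidy the bedroom",
                                    subtask="pick up the pillow",
                                    source="semantic_pred")],
}
sampler = make_cotraining_sampler(
    datasets, {"multi_robot": 0.5, "web": 0.3, "semantic_pred": 0.2})
batch = [sampler() for _ in range(4)]
```

The point of the sketch is that examples from different sources supervise different output heads (e.g., web data has no actions, robot data may lack subtask labels), so a single model can absorb all of them; how $\pi_{0.5}$ actually weights and formats its sources is described in the paper itself.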