The paper introduces GraspVLA, a Vision-Language-Action model for robotic grasping pre-trained on SynGrasp-1B, a newly created billion-frame synthetic dataset with photorealistic rendering and domain randomization. GraspVLA uses a Chain-of-Thought process integrating autoregressive perception and flow-matching-based action generation, enabling joint training on synthetic action data and Internet semantics data. Experiments demonstrate GraspVLA's strong zero-shot generalization and few-shot adaptability in both simulated and real-world grasping tasks, showing the potential of large-scale synthetic data for training embodied foundation models.
Forget painstakingly labeled real-world data: GraspVLA shows you can train a surprisingly capable grasping foundation model on a billion frames of purely synthetic action data.
Embodied foundation models are gaining increasing attention for their zero-shot generalization, scalability, and adaptability to new tasks through few-shot post-training. However, existing models rely heavily on real-world data, which is costly and labor-intensive to collect. Synthetic data offers a cost-effective alternative, yet its potential remains largely underexplored. To bridge this gap, we explore the feasibility of training Vision-Language-Action (VLA) models entirely on large-scale synthetic action data. We curate SynGrasp-1B, a billion-frame robotic grasping dataset generated in simulation with photorealistic rendering and extensive domain randomization. Building on this, we present GraspVLA, a VLA model pretrained on large-scale synthetic action data to serve as a foundation model for grasping tasks. GraspVLA integrates autoregressive perception tasks and flow-matching-based action generation into a unified Chain-of-Thought process, enabling joint training on synthetic action data and Internet semantics data. This design helps mitigate the sim-to-real gap and facilitates transferring learned actions to the broader range of objects covered by Internet data, achieving open-vocabulary generalization in grasping. Extensive evaluations across real-world and simulation benchmarks demonstrate GraspVLA's advanced zero-shot generalizability and few-shot adaptability to specific human preferences. We will release the SynGrasp-1B dataset and pretrained weights to benefit the community.
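To make the training recipe in the abstract concrete, here is a minimal, hypothetical PyTorch sketch of flow-matching action generation combined with a masked joint objective: synthetic samples supervise both perception and actions, while Internet samples supervise perception only. Every name here (`FlowActionHead`, `joint_loss`, the context vector `ctx` standing in for VLM features and autoregressively decoded perception tokens) is an assumption for illustration, not the released GraspVLA API.

```python
import torch
import torch.nn as nn

class FlowActionHead(nn.Module):
    """Predicts the velocity field v(a_t, t | ctx) used by flow matching."""
    def __init__(self, action_dim: int, ctx_dim: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(action_dim + 1 + ctx_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, action_dim),
        )

    def forward(self, a_t, t, ctx):
        # a_t: (B, action_dim), t: (B, 1), ctx: (B, ctx_dim)
        return self.net(torch.cat([a_t, t, ctx], dim=-1))

def flow_matching_loss(head, actions, ctx):
    """Conditional flow matching: regress the straight-line velocity
    from a noise sample a0 toward the ground-truth action a1."""
    a1 = actions
    a0 = torch.randn_like(a1)          # noise endpoint
    t = torch.rand(a1.shape[0], 1)     # interpolation time in [0, 1]
    a_t = (1 - t) * a0 + t * a1        # linear interpolant
    return ((head(a_t, t, ctx) - (a1 - a0)) ** 2).mean()

def joint_loss(perc_logits, perc_labels, head, actions, ctx, has_action):
    """Joint objective: perception loss (cross-entropy) on every sample,
    action loss (flow matching) only where ground-truth actions exist,
    i.e. the synthetic subset of the batch."""
    ce = nn.functional.cross_entropy(perc_logits, perc_labels)
    if has_action.any():
        fm = flow_matching_loss(head, actions[has_action], ctx[has_action])
    else:
        fm = actions.new_zeros(())
    return ce + fm

# Toy usage: first half of the batch is "synthetic" (has actions),
# second half is "Internet" (perception labels only).
B, A, C = 8, 7, 32
head = FlowActionHead(A, C)
loss = joint_loss(
    perc_logits=torch.randn(B, 10, requires_grad=True),
    perc_labels=torch.randint(0, 10, (B,)),
    head=head,
    actions=torch.randn(B, A),
    ctx=torch.randn(B, C),
    has_action=torch.tensor([True] * 4 + [False] * 4),
)
loss.backward()
```

At inference, the learned velocity field would be integrated from noise to an action (for example, with a few Euler steps), conditioned on the perception tokens decoded first in the Chain-of-Thought; this toy `ctx` vector merely stands in for that conditioning.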