JoyAI-RA is introduced as a vision-language-action (VLA) foundation model for robotic manipulation, addressing limitations in data diversity and cross-embodiment generalization. The model is trained using a multi-source multi-level pretraining framework that integrates web data, egocentric human videos, simulation trajectories, and real-robot data, with explicit action-space unification. JoyAI-RA demonstrates superior performance over existing methods in both simulated and real-world robotic tasks, particularly those requiring generalization.
Bridging the gap between human manipulation and robotic control, JoyAI-RA enables cross-embodiment behavior learning through multi-source pretraining.
Robotic autonomy in open-world environments is fundamentally limited by insufficient data diversity and poor cross-embodiment generalization. Existing robotic datasets are often limited in scale and task coverage, while substantial differences across robot embodiments impede effective transfer of behavioral knowledge. To address these challenges, we propose JoyAI-RA, a vision-language-action (VLA) embodied foundation model tailored for generalizable robotic manipulation. JoyAI-RA adopts a multi-source, multi-level pretraining framework that integrates web data, large-scale egocentric human manipulation videos, simulation-generated trajectories, and real-robot data. By training on this heterogeneous multi-source data with explicit action-space unification, JoyAI-RA effectively bridges embodiment gaps, particularly between human manipulation and robotic control, thereby enhancing cross-embodiment behavior learning. JoyAI-RA outperforms state-of-the-art methods on both simulation and real-world benchmarks, especially on diverse tasks that demand generalization.
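To make the idea of explicit action-space unification more concrete, the minimal sketch below illustrates one plausible way heterogeneous sources (egocentric human videos, simulation, real robots) could be mapped into a shared action representation before pretraining. This is an assumption for illustration only, not the paper's actual interface: the 7-D unified action (end-effector delta pose plus gripper state) and the helper functions are hypothetical.

```python
import numpy as np

# Hypothetical unified action space: 6-DoF end-effector delta pose + 1-D gripper state.
UNIFIED_ACTION_DIM = 7

def hand_motion_to_unified(wrist_pose_t, wrist_pose_t1, pinch_dist):
    """Map egocentric human hand motion to the unified action space (illustrative).

    wrist_pose_t / wrist_pose_t1: (x, y, z, roll, pitch, yaw) wrist poses at
    consecutive frames; pinch_dist: thumb-index distance in meters, used as a
    proxy for gripper closure.
    """
    delta = np.asarray(wrist_pose_t1, dtype=float) - np.asarray(wrist_pose_t, dtype=float)
    gripper = np.clip(1.0 - pinch_dist / 0.08, 0.0, 1.0)  # ~closed when fingers meet
    return np.concatenate([delta, [gripper]])

def robot_action_to_unified(ee_delta_pose, gripper_open_fraction):
    """Map simulation or real-robot actions, already expressed as end-effector
    deltas, into the same unified representation."""
    return np.concatenate([np.asarray(ee_delta_pose, dtype=float),
                           [1.0 - gripper_open_fraction]])

# During pretraining, every trajectory, regardless of source, would then yield
# (observation, language instruction, unified_action) tuples, so a single VLA
# policy head can be trained across embodiments.
```

Under this kind of unification, human videos, simulated rollouts, and real-robot logs all supervise the same action head, which is what allows behavior knowledge to transfer across embodiments despite their differing native control interfaces.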