Tsinghua AI | Tencent AI | Jun 12, 2026 | arXiv:2606.14409

Hy-Embodied-0.5-VLA: From Vision-Language-Action Models to a Real-World Robot Learning Stack

He Zhang, Lingzhu Xiang, Haitao Lin, Zeyu Huang, Minghui Wang, Dingyan Zhong, Yubo Dong, Yihao Wu, Yongming Rao, Dongsheng Zhang, Wanjia He, Ling Chen, Kai Huang, Jiahao Chen, Sichang Su, Xumin Yu, Ziyi Wang, Chengwei Zhu, Xiao Teng, Yuchun Guo, Yufeng Zhang, Yuandong Liu, Rui Wang, Zisheng Lu, Han Hu, Zhengyou Zhang

AI Summary

The paper introduces Hy-Embodied-0.5-VLA (HyVLA-0.5), an integrated system that encompasses the entire robot learning pipeline, from data collection to real-world deployment. This comprehensive approach is designed to enhance the capabilities of vision-language-action models in practical robotic applications. Key results demonstrate significant improvements in performance and adaptability when transitioning from simulated environments to real-world tasks.

Key Contribution

A fully integrated robot learning stack that bridges the gap from simulation to real-world deployment, enhancing the efficacy of vision-language-action models.

Abstract

In this report, we present Hy-Embodied-0.5-VLA, abbreviated as HyVLA-0.5, an end-to-end system that spans the full robot learning stack: data collection, model design, continued pre-training and supervised fine-tuning, RL post-training, and real-world deployment. Each component serves a distinct role in this stack.

Multimodal Models | Robotics & Embodied AI

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
