May 28, 2026arXiv:2605.30280

Qwen-VLA: Unifying Vision-Language-Action Modeling across Tasks, Environments, and Robot Embodiments

Qiuyue Wang, Mingsheng Li, Jian Guan, Jinhui Ye, Ji-lu Ye, Sicheng Xie, Yitao Liu, Junhao Chen, Zhixuan Liang, Jie Zhang, Xintong Hu, Xuhong Huang, Pei Lin, Junyang Lin, Dayiheng Liu, Shuai Bai, Jingren Zhou, Jiazhao Zhang, Haoqi Yuan, Gengze Zhou, G. Zhou, Hang Yin, Ye Wang, Yebin Wang, Yiyang Huang, Zixing Lei, Wujian Peng, Delin Chen, Yingming Zheng, Jingyang Fan, Xianwei Zhuang, Xin Zhou, Xinyu Zhou, Haoyang Li, Anzhe Chen, An-Jen Chen, Tong Zhang, Xuejing Liu, Yuchong Sun, Rui Chen, Ruizhe Chen, Zhaohai Li, Chenxu Lu, Chenxu Lü, Zhibo Yang, Tao Yu, Xiong-hui Chen

AI Summary

Qwen-VLA is introduced as a unified embodied foundation model extending Qwen's vision-language capabilities to continuous action and trajectory generation using a DiT-based action decoder. The model is trained on a large-scale dataset comprising diverse sources like robotics trajectories, egocentric demonstrations, and VLN data, using embodiment-aware prompt conditioning to handle multiple robot platforms. Experiments demonstrate strong multi-task performance and generalization across manipulation, navigation, and trajectory prediction benchmarks, showcasing its ability to transfer visual grounding and spatial reasoning across different robot embodiments and environments.

Key Contribution

One model to control them all: Qwen-VLA achieves impressive zero-shot generalization across diverse robotic tasks and embodiments by unifying vision-language-action modeling.

Abstract

Embodied intelligence is often studied through specialized models for individual tasks such as manipulation or navigation, resulting in fragmented capabilities and limited generalization across tasks, environments, and robot embodiments. In this work, we study whether heterogeneous embodied decision-making problems can be unified within a single vision-language-action model. We present Qwen-VLA, a unified embodied foundation model that extends Qwen's vision-language modeling stack from perception, understanding, and reasoning to continuous action and trajectory generation through a DiT-based action decoder. Qwen-VLA is trained with a large-scale joint pretraining recipe over diverse data sources, including robotics manipulation trajectories, human egocentric demonstrations, synthetic simulation data, vision-and-language navigation data, trajectory-centric supervision, and auxiliary vision-language data. To support multiple robot platforms, we introduce embodiment-aware prompt conditioning, where robot-specific textual descriptions specify the current embodiment and control convention. We further cast manipulation, navigation, and trajectory prediction into a unified action-and-trajectory prediction framework, enabling transferable visual grounding, spatial reasoning, and continuous action generation across robot morphologies, task families, and environments. Experiments on manipulation, navigation, and trajectory-centric benchmarks show consistent multi-task performance and out-of-distribution generalization under variations in scene layout, background, lighting, object configuration, and robot embodiment. Qwen-VLA-Instruct achieves 97.9% on LIBERO, 73.7% on Simpler-WidowX, 86.1%/87.2% on RoboTwin-Easy/Hard, 69.0% OSR on R2R, 59.6% SR on RxR, 76.9% average OOD success in real-world ALOHA experiments, and 26.6% zero-shot success on DOMINO dynamic manipulation.

Multimodal Models Robotics & Embodied AI Tool Use & Agents

Citation Metrics

Citations0

Influential citations0

References90

Year2026

VenueN/A

Related Papers

Finding related papers...

Search

Qwen-VLA: Unifying Vision-Language-Action Modeling across Tasks, Environments, and Robot Embodiments

Related Papers