Tsinghua AI, Ant Group, CAS, HKUST, Pengcheng Laboratory, PKU
May 31, 2026
arXiv:2606.01241

OneVLA: A Unified Framework for Embodied Tasks

Lingfeng Zhang, Xiaoshuai Hao, Yingbo Tang, Lei Zhou, Shuyi Zhang, Jinkun Liu, Hongsheng Li, Chenhao Zhang, Qiang Zhang, Hangjun Ye, Xiaojun Liang, Long Chen, Wenbo Ding

AI Summary

This paper introduces OneVLA, a unified framework that integrates navigation and manipulation tasks for embodied intelligence, addressing the limitations of existing Vision-Language-Action (VLA) models that are typically specialized for either task. By employing a novel unified action head and a multi-stage progressive training strategy that leverages curated data and Chain-of-Thought fine-tuning, OneVLA facilitates significant positive transfer and mutual reinforcement between navigation and manipulation. Experimental results demonstrate that OneVLA achieves state-of-the-art performance in both simulated and real-world environments, outperforming specialized and existing cross-task models, thereby advancing the development of general-purpose robotic agents.

Key Contribution

OneVLA unifies navigation and manipulation in a single framework, using one action head to generate both kinds of actions rather than task-specific variants.

Abstract

Navigation and manipulation are fundamental capabilities of embodied intelligence, enabling robots to interpret natural language commands and interact physically with their surroundings. However, current Vision-Language-Action (VLA) models remain constrained by task-specific architectures, specializing in either navigation or manipulation, which hinders the development of general-purpose robotic agents. To bridge this gap, we introduce OneVLA, a unified architecture that integrates these distinct tasks into a single, cohesive framework. Specifically, we design a unified action head capable of generating both navigation and manipulation actions without requiring task-specific variants. Furthermore, we propose a multi-stage progressive training strategy, incorporating curated data construction and Chain-of-Thought (CoT) fine-tuning, that facilitates strong positive transfer and mutual reinforcement between the two domains. Extensive experiments in both simulated and real-world environments demonstrate that OneVLA achieves state-of-the-art performance, significantly outperforming both specialized single-task and existing cross-task models. By unifying these core capabilities, OneVLA paves the way for truly general-purpose robotic systems. The model and source code will be publicly released.

Multimodal Models, Robotics & Embodied AI

Citation Metrics

Citations: 0
Influential citations: 0
References: 0
Year: 2026
Venue: N/A
