This paper introduces GEM, a Generative-supervised Embodied vision-language Model that aims to improve embodied intelligence by incorporating depth map generation into VLM pre-training. The authors jointly train the VLM with this generative objective on the new GEM-4M dataset, which pairs grounding, reasoning, and planning data with high-quality depth supervision. Results show that GEM achieves state-of-the-art performance across diverse embodied benchmarks, and its action model, GEM-VLA, exhibits superior task execution in both simulation and real-world environments.
Teaching VLMs to predict depth maps during pre-training unlocks surprisingly large gains in real-world robot task execution.
Embodied Vision-Language Models (VLMs) have demonstrated impressive performance and generalization in robotics, particularly within Vision-Language-Action frameworks. However, a significant gap remains between the high-level semantic focus of standard text-guided pre-training paradigms and the low-level spatial and physical knowledge critical for execution in embodied environments. In this paper, we introduce GEM, a Generative-supervised Embodied vision-language Model designed to bridge this divide. We propose integrating a depth map generation task directly into the VLM pre-training phase. By training this generative objective jointly with the main model, we observe substantial improvements in embodied intelligence, significantly enhancing both semantic understanding and physical operation capabilities. To support this paradigm, we curate and release GEM-4M, a comprehensive large-scale dataset featuring a mixture of grounding, reasoning, and planning data paired with high-quality depth supervision. Extensive experiments demonstrate that GEM achieves state-of-the-art results across diverse embodied benchmarks. Furthermore, our deployed action model, GEM-VLA, exhibits vastly superior task execution abilities in both simulation environments and real-world evaluations. Code, models, and datasets are available at https://zhaorw02.github.io/GEM/
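The abstract does not spell out how the depth generation objective is combined with the main training signal. As a rough illustration only, one common way to train two objectives jointly is a weighted sum of losses. The sketch below assumes a token-level cross-entropy for the text side, a mean absolute error for the depth side, and a hypothetical `depth_weight` coefficient; none of these choices, names, or values come from the paper.

```python
import math

def cross_entropy(logits, target_ids):
    """Mean token-level cross-entropy; `logits` holds one list of raw scores per token."""
    total = 0.0
    for row, target in zip(logits, target_ids):
        peak = max(row)  # subtract the max before exponentiating, for numerical stability
        log_norm = peak + math.log(sum(math.exp(x - peak) for x in row))
        total += log_norm - row[target]
    return total / len(target_ids)

def depth_l1(pred_depth, true_depth):
    """Mean absolute error between two depth maps given as nested lists (rows of pixels)."""
    diffs = [
        abs(p - t)
        for pred_row, true_row in zip(pred_depth, true_depth)
        for p, t in zip(pred_row, true_row)
    ]
    return sum(diffs) / len(diffs)

def joint_loss(logits, target_ids, pred_depth, true_depth, depth_weight=0.5):
    """Assumed joint objective: text loss plus a weighted depth loss.

    `depth_weight` is a hypothetical hyperparameter, not a value from the paper.
    """
    text_term = cross_entropy(logits, target_ids)
    depth_term = depth_l1(pred_depth, true_depth)
    return text_term + depth_weight * depth_term

# Toy example: one token over a two-word vocabulary, and a 1x2 depth map.
loss = joint_loss(
    logits=[[0.0, 0.0]],
    target_ids=[0],
    pred_depth=[[1.0, 2.0]],
    true_depth=[[1.0, 4.0]],
)
```

In this toy call the text term is ln 2 (uniform logits over two classes) and the depth term is 1.0, so the depth supervision contributes half of its error to the total. The actual GEM objective may differ in form and weighting; the released code at the project URL is the authoritative source.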