The paper introduces Green-VLA, a staged Vision-Language-Action framework for generalist robots that operate across diverse embodiments. The model is trained with a five-stage curriculum on 3,000 hours of demonstrations and uses a unified, embodiment-aware action interface. Experiments on simulated and real-world robots show that Green-VLA generalizes strongly, with reinforcement-learning alignment further improving success rate, robustness, and long-horizon efficiency.
A single VLA policy can now control diverse robots—humanoids, mobile manipulators, and fixed-base arms—thanks to a unified, embodiment-aware action interface and staged training.
We introduce Green-VLA, a staged Vision-Language-Action (VLA) framework for real-world deployment on the Green humanoid robot while maintaining generalization across diverse embodiments. Green-VLA follows a five-stage curriculum: (L0) foundational VLMs, (L1) multimodal grounding, (R0) multi-embodiment pretraining, (R1) embodiment-specific adaptation, and (R2) reinforcement-learning (RL) policy alignment. We couple a scalable data-processing pipeline (3,000 hours of demonstrations) with temporal alignment and quality filtering, and use a unified, embodiment-aware action interface that enables a single policy to control humanoids, mobile manipulators, and fixed-base arms. At inference, the VLA controller is enhanced with episode-progress prediction, out-of-distribution detection, and joint-prediction-based guidance to improve safety and precise target selection. Experiments on Simpler BRIDGE WidowX and CALVIN ABC-D, as well as real-robot evaluations, show strong generalization and performance gains from RL alignment in success rate, robustness, and long-horizon efficiency.
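To make the idea of a unified, embodiment-aware action interface concrete, the sketch below shows one plausible way a single fixed-width policy output could be decoded into per-robot commands. This is a minimal illustration under assumed names (EmbodimentSpec, UnifiedActionSpace, decode); it is not Green-VLA's actual API or the paper's implementation.

```python
from dataclasses import dataclass
from typing import Dict
import numpy as np

# Hypothetical sketch of an embodiment-aware action interface.
# All class and field names here are illustrative assumptions.

@dataclass
class EmbodimentSpec:
    name: str                 # e.g. "humanoid", "mobile_manipulator", "fixed_arm"
    dof: int                  # number of controllable action dimensions
    action_low: np.ndarray    # per-dimension lower bounds
    action_high: np.ndarray   # per-dimension upper bounds


class UnifiedActionSpace:
    """Maps a shared, fixed-size policy output to per-embodiment commands."""

    def __init__(self, specs: Dict[str, EmbodimentSpec], shared_dim: int = 32):
        self.specs = specs
        self.shared_dim = shared_dim  # width of the policy's shared action head

    def decode(self, embodiment: str, shared_action: np.ndarray) -> np.ndarray:
        """Slice the shared action vector and rescale it to the robot's limits."""
        spec = self.specs[embodiment]
        # Use only the dimensions this embodiment actually controls.
        raw = np.clip(shared_action[: spec.dof], -1.0, 1.0)
        # Rescale from [-1, 1] to the embodiment's command limits.
        return spec.action_low + (raw + 1.0) * 0.5 * (spec.action_high - spec.action_low)


# Usage: one policy output vector, decoded differently per robot.
specs = {
    "fixed_arm": EmbodimentSpec("fixed_arm", 7, -np.ones(7), np.ones(7)),
    "humanoid": EmbodimentSpec("humanoid", 25, -np.pi * np.ones(25), np.pi * np.ones(25)),
}
space = UnifiedActionSpace(specs)
policy_output = np.random.uniform(-1, 1, size=32)
arm_cmd = space.decode("fixed_arm", policy_output)
humanoid_cmd = space.decode("humanoid", policy_output)
```

The design choice illustrated is that the policy emits one shared action vector regardless of robot, and a lightweight per-embodiment decoder handles dimensionality and scaling; how Green-VLA actually parameterizes this interface is described in the paper itself.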