Mar 15, 2026 · arXiv:2603.14523

VLA-Thinker: Boosting Vision-Language-Action Models through Thinking-with-Image Reasoning

Chaoyang Wang, Wenrui Bao, Sicheng Gao, Bingxin Xu, Yu Tian, Yogesh S. Rawat, Yunhao Ge, Yuzhang Shang

AI Summary

The paper introduces VLA-Thinker, a framework that enhances Vision-Language-Action models by enabling them to dynamically reason with visual inputs during task execution. This is achieved through a two-stage training process: first, Supervised Fine-Tuning (SFT) with visual Chain-of-Thought data to initiate structured reasoning and tool use, followed by GRPO-based reinforcement learning to align reasoning-action trajectories with task success. Experiments on LIBERO and RoboTwin 2.0 show that VLA-Thinker significantly improves manipulation performance, achieving a 97.5% success rate on LIBERO.

Key Contribution

VLA-Thinker lets robots actively "look again" at their environment during long-horizon tasks, which yields strong gains in complex manipulation, including a 97.5% success rate on LIBERO.

Abstract

Vision-Language-Action (VLA) models have shown promising capabilities for embodied intelligence, but most existing approaches rely on text-based chain-of-thought reasoning in which visual inputs are treated as static context. This limits the model's ability to actively revisit the environment and resolve ambiguities during long-horizon tasks. We propose VLA-Thinker, a thinking-with-image reasoning framework that models perception as a dynamically invocable reasoning action. To train such a system, we introduce a two-stage training pipeline consisting of (1) an SFT cold-start phase with curated visual Chain-of-Thought data to activate structured reasoning and tool-use behaviors, and (2) GRPO-based reinforcement learning to align complete reasoning-action trajectories with task-level success. Extensive experiments on the LIBERO and RoboTwin 2.0 benchmarks demonstrate that VLA-Thinker significantly improves manipulation performance, achieving a 97.5% success rate on LIBERO and strong gains across long-horizon robotic tasks. Project and code: https://cywang735.github.io/VLA-Thinker/
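The GRPO stage mentioned in the abstract scores whole reasoning-action trajectories by task-level success. As a rough illustration of the general GRPO idea only (not the authors' implementation), the sketch below normalizes each rollout's reward against the other rollouts sampled for the same task. The function name `group_relative_advantages`, the binary success rewards, the use of the population standard deviation, and the `eps` value are all assumptions made for this example.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Return one advantage per rollout, normalized within its group.

    Hypothetical helper: generic GRPO-style normalization,
    (reward - group mean) / (group std + eps). The abstract does not
    specify the reward design or normalization details.
    """
    if not rewards:
        return []
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std over the group
    return [(r - mu) / (sigma + eps) for r in rewards]


# Assumed setup: four rollouts of the same task, reward 1.0 on success.
rollout_success = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rollout_success)
```

Successful rollouts receive positive advantages and failed ones negative, so the policy update favors trajectories that complete the task. When every rollout in a group gets the same reward, all advantages are zero and that group contributes no learning signal.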

Multimodal Models · Robotics & Embodied AI · Tool Use & Agents

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
