Tsinghua AI | Apr 6, 2026 | arXiv:2604.04502

Veo-Act: How Far Can Frontier Video Models Advance Generalizable Robot Manipulation?

Zhongru Zhang, Cheng‐Chuan Yang, Chenghan Yang, Qingzhou Lu, Jianke Zhang, Yucheng Hu, Jianyu Chen

AI Summary

This paper explores how far the Veo-3 video generation model can support generalizable robot manipulation: Veo-3 predicts future image sequences, and an inverse dynamics model (IDM) recovers the corresponding robot actions. The authors find that Veo-3+IDM generates approximately correct task-level trajectories, but its low-level control accuracy is insufficient to solve most tasks reliably. To address this, they propose Veo-Act, a hierarchical framework that uses Veo-3 for high-level motion planning and a vision-language-action (VLA) policy for low-level execution, significantly improving the policy's instruction-following performance.

Key Contribution

Frontier video models like Veo-3 can generate surprisingly good task-level plans for robot manipulation, but still need help with the fine details.

Abstract

Video generation models have advanced rapidly and are beginning to show a strong understanding of physical dynamics. In this paper, we investigate how far an advanced video generation model such as Veo-3 can support generalizable robotic manipulation. We first study a zero-shot approach in which Veo-3 predicts future image sequences from current robot observations, while an inverse dynamics model (IDM) recovers the corresponding robot actions. The IDM is trained solely on random-play data, requiring neither human supervision nor expert demonstrations. The key intuition is that, if a video model can generate physically plausible future motions in image space, an IDM can translate those visual trajectories into executable robot actions. We evaluate this "Veo-3+IDM" approach in both simulation and the real world using a high-dimensional dexterous hand. We find that, owing to the strong generalization capability of frontier video models, Veo-3+IDM can consistently generate approximately correct task-level trajectories. However, its low-level control accuracy remains insufficient to solve most tasks reliably. Motivated by this observation, we develop a hierarchical framework, Veo-Act, which uses Veo-3 as a high-level motion planner and a vision-language-action (VLA) policy as the low-level executor, significantly improving the instruction-following performance of a state-of-the-art VLA policy. Overall, our results suggest that, as video generation models continue to improve, they can be a valuable component for generalizable robot learning.
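The video-then-IDM idea described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: it assumes low-dimensional vectors stand in for image frames, a hypothetical additive dynamics function `simulate_step` stands in for the robot, a linear least-squares model stands in for the learned IDM, and a hand-written straight-line trajectory stands in for the frames a video model such as Veo-3 would generate. All function names are hypothetical.

```python
import numpy as np


def simulate_step(state, action):
    # Hypothetical toy dynamics: the "frame" is the state itself,
    # and each action shifts it additively.
    return state + action


def collect_random_play(num_steps, dim, rng):
    # Random-play data: uniformly random actions, no expert demonstrations.
    states = np.zeros((num_steps + 1, dim))
    actions = rng.uniform(-1.0, 1.0, size=(num_steps, dim))
    for t in range(num_steps):
        states[t + 1] = simulate_step(states[t], actions[t])
    return states, actions


def fit_linear_idm(states, actions):
    # Inverse dynamics: predict action_t from (frame_t, frame_{t+1}).
    # A linear least-squares fit is a stand-in for a learned IDM.
    features = np.hstack([states[:-1], states[1:]])
    weights, *_ = np.linalg.lstsq(features, actions, rcond=None)
    return weights


def recover_actions(weights, frames):
    # Apply the IDM to every consecutive pair of planned frames.
    features = np.hstack([frames[:-1], frames[1:]])
    return features @ weights


rng = np.random.default_rng(0)
states, actions = collect_random_play(num_steps=500, dim=3, rng=rng)
idm_weights = fit_linear_idm(states, actions)

# Stand-in for video-model output: a straight-line "frame" trajectory
# from the current observation to a goal configuration.
planned_frames = np.linspace(np.zeros(3), np.array([1.0, -0.5, 0.25]), num=6)
recovered = recover_actions(idm_weights, planned_frames)

# Replay the recovered actions and check that the plan is tracked.
state = planned_frames[0].copy()
for action in recovered:
    state = simulate_step(state, action)

print("final state:", state)
print("planned goal:", planned_frames[-1])
```

In this toy setting the dynamics are exactly invertible, so the recovered actions reproduce the planned trajectory. With real images and a high-dimensional dexterous hand, the abstract reports that this step is where low-level control accuracy falls short, which motivates the hierarchical Veo-Act design.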

Computer Vision | Robotics & Embodied AI | World Models & Planning

Citation Metrics

Citations: 0

Influential citations: 0

References: 48

Year: 2026

Venue: N/A
