Search papers, labs, and topics across Lattice.
This paper explores the potential of using the Veo-3 video generation model for generalizable robot manipulation by predicting future image sequences and using an inverse dynamics model (IDM) to recover robot actions. They find that while Veo-3+IDM generates reasonable task-level trajectories, low-level control accuracy is lacking. To address this, they propose Veo-Act, a hierarchical framework using Veo-3 for high-level planning and a vision-language-action (VLA) policy for low-level execution, significantly improving instruction-following performance.
Frontier video models like Veo-3 can generate surprisingly good task-level plans for robot manipulation, but still need help with the fine details.
Video generation models have advanced rapidly and are beginning to show a strong understanding of physical dynamics. In this paper, we investigate how far an advanced video generation model such as Veo-3 can support generalizable robotic manipulation. We first study a zero-shot approach in which Veo-3 predicts future image sequences from current robot observations, while an inverse dynamics model IDM recovers the corresponding robot actions. The IDM is trained solely on random-play data, requiring neither human supervision nor expert demonstrations. The key intuition is that, if a video model can generate physically plausible future motions in image space, an IDM can translate those visual trajectories into executable robot actions. We evaluate this"Veo-3+IDM"approach in both simulation and the real world using a high-dimensional dexterous hand. We find that, owing to the strong generalization capability of frontier video models, Veo-3+IDM can consistently generate approximately correct task-level trajectories. However, its low-level control accuracy remains insufficient to solve most tasks reliably. Motivated by this observation, we develop a hierarchical framework, Veo-Act, which uses Veo-3 as a high-level motion planner and a VLA policy as the low-level executor, significantly improving the instruction-following performance of a state-of-the-art vision-language-action policy. Overall, our results suggest that, as video generation models continue to improve, video models can be a valuable component for generalizable robot learning.