CMU ML | Feb 23, 2026 | arXiv:2602.20119

NovaPlan: Zero-Shot Long-Horizon Manipulation via Closed-Loop Video Language Planning

Jiahui Fu, Junyu Nan, Lingfeng Sun, Hongyu Li, Jianing Qian, Jennifer L. Barry, Kris Kitani, George Konidaris

AI Summary

NovaPlan is a hierarchical framework for zero-shot long-horizon manipulation that combines closed-loop VLM and video planning with geometrically grounded robot execution. A VLM decomposes tasks into sub-goals, monitors execution, and re-plans when a step fails. Object keypoints and human hand poses extracted from generated videos serve as kinematic priors that guide low-level robot actions. Experiments on three long-horizon tasks and the Functional Manipulation Benchmark (FMB) show that NovaPlan can perform complex assembly and recover from errors without any prior demonstrations or training.
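The plan, monitor, and re-plan loop described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `run_closed_loop`, the `plan`, `execute`, and `succeeded` callables, and the `max_replans` budget are all hypothetical stand-ins for the VLM planner, the robot controller, and the VLM execution monitor.

```python
from typing import Callable, List, Tuple


def run_closed_loop(
    task: str,
    plan: Callable[[str, List[str]], List[str]],  # hypothetical planner: remaining sub-goals given completed ones
    execute: Callable[[str], None],               # hypothetical low-level executor
    succeeded: Callable[[str], bool],             # hypothetical monitor checking the last sub-goal
    max_replans: int = 3,                         # assumed budget, not from the paper
) -> Tuple[bool, List[str]]:
    """Execute sub-goals one at a time, re-planning after a failed step."""
    completed: List[str] = []
    subgoals = plan(task, completed)
    replans = 0
    while subgoals:
        goal = subgoals[0]
        execute(goal)
        if succeeded(goal):
            # Step verified: move it from the pending list to the completed list.
            completed.append(subgoals.pop(0))
        elif replans < max_replans:
            # Step failed: ask the planner for a fresh plan from the current state.
            replans += 1
            subgoals = plan(task, completed)
        else:
            return False, completed
    return True, completed
```

In this sketch a failed step triggers a fresh call to the planner rather than a blind retry, so the new plan may differ from the original one if the scene has changed.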

Key Contribution

NovaPlan lets robots perform complex assembly tasks and recover from execution errors without any prior demonstrations or training, by combining closed-loop VLM planning with kinematic priors (object keypoints and human hand poses) extracted from generated videos.

Abstract

Solving long-horizon tasks requires robots to integrate high-level semantic reasoning with low-level physical interaction. While vision-language models (VLMs) and video generation models can decompose tasks and imagine outcomes, they often lack the physical grounding necessary for real-world execution. We introduce NovaPlan, a hierarchical framework that unifies closed-loop VLM and video planning with geometrically grounded robot execution for zero-shot long-horizon manipulation. At the high level, a VLM planner decomposes tasks into sub-goals and monitors robot execution in a closed loop, enabling the system to recover from single-step failures through autonomous re-planning. To compute low-level robot actions, we extract and utilize both task-relevant object keypoints and human hand poses as kinematic priors from the generated videos, and employ a switching mechanism to choose the better one as a reference for robot actions, maintaining stable execution even under heavy occlusion or depth inaccuracy. We demonstrate the effectiveness of NovaPlan on three long-horizon tasks and the Functional Manipulation Benchmark (FMB). Our results show that NovaPlan can perform complex assembly tasks and exhibit dexterous error recovery behaviors without any prior demonstrations or training. Project page: https://nova-plan.github.io/
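The switching mechanism between the two kinematic priors can be pictured with a small sketch. The abstract does not state the selection criterion, so everything here is an assumption made for illustration: the `KinematicPrior` fields, the `reliability` score, and the tie-breaking rule are hypothetical and only reflect the two failure modes the abstract names (occlusion and depth inaccuracy).

```python
from dataclasses import dataclass


@dataclass
class KinematicPrior:
    name: str                    # e.g. "object_keypoints" or "hand_pose"
    visible_fraction: float      # hypothetical: share of tracked points not occluded, in [0, 1]
    depth_valid_fraction: float  # hypothetical: share of points with usable depth, in [0, 1]


def reliability(prior: KinematicPrior) -> float:
    """Hypothetical score: a prior is only as reliable as its weaker signal."""
    return min(prior.visible_fraction, prior.depth_valid_fraction)


def select_reference(object_prior: KinematicPrior, hand_prior: KinematicPrior) -> KinematicPrior:
    """Pick the prior with the higher score; ties go to the object keypoints (an arbitrary choice)."""
    if reliability(object_prior) >= reliability(hand_prior):
        return object_prior
    return hand_prior
```

For example, if the object keypoints are mostly occluded (`visible_fraction=0.3`) while the hand pose remains well tracked, this rule falls back to the hand pose as the reference for robot actions.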

Topics: Multimodal Models; Robotics & Embodied AI; World Models & Planning

