TRI · USC · Mar 17, 2026 · arXiv:2603.16860

DreamPlan: Efficient Reinforcement Fine-Tuning of Vision-Language Planners via Video World Models

Emily Yue-Ting Jia, Weiduo Yuan, Tianheng Shi, Vitor Guizilini, Jiageng Mao, Yue Wang

AI Summary

DreamPlan addresses the challenge of grounding Vision-Language Model (VLM) planners in real-world physics for robotic manipulation by fine-tuning them with reinforcement learning within a learned video world model. The method first uses the zero-shot VLM to collect exploratory interaction data, which is then used to train an action-conditioned video generation model. Finally, the VLM planner is fine-tuned within this video world model using Odds Ratio Policy Optimization (ORPO), achieving improved manipulation success rates without extensive real-world data collection.
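The fine-tuning step can be illustrated with a minimal sketch of an odds-ratio objective on a single preference pair. This is an assumption-laden illustration, not the paper's implementation: the summary does not say how preference pairs are built or which exact loss form is used. Here it is assumed that two candidate plans are rolled out in the video world model, the better one is labeled "chosen" and the other "rejected", and the loss is the commonly published ORPO form (negative log-likelihood of the chosen plan plus a weighted odds-ratio penalty). The names `log_odds`, `orpo_loss`, and the weight `lam` are hypothetical.

```python
import math


def log_odds(avg_logp: float) -> float:
    """Return log(p / (1 - p)) for p = exp(avg_logp).

    avg_logp is assumed to be the length-normalized log-likelihood of a
    plan under the planner, so it must be strictly negative.
    """
    return avg_logp - math.log1p(-math.exp(avg_logp))


def orpo_loss(avg_logp_chosen: float, avg_logp_rejected: float, lam: float = 0.1) -> float:
    """Odds-ratio loss for one (chosen, rejected) pair of plans.

    Assumption: "chosen" is the plan whose imagined rollout looked better;
    how that ranking is produced is not specified in the source.
    """
    # Supervised term: raise the likelihood of the chosen plan.
    nll = -avg_logp_chosen
    # Odds-ratio term: -log(sigmoid(z)) written as log(1 + exp(-z)).
    z = log_odds(avg_logp_chosen) - log_odds(avg_logp_rejected)
    odds_ratio_term = math.log1p(math.exp(-z))
    return nll + lam * odds_ratio_term


if __name__ == "__main__":
    # Planner already prefers the chosen plan: small loss.
    print(orpo_loss(math.log(0.8), math.log(0.2)))
    # Planner prefers the rejected plan: larger loss.
    print(orpo_loss(math.log(0.2), math.log(0.8)))
```

In this form no separate reference model is needed, since the penalty depends only on the planner's own likelihoods for the two plans. A practical version would compute the same quantities on batched token log-probabilities in an autodiff framework and guard against overflow in the exponential.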

Key Contribution

Fine-tuning Vision-Language Model planners for robotic manipulation is now significantly more efficient and safer thanks to a novel framework that leverages video world models to simulate real-world physics.

Abstract

Robotic manipulation requires sophisticated commonsense reasoning, a capability naturally possessed by large-scale Vision-Language Models (VLMs). While VLMs show promise as zero-shot planners, their lack of grounded physical understanding often leads to compounding errors and low success rates when deployed in complex real-world environments, particularly for challenging tasks like deformable object manipulation. Although Reinforcement Learning (RL) can adapt these planners to specific task dynamics, directly fine-tuning VLMs via real-world interaction is prohibitively expensive, unsafe, and sample-inefficient. To overcome this bottleneck, we introduce DreamPlan, a novel framework for the reinforcement fine-tuning of VLM planners via video world models. Instead of relying on costly physical rollouts, DreamPlan first leverages the zero-shot VLM to collect exploratory interaction data. We demonstrate that this sub-optimal data is sufficient to train an action-conditioned video generation model, which implicitly captures complex real-world physics. Subsequently, the VLM planner is fine-tuned entirely within the "imagination" of this video world model using Odds Ratio Policy Optimization (ORPO). By utilizing these virtual rollouts, physical and task-specific knowledge is efficiently injected into the VLM. Our results indicate that DreamPlan bridges the gap between semantic reasoning and physical grounding, significantly improving manipulation success rates without the need for large-scale real-world data collection. Our project page is https://psi-lab.ai/DreamPlan/.

Multimodal Models, Robotics & Embodied AI, World Models & Planning

Citation Metrics

Citations: 0

Influential citations: 0

References: 39

Year: 2026

Venue: N/A
