Mar 19, 2026 · arXiv:2603.18811

V-Dreamer: Automating Robotic Simulation and Trajectory Synthesis via Video Generation Priors

Songjiang He, Songjia He, Zixuan Chen, Hongyu Ding, Dian Shao, Jieqi Shi, Chenxu Li, Jing Huo, Yang Gao

AI Summary

V-Dreamer automates the creation of robotic simulation environments and expert trajectories from natural language instructions by combining LLMs, 3D generative models, and video generation models. It constructs physically plausible 3D scenes and leverages video generation as a motion prior, mapping visual predictions to executable robot trajectories via a Sim-to-Gen module. Imitation learning policies trained on V-Dreamer's synthesized data demonstrate strong generalization to unseen objects in simulation and effective sim-to-real transfer on a Piper robotic arm.

Key Contribution

Forget hand-crafted assets and heuristics: V-Dreamer combines LLMs, 3D generative models, and video generation models to automatically create diverse, physically plausible robotic simulation environments and expert trajectories directly from language.

Abstract

Training generalist robots demands large-scale, diverse manipulation data, yet real-world collection is prohibitively expensive, and existing simulators are often constrained by fixed asset libraries and manual heuristics. To bridge this gap, we present V-Dreamer, a fully automated framework that generates open-vocabulary, simulation-ready manipulation environments and executable expert trajectories directly from natural language instructions. V-Dreamer employs a novel generative pipeline that constructs physically grounded 3D scenes using large language models and 3D generative models, validated by geometric constraints to ensure stable, collision-free layouts. Crucially, for behavior synthesis, we leverage video generation models as rich motion priors. These visual predictions are then mapped into executable robot trajectories via a robust Sim-to-Gen visual-kinematic alignment module utilizing CoTracker3 and VGGT. This pipeline supports high visual diversity and physical fidelity without manual intervention. To evaluate the generated data, we train imitation learning policies on synthesized trajectories encompassing diverse object and environment variations. Extensive evaluations on tabletop manipulation tasks using the Piper robotic arm demonstrate that our policies robustly generalize to unseen objects in simulation and achieve effective sim-to-real transfer, successfully manipulating novel real-world objects.
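The abstract mentions validating generated scenes with geometric constraints so that layouts are stable and collision-free, without giving details. The following is a minimal sketch of what such a check could look like, assuming each object is approximated by an axis-aligned bounding box on a flat tabletop. The names `Box`, `boxes_overlap`, `rests_on_table`, and `validate_layout`, as well as the margin and tolerance values, are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Box:
    """Axis-aligned bounding box (hypothetical stand-in for a generated asset)."""
    name: str
    center: tuple[float, float, float]  # x, y, z in metres
    size: tuple[float, float, float]    # full extent along x, y, z

    def bounds(self, axis: int) -> tuple[float, float]:
        half = self.size[axis] / 2.0
        return self.center[axis] - half, self.center[axis] + half


def boxes_overlap(a: Box, b: Box, margin: float = 0.0) -> bool:
    """True if the boxes are closer than `margin` along every axis."""
    for axis in range(3):
        a_lo, a_hi = a.bounds(axis)
        b_lo, b_hi = b.bounds(axis)
        # A gap of at least `margin` on any single axis separates the boxes.
        if a_hi + margin <= b_lo or b_hi + margin <= a_lo:
            return False
    return True


def rests_on_table(box: Box, table_height: float, tol: float = 1e-3) -> bool:
    """True if the bottom face of the box sits on the table surface."""
    bottom, _ = box.bounds(2)
    return abs(bottom - table_height) <= tol


def validate_layout(boxes: list[Box], table_height: float,
                    margin: float = 0.01) -> list[str]:
    """Return a list of human-readable problems; an empty list means valid."""
    problems = []
    for box in boxes:
        if not rests_on_table(box, table_height):
            problems.append(f"{box.name} is not resting on the table")
    for a, b in combinations(boxes, 2):
        if boxes_overlap(a, b, margin):
            problems.append(f"{a.name} overlaps {b.name}")
    return problems


mug = Box("mug", (0.0, 0.0, 0.05), (0.1, 0.1, 0.1))
bowl = Box("bowl", (0.3, 0.0, 0.04), (0.2, 0.2, 0.08))
block = Box("block", (0.04, 0.0, 0.05), (0.1, 0.1, 0.1))

print(validate_layout([mug, bowl], table_height=0.0))
print(validate_layout([mug, block], table_height=0.0))
```

In a generate-and-validate loop, a layout that returns any problems would be rejected or resampled. A real pipeline would likely use mesh-level collision checks and a physics settle step rather than bounding boxes; this sketch only illustrates the idea.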

Computer Vision · Robotics & Embodied AI · World Models & Planning

Citation Metrics

Citations: 0

Influential citations: 0

References: 40

Year: 2026

Venue: N/A
