NVIDIA | Mar 17, 2026 | arXiv:2603.16861

MolmoB0T: Large-Scale Simulation Enables Zero-Shot Manipulation

Abhay Deshpande, Maya Guru, Rose Hendrix, Snehal Jauhri, Ainaz Eftekhar, Wilbert Pumacay, Yejin Kim, Max Argus, Quinn Pfeifer, Jordi Salvador, Ying-Chun Lee, Haoquan Fang, Piper Wolters, Omar Rayyan, Matthew Wallingford, Mingtong Zhang, Karen Farley, Winson Han, Eli VanderBilt, Dieter Fox, Ali Farhadi, Georgia Chalvatzaki, Jiafei Duan, Dhruv Shah, Ranjay Krishna

AI Summary

This paper introduces MolmoBot, a system for zero-shot sim-to-real transfer in robotic manipulation that challenges the assumption that real-world data or fine-tuning is necessary. The authors use MolmoBot-Engine, an open-source pipeline for procedural data generation, to create MolmoBot-Data, a dataset of 1.8 million expert trajectories. The trained policies, including a Molmo2-based model, achieve strong zero-shot performance on both tabletop and mobile manipulation tasks, demonstrating the effectiveness of large-scale, diverse simulated training data.

Key Contribution

A large, diverse synthetic dataset (1.8 million expert trajectories) enables zero-shot sim-to-real transfer for robotic manipulation, without real-world data collection or fine-tuning.

Abstract

A prevailing view in robot learning is that simulation alone is not enough; effective sim-to-real transfer is widely believed to require at least some real-world data collection or task-specific fine-tuning to bridge the gap between simulated and physical environments. We challenge that assumption. With sufficiently large-scale and diverse simulated synthetic training data, we show that zero-shot transfer to the real world is not only possible, but effective for both static and mobile manipulation. We introduce MolmoBot-Engine, a fully open-source pipeline for procedural data generation across robots, tasks, and diverse simulated environments in MolmoSpaces. With it, we release MolmoBot-Data, a dataset of 1.8 million expert trajectories for articulated object manipulation and pick-and-place tasks. We train three policy classes: MolmoBot, a Molmo2-based multi-frame vision-language model with a flow-matching action head; MolmoBot-Pi0, which replicates the $\pi_0$ architecture to enable direct comparison; and MolmoBot-SPOC, a lightweight policy suitable for edge deployment and amenable to RL fine-tuning. We evaluate on two robotic platforms: the Franka FR3 for tabletop manipulation tasks and the Rainbow Robotics RB-Y1 mobile manipulator for door opening, drawer manipulation, cabinet interaction, and mobile pick-and-place. Without any real-world fine-tuning, our policies achieve zero-shot transfer to unseen objects and environments. On tabletop pick-and-place, MolmoBot achieves a success rate of 79.2% in real-world evaluations across 4 settings, outperforming $\pi_{0.5}$ at 39.2%. Our results demonstrate that procedural environment generation combined with diverse articulated assets can produce robust manipulation policies that generalize broadly to the real world.

Technical Blog: https://allenai.org/blog/molmobot-robot-manipulation

Data Curation & Synthetic Data | Robotics & Embodied AI | World Models & Planning

Citation Metrics

Citations: 0

Influential citations: 0

References: 39

Year: 2026

Venue: N/A
