AI2 | Paul G. Allen School of Computer Science | Apr 9, 2026 | arXiv:2604.08516

MolmoWeb: Open Visual Web Agent and Open Data for the Open Web

Tanmay Gupta, Piper Wolters, Zixian Ma, Peter Sushko, Petr Sushko, Rock Yuren Pang, Diego Llanes, Yue Yang, Taira Anderson, Bo Zheng, Boyuan Zheng, Zhongzheng Ren, Harsh Trivedi, H. Trivedi, Taylor Blanton, Caleb Ouellette, Winson Han, Ali Farhadi, Ranjay Krishna

AI Summary

The authors introduce MolmoWebMix, a large-scale dataset comprising synthetic and human demonstrations for web navigation, and MolmoWeb, a family of open-weight multimodal web agents trained on this dataset. MolmoWeb agents operate as instruction-conditioned visual-language action policies, predicting browser actions from webpage screenshots without relying on HTML or specialized APIs. MolmoWeb-8B achieves state-of-the-art performance among open-weight models and even surpasses larger, closed-source models like GPT-4o on certain benchmarks, demonstrating the potential of open data and models for web agent development.

Key Contribution

Open-weight web agents can now outperform set-of-marks agents built on GPT-4o on web navigation benchmarks, thanks to a new open dataset and model family.

Abstract

Web agents, autonomous systems that navigate and execute tasks on the web on behalf of users, have the potential to transform how people interact with the digital world. However, the most capable web agents today rely on proprietary models with undisclosed training data and recipes, limiting scientific understanding, reproducibility, and community-driven progress. We believe agents for the open web should be built in the open. To this end, we introduce (1) MolmoWebMix, a large and diverse mixture of browser task demonstrations and web-GUI perception data, and (2) MolmoWeb, a family of fully open multimodal web agents. Specifically, MolmoWebMix combines over 100K synthetic task trajectories from multiple complementary generation pipelines with 30K+ human demonstrations, atomic web-skill trajectories, and GUI perception data, including referring expression grounding and screenshot question answering. MolmoWeb agents operate as instruction-conditioned visual-language action policies: given a task instruction and a webpage screenshot, they predict the next browser action, requiring no access to HTML, accessibility trees, or specialized APIs. Available in 4B and 8B sizes, MolmoWeb agents achieve state-of-the-art results on browser-use benchmarks such as WebVoyager, Online-Mind2Web, and DeepShop, outperforming similar-scale open-weight-only models such as Fara-7B, UI-Tars-1.5-7B, and Holo1-7B. MolmoWeb-8B also surpasses set-of-marks (SoM) agents built on much larger closed frontier models like GPT-4o. We further demonstrate consistent gains through test-time scaling via parallel rollouts with best-of-N selection, achieving 94.7% and 60.5% pass@4 (compared to 78.2% and 35.3% pass@1) on WebVoyager and Online-Mind2Web, respectively. We will release model checkpoints, training data, code, and a unified evaluation harness to enable reproducibility and accelerate open research on web agents.
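To make the pass@1 versus pass@4 comparison concrete, the following is a minimal sketch of how such numbers could be computed from parallel rollouts. It assumes pass@k means "at least one of k rollouts for a task succeeded"; the abstract does not spell out the exact estimator or the best-of-N selection rule, so `pass_at_k`, `best_of_n`, the `score` callable, and the toy outcomes are all hypothetical illustrations, not the authors' evaluation harness.

```python
from typing import Callable, Sequence


def pass_at_k(outcomes: Sequence[Sequence[bool]], k: int) -> float:
    """Fraction of tasks where at least one of the first k rollouts succeeded.

    Assumption: this simple "any success in k" definition; the paper may use
    a different estimator.
    """
    if not outcomes:
        raise ValueError("need at least one task")
    solved = 0
    for rollouts in outcomes:
        if len(rollouts) < k:
            raise ValueError("each task needs at least k rollouts")
        solved += any(rollouts[:k])
    return solved / len(outcomes)


def best_of_n(candidates: Sequence[str], score: Callable[[str], float]) -> str:
    """Return the candidate rollout with the highest score.

    `score` is a hypothetical stand-in for whatever selector ranks rollouts.
    """
    return max(candidates, key=score)


# Toy data (made up): 4 tasks, 4 parallel rollouts each, True = task solved.
toy_outcomes = [
    [True, False, False, False],
    [False, False, True, False],
    [False, False, False, False],
    [False, True, True, True],
]

print(f"pass@1 = {pass_at_k(toy_outcomes, 1):.2f}")
print(f"pass@4 = {pass_at_k(toy_outcomes, 4):.2f}")
```

On the toy data, only one task is solved by its first rollout, while three are solved by at least one of four, which mirrors why pass@4 is higher than pass@1 in the reported results.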

Data Curation & Synthetic Data | Open-Source Models & Weights | Tool Use & Agents

Citation Metrics

Citations: 1

Influential citations: 0

References: 64

Year: 2026

Venue: N/A
