Princeton · Waterloo · Jun 4, 2026 · arXiv:2606.06538

WorldBench: A Challenging and Visually Diverse Multimodal Reasoning Benchmark

Yida Yin, H. Krishnakumar, Chung Peng Lee, Boya Zeng, Wenhao Chai, Shengbang Tong, Wenhu Chen, Hu Xu, Xingyu Fu, Gabriel Sarch, Aleksandra Korolova, Zhuang Liu

AI Summary

This paper introduces WorldBench, a new multimodal reasoning benchmark designed to evaluate Multimodal Large Language Models (MLLMs) by emphasizing visual diversity across various domains. By creating a comprehensive taxonomy of visual concepts and curating a diverse image collection, the authors crafted challenging questions that expose the limitations of current MLLMs in visual understanding. The evaluation of 15 MLLMs on WorldBench reveals significant performance gaps, with the best model achieving only 64.0% accuracy, underscoring the necessity for benchmarks that reflect real-world visual complexity.

Key Contribution

Even the top-performing MLLMs struggle with visual reasoning, achieving only 64% accuracy on a benchmark designed to reflect real-world diversity.

Abstract

In real-world applications, models are expected to perform reliably across diverse settings. Yet many existing multimodal benchmarks expand task types without capturing the visual diversity needed to handle open-ended visual inputs. We present WorldBench, a challenging and visually diverse reasoning benchmark to evaluate Multimodal Large Language Models (MLLMs). We build a taxonomy of thousands of visual concepts across multiple domains (e.g., living things). Guided by this taxonomy, we curate a broad collection of images from search engines and existing datasets to comprehensively represent the visual world. Through structured trial-and-error, we manually design challenging questions that frontier MLLMs fail to answer. On quantitative and human evaluations, WorldBench achieves higher visual diversity than any existing diverse benchmark. Evaluating 15 MLLMs on WorldBench reveals weaknesses in visual understanding: even the strongest model reaches only 64.0% accuracy, while some models perform only marginally above chance level. We hope our work highlights the importance of visual diversity in building multimodal benchmarks.

Eval Frameworks & Benchmarks · Multimodal Models

Citation Metrics

Citations: 0

Influential citations: 0

References: 71

Year: 2026

Venue: N/A
