CAS · HKUST · PKU · UCLA · UNSW · XJU · Apr 3, 2026 · arXiv:2604.03016

Agentic-MME: What Agentic Capability Really Brings to Multimodal Intelligence?

Qianshan Wei, Yishan Yang, Yi-Shuai Yang, Siyi Wang, Jinglin Chen, Binyu Wang, Jiaming Wang, Shuang Chen, Zechen Li, Yang Shi, Yuqi Tang, Yuqing Tang, Weining Wang, Yi Yu, Yinfeng Yu, Chaoyou Fu, Qi Li, Qi Li, Yifan Zhang

AI Summary

The paper introduces Agentic-MME, a new benchmark designed to evaluate the agentic capabilities of Multimodal Large Language Models (MLLMs) by focusing on their ability to effectively utilize visual and knowledge expansion tools. Unlike existing benchmarks, Agentic-MME emphasizes process-level verification through stepwise checkpoints and a unified evaluation framework that supports sandboxed code and APIs. Experiments using Agentic-MME reveal that even state-of-the-art models like Gemini3-pro struggle with complex, real-world tasks, achieving only 23.0% accuracy on the most difficult level.

Key Contribution

Current MLLM benchmarks are missing the forest for the trees: Agentic-MME reveals that strong final-answer accuracy masks surprisingly poor tool use and planning in complex multimodal tasks.

Abstract

Multimodal Large Language Models (MLLMs) are evolving from passive observers into active agents, solving problems through Visual Expansion (invoking visual tools) and Knowledge Expansion (open-web search). However, existing evaluations fall short: they lack flexible tool integration, test visual and search tools separately, and evaluate primarily by final answers. Consequently, they cannot verify whether tools were actually invoked, applied correctly, or used efficiently. To address this, we introduce Agentic-MME, a process-verified benchmark for Multimodal Agentic Capabilities. It contains 418 real-world tasks across 6 domains and 3 difficulty levels to evaluate capability synergy, with over 2,000 stepwise checkpoints and an average of 10+ person-hours of manual annotation per task. Each task includes a unified evaluation framework supporting sandboxed code and APIs, alongside a human reference trajectory annotated with stepwise checkpoints along two axes: the S-axis and the V-axis. To enable true process-level verification, we audit fine-grained intermediate states rather than just final answers, and quantify efficiency via an overthinking metric relative to human trajectories. Experimental results show that the best model, Gemini3-pro, achieves 56.3% overall accuracy, which falls to 23.0% on Level-3 tasks, underscoring the difficulty of real-world multimodal agentic problem solving.
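The abstract names two process-level quantities, checkpoint verification along the S-axis and V-axis and an overthinking metric relative to human trajectories, but does not give their formulas. The sketch below is only an illustration of how such scores could be computed. The `Checkpoint` type, the per-axis pass rate, and the "excess steps over the human reference" definition of overthinking are all assumptions made here, not the paper's definitions.

```python
from dataclasses import dataclass


@dataclass
class Checkpoint:
    """Hypothetical record for one stepwise checkpoint."""
    axis: str     # "S" or "V", the two axes named in the abstract
    passed: bool  # whether the model's trajectory satisfied this checkpoint


def axis_pass_rates(checkpoints):
    """Fraction of checkpoints passed on each axis (assumed scoring rule)."""
    totals, hits = {}, {}
    for cp in checkpoints:
        totals[cp.axis] = totals.get(cp.axis, 0) + 1
        hits[cp.axis] = hits.get(cp.axis, 0) + int(cp.passed)
    return {axis: hits[axis] / totals[axis] for axis in totals}


def overthinking(model_steps, human_steps):
    """Assumed definition: extra steps relative to the human reference,
    normalized by the human step count and floored at zero."""
    if human_steps <= 0:
        raise ValueError("human_steps must be positive")
    return max(0, model_steps - human_steps) / human_steps
```

Under these assumptions, a model that takes 9 tool steps on a task whose human reference trajectory takes 6 would score `overthinking(9, 6) == 0.5`, while a model that finishes in fewer steps than the reference scores 0.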

Eval Frameworks & Benchmarks · Multimodal Models · Tool Use & Agents

Citation Metrics

Citations: 0

Influential citations: 0

References: 37

Year: 2026

Venue: N/A
