Feb 26, 2026 · arXiv:2602.23166

AgentVista: Evaluating Multimodal Agents in Ultra-Challenging Realistic Visual Scenarios

Zhaochen Su, Jincheng Gao, Hangyu Guo, Zhenhua Liu, Zhenhua Liu, Lu Zhang, Lueyang Zhang, Xinyu Geng, Shijue Huang, Peng Xia, Guanyu Jiang, Cheng Wang, Yue Zhang, Yi R. Fung, Junxian He

AI Summary

AgentVista is introduced as a new benchmark for evaluating generalist multimodal agents, focusing on realistic and detail-rich visual scenarios requiring long-horizon tool interactions across modalities. The benchmark spans 25 sub-domains across 7 categories, demanding web search, image search, page navigation, and code-based operations. Evaluation of state-of-the-art models on AgentVista reveals substantial limitations, with the best model, Gemini-3-Pro with tools, achieving only 27.3% overall accuracy, highlighting the need for advancements in long-horizon multimodal reasoning and tool use.

Key Contribution

Even the best multimodal agent evaluated, Gemini-3-Pro with tools, reaches only 27.3% overall accuracy on the new AgentVista benchmark, which demands long-horizon tool use across web search, image search, page navigation, and code.

Abstract

Real-world multimodal agents solve multi-step workflows grounded in visual evidence. For example, an agent can troubleshoot a device by linking a wiring photo to a schematic and validating the fix with online documentation, or plan a trip by interpreting a transit map and checking schedules under routing constraints. However, existing multimodal benchmarks mainly evaluate single-turn visual reasoning or specific tool skills, and they do not fully capture the realism, visual subtlety, and long-horizon tool use that practical agents require. We introduce AgentVista, a benchmark for generalist multimodal agents that spans 25 sub-domains across 7 categories, pairing realistic and detail-rich visual scenarios with natural hybrid tool use. Tasks require long-horizon tool interactions across modalities, including web search, image search, page navigation, and code-based operations for both image processing and general programming. Comprehensive evaluation of state-of-the-art models exposes significant gaps in their ability to carry out long-horizon multimodal tool use. Even the best model in our evaluation, Gemini-3-Pro with tools, achieves only 27.3% overall accuracy, and hard instances can require more than 25 tool-calling turns. We expect AgentVista to accelerate the development of more capable and reliable multimodal agents for realistic and ultra-challenging problem solving.
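The long-horizon tool interaction and the overall-accuracy metric described above can be illustrated with a minimal sketch. This is not the AgentVista harness: the names `Step`, `run_episode`, and `overall_accuracy`, the `max_turns` budget, and the case-insensitive exact-match scoring are all assumptions made for illustration, since the abstract does not specify how answers are judged.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple


@dataclass
class Step:
    """One agent decision (hypothetical structure, not from the paper)."""
    tool: Optional[str]  # tool name to call, or None to give a final answer
    argument: str        # tool input, or the final answer when tool is None


# (tool name, tool input, observation) for each completed tool call
History = List[Tuple[str, str, str]]
Policy = Callable[[str, History], Step]
Tool = Callable[[str], str]


def run_episode(question: str, policy: Policy, tools: Dict[str, Tool],
                max_turns: int = 30) -> Tuple[Optional[str], int]:
    """Let the policy call tools until it answers or the turn budget runs out.

    Returns (answer or None, number of tool-calling turns used).
    """
    history: History = []
    for _ in range(max_turns):
        step = policy(question, history)
        if step.tool is None:
            return step.argument, len(history)
        tool_fn = tools.get(step.tool)
        if tool_fn is None:
            observation = f"unknown tool: {step.tool}"
        else:
            observation = tool_fn(step.argument)
        history.append((step.tool, step.argument, observation))
    return None, len(history)  # budget exhausted without a final answer


def overall_accuracy(predictions: List[Optional[str]],
                     references: List[str]) -> float:
    """Fraction of tasks answered correctly (assumed exact-match scoring)."""
    if not references:
        return 0.0
    correct = sum(
        1 for pred, ref in zip(predictions, references)
        if pred is not None and pred.strip().lower() == ref.strip().lower()
    )
    return correct / len(references)
```

In such a setup, `tools` would map labels for the four tool families named in the abstract (for example, hypothetical keys `web_search`, `image_search`, `visit_page`, and `run_code`) to callables, and a hard instance would be one where the policy needs more than 25 iterations of the loop before answering.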

Eval Frameworks & Benchmarks · Multimodal Models · Tool Use & Agents

Citation Metrics

Citations: 0

Influential citations: 0

References: 40

Year: 2026

Venue: N/A
