Ant Digital Technologies, Ant Group; RUC; FZU; THU; USTB; PKU

Abstract

The rapid advancement of Multimodal Large Language Models (MLLMs) has enabled browsing agents to acquire and reason over multimodal information in the real world. However, existing benchmarks suffer from two limitations: insufficient evaluation of visual reasoning ability and neglect of the native visual information of web pages in the reasoning chains. To address these challenges, we introduce VisBrowse-Bench, a new benchmark for visual-native search. It contains 169 VQA instances covering multiple domains and evaluates models' visual reasoning capabilities during the search process through multimodal evidence cross-validation via text-image retrieval and joint reasoning. The data were constructed by human experts using a multi-stage pipeline and underwent rigorous manual verification. We additionally propose an agent workflow that effectively drives the browsing agent to actively collect and reason over visual information during the search process. We comprehensively evaluate both open-source and closed-source models in this workflow. Experimental results show that even the best-performing model, Claude-4.6-Opus, achieves an accuracy of only 47.6%, while the proprietary Deep Research model, o3-deep-research, achieves only 41.1%. The code and data can be accessed at: https://github.com/ZhengboZhang/VisBrowse-Bench

1 Introduction

Driven by rapid advancements in large language models (LLMs) and agent technologies, a wealth of high-quality work has emerged in the deep research domain [gunjal2025rubrics; deepeyesv2; shao2025dr; yao2026researcher]. However, existing deep research benchmarks predominantly focus on the textual modality, thereby neglecting the multimodal demands inherent in real-world retrieval scenarios.
Concurrently, the evolution of multimodal large language models (MLLMs) has inspired a series of works on multimodal browsing agents [yao2026mm; yan2025comprehensive]. Nevertheless, existing multimodal benchmarks still exhibit significant limitations, as shown in Figure 1.

Figure 1: Existing benchmarks have two limitations in evaluating multimodal browsing agents: (1) the semantic information of visual queries can be easily obtained through image search tools; (2) real-world browsing environments contain a wealth of multimodal information, which most benchmarks overlook. VisBrowse-Bench is designed to fuse multimodal information during the search process and ensure that visual capabilities are essential for completing the task.

Specifically, most current benchmarks (e.g., MMSearch [mmsearch] and BrowseComp-VL [webwatcher]) merely test models' ability to invoke tools to solve text-image queries. These tasks typically introduce an image search tool into which models simply feed the image for retrieval. Such tasks do not require fine-grained understanding of multimodal information and thus fail to sufficiently challenge models' multimodal comprehension capabilities in deep research scenarios; instead, they primarily emphasize tool-calling ability. Moreover, even though some benchmarks (e.g., MMSearch-Plus [mmsearch-plus] and VDR-Bench [vdr]) require initial visual perception of text-image queries, the subsequent information-gathering process degenerates into single-modal text traversal. Existing benchmarks structure their search space such that once the query image yields an entity name or caption, all downstream reasoning can be completed through textual document retrieval and synthesis. The search trajectory never necessitates grounding, parsing, or reasoning over additional visual information discovered during the search process.
The task thus degenerates into text-only browsing, failing to assess whether models can dynamically acquire and integrate visual information when it is not provided upfront but must be actively sought across web pages containing multimodal information. The ability to conduct visual search and understanding throughout a task is vital in real-world retrieval, but existing benchmarks fail to evaluate it, limiting further development.

Figure 3: (a) Overall performance of MLLMs on VisBrowse-Bench. (b) Performance of four MLLMs on seven categories.

To address these challenges, we introduce VisBrowse-Bench, a challenging benchmark that comprehensively evaluates the reasoning and search capabilities of multimodal browsing agents. VisBrowse-Bench rests on two core design principles: the integration of multimodal information within the reasoning chain, and an inherent dependence on visual capabilities. We constructed the benchmark through a multi-stage, expert-guided pipeline, resulting in 169 VQA instances covering seven distinct domains. Starting from a seed entity with inherent visual ambiguity, domain experts recursively construct multi-hop reasoning chains that require locating and reasoning over novel visual evidence that cannot be paraphrased textually, alongside textual sources that provide complementary but insufficient information. Furthermore, we propose an efficient agentic workflow for visual reasoning and visual information retrieval during the search process. In this workflow, the agent is driven to actively perform visual reasoning and visual information retrieval using a rich set of tools. The performance of MLLMs using our workflow on VisBrowse-Bench is shown in Figure 3. The best result on VisBrowse-Bench is achieved by Claude-4.6-Opus, with an accuracy of 47.6%, while most models achieve an accuracy of around 30%.
Our key contributions can be summarized as follows:

• We formalize the task of multimodal browsing, identifying two critical challenges in existing benchmarks: insufficient evaluation of visual reasoning ability and neglect of visual-native information in the reasoning chains.

• We propose VisBrowse-Bench, a new benchmark comprising 169 rigorously validated instances constructed by human experts. The benchmark jointly evaluates the search and visual reasoning capabilities of multimodal browsing agent systems.

• We introduce a multi-turn visual information retrieval-and-reasoning agentic workflow for solving multimodal browsing problems in the real world. Compared to direct answering, models' performance improves significantly under our workflow, though performance limitations remain. The results demonstrate that the agentic workflow is effective and that our benchmark is challenging.

2 Related Work

2.1 Multimodal Browsing Agents

Early systems built upon LLMs demonstrated the feasibility of tool-augmented web navigation, employing search engines and web browsing tools to retrieve and integrate textual information in response to user queries [searchr1; browseragent; webdancer; team2025tongyi; chang2025grail]. These agent systems established the foundational architecture for iterative retrieval and reasoning, but remained fundamentally constrained by their inability to perceive and process visual content, reducing rich multimodal web environments to text-only representations. The advent of MLLMs has driven a paradigm shift toward multimodal browsing. MMSearch-R1 [mmsearchr1] is the first end-to-end reinforcement learning framework that drives MLLMs to perform multi-turn searches on demand in real-world internet environments. WebWatcher [webwatcher] leverages additional external tools and designs a data synthesis pipeline to produce high-quality multimodal data for training. DeepMMSearch-R1 [deepmmsearch], Skywork-R