Jun 5, 2026 · arXiv:2606.07689

Struct-Searcher: Agentic Structural Thinking Advances Multimodal Deep Information Seeking

Fan Zhang, Vireo Zhang, Shengju Qian, Haoxuan Li, Zheng Lian, Hao Wu, Yuan Gao, Xinyu Geng, Xin Wang, P. Heng

AI Summary

This paper introduces Struct-Searcher, an agentic workflow for multimodal deep information seeking that draws on belief revision theory to handle conflicting information across modalities. It maintains an evolving multimodal structural graph throughout the reasoning process, so that contradictory evidence from heterogeneous sources is handled explicitly instead of being aggregated linearly. In the reported experiments, Struct-Searcher yields an average relative accuracy improvement of 17.2% on BrowseComp-VL across five backbone models and outperforms state-of-the-art vision-language models and deep research agents on MM-BrowseComp, HLE-VL, and BrowseComp-VL.

Key Contribution

Struct-Searcher is a plug-and-play, model-agnostic workflow that manages conflicting multimodal evidence through an evolving structural graph, yielding an average relative accuracy improvement of 17.2% on BrowseComp-VL across five backbones.

Abstract

Deep research agents have attracted increasing attention for their ability to collect large-scale online information to acquire target knowledge, with recent efforts shifting from purely text-based information seeking to multimodal settings. However, existing agentic workflows are largely aligned with evidence accumulation models, which linearly aggregate evidence and lack principled mechanisms for handling contradictory information across heterogeneous modalities. To this end, we propose Struct-Searcher, a structural agentic workflow grounded in belief revision theory that explicitly maintains an evolving multimodal structural graph throughout the reasoning process, enabling effective conflict-aware multimodal deep information seeking. Extensive experiments across multiple benchmark datasets and backbone models demonstrate that Struct-Searcher is (1) plug-and-play and model-agnostic, yielding an average relative accuracy improvement of 17.2% on BrowseComp-VL across five different backbones, and (2) top-performing, consistently outperforming state-of-the-art vision-language models (VLMs) and deep research agents, with relative accuracy improvements of 3.7% on MM-BrowseComp, 1.5% on HLE-VL, and 0.7% on BrowseComp-VL over the second-best competing approach.
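
The contrast the abstract draws between linear evidence accumulation and conflict-aware revision can be illustrated with a toy sketch. This is not the paper's algorithm, since the abstract gives no implementation details. The names `Claim`, `BeliefGraph`, and `revise`, the `confidence` score, and the rule that the higher-confidence claim wins a conflict are all hypothetical choices made only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Claim:
    subject: str
    attribute: str
    value: str
    modality: str      # e.g. "text" or "image" (illustrative labels)
    confidence: float  # hypothetical reliability score in [0, 1]


@dataclass
class BeliefGraph:
    # (subject, attribute) -> currently accepted claim
    beliefs: dict = field(default_factory=dict)
    # claims that lost a conflict, kept so they can be re-examined later
    retracted: list = field(default_factory=list)

    def revise(self, new: Claim) -> str:
        """Integrate one claim and report what happened to it."""
        key = (new.subject, new.attribute)
        old = self.beliefs.get(key)
        if old is None:
            self.beliefs[key] = new
            return "added"
        if old.value == new.value:
            # Agreeing evidence: keep whichever copy is more confident.
            if new.confidence > old.confidence:
                self.beliefs[key] = new
            return "confirmed"
        # Conflict: assumed rule, the more confident claim is kept.
        if new.confidence > old.confidence:
            self.retracted.append(old)
            self.beliefs[key] = new
            return "revised"
        self.retracted.append(new)
        return "rejected"


def accumulate(claims):
    """Linear accumulation baseline: keep every claim, no conflict check."""
    return list(claims)


# Made-up example claims, not taken from any benchmark.
claims = [
    Claim("bridge", "opened", "1932", "text", 0.6),
    Claim("bridge", "opened", "1937", "image", 0.9),
    Claim("bridge", "opened", "1937", "text", 0.7),
]
graph = BeliefGraph()
outcomes = [graph.revise(c) for c in claims]
```

In this sketch the accumulation baseline ends up holding all three claims, including the contradictory one, while the graph keeps a single accepted value per (subject, attribute) pair and records the claim it retracted.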

Multimodal Models · Tool Use & Agents

Citation Metrics

Citations: 0

Influential citations: 0

References: 43

Year: 2026

Venue: N/A
