Ant Digital Technologies Group | Institute of Automation | Jun 13, 2026 | arXiv:2606.15231

Visual-Seeker: Towards Visual-Native Multimodal Agentic Search via Active Visual Reasoning

Zhengbo Zhang, Changtao Miao, Jinbo Su, Zhaowen Zhou, Chunxia Zhang, Xukai Wang, Ruiqi Liu, Kaiyuan Zheng, Jiansheng Cai, Bo Zhang, Zhe Li, Shiming Xiang, Ying Yan

AI Summary

This paper introduces Visual-Seeker, a visual-native multimodal deep search agent that improves factual grounding in complex, open-world scenarios through active visual reasoning. By attending to fine-grained visual details and harvesting visual evidence during the search process, Visual-Seeker addresses limitations of existing methods that rely on simple images and text-only evidence. Experiments show that it achieves state-of-the-art performance across five multimodal search benchmarks, surpassing several proprietary models in real-world web environments.

Key Contribution

Visual-Seeker is a visual-native deep search agent that actively attends to fine-grained visual details and gathers visual evidence during search, surpassing several proprietary models on five multimodal search benchmarks.

Abstract

Multimodal large language models (MLLMs) have demonstrated impressive capabilities in many visual tasks, but they often struggle with factual grounding when confronted with complex, open-world scenarios. While recent multimodal deep search agents attempt to address this issue by using external tools, the visual-native search paradigm remains underexplored. Existing methods primarily rely on simple images with explicit semantics and on text-only evidence trajectories, which limits the agent's ability to perform multi-hop, cross-modal reasoning and search. To address these limitations, we propose Visual-Seeker, a visual-native multimodal deep search agent built on active visual reasoning. Rather than treating vision as a static input, our agent actively attends to fine-grained visual details and dynamically harvests visual evidence throughout the search process. To unlock its visual-native potential, we design an active visual reasoning data pipeline and synthesize 5K high-quality multimodal trajectories for model training. Extensive experiments demonstrate state-of-the-art performance across five challenging multimodal search benchmarks, even surpassing several proprietary models, and validate robust visual-native reasoning and search in real-world web environments. The code and data are available at https://github.com/ZhengboZhang/Visual-Seeker.
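
To make the idea of treating vision as an active input rather than a static one more concrete, the following is a minimal sketch of an agent loop that alternates between inspecting image regions and issuing searches. It is not taken from the Visual-Seeker codebase: the names `Action`, `State`, `clamp_box`, `run_agent`, and the `policy`, `describe_region`, and `search` callables are all hypothetical, and the three-action design is an assumption made only for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

# Hypothetical types for illustration; not the paper's actual interface.
Box = Tuple[int, int, int, int]  # (left, top, right, bottom) in pixels


@dataclass
class Action:
    kind: str                 # assumed action set: "zoom", "search", "answer"
    box: Box = (0, 0, 0, 0)   # region to inspect when kind == "zoom"
    text: str = ""            # search query or final answer


@dataclass
class State:
    question: str
    image_size: Tuple[int, int]                        # (width, height)
    evidence: List[str] = field(default_factory=list)  # grows during search


def clamp_box(box: Box, size: Tuple[int, int]) -> Box:
    """Clip a requested region so it stays inside the image."""
    width, height = size
    left, top, right, bottom = box
    left, right = max(0, min(left, width)), max(0, min(right, width))
    top, bottom = max(0, min(top, height)), max(0, min(bottom, height))
    return (left, top, max(left, right), max(top, bottom))


def run_agent(
    state: State,
    policy: Callable[[State], Action],        # stands in for the trained model
    describe_region: Callable[[Box], str],    # stands in for a visual tool
    search: Callable[[str], str],             # stands in for a web search tool
    max_steps: int = 8,
) -> str:
    """Interleave visual inspection and search until the policy answers."""
    for _ in range(max_steps):
        action = policy(state)
        if action.kind == "zoom":
            box = clamp_box(action.box, state.image_size)
            state.evidence.append(f"visual{box}: {describe_region(box)}")
        elif action.kind == "search":
            state.evidence.append(f"search[{action.text}]: {search(action.text)}")
        elif action.kind == "answer":
            return action.text
        else:
            raise ValueError(f"unknown action kind: {action.kind}")
    return ""  # step budget exhausted without an answer
```

In this sketch, visual observations and search results land in the same evidence list, so a later step can condition on a detail found by zooming. Whether the actual agent structures its trajectories this way is not stated in the abstract.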

Multimodal Models | Reasoning & Chain-of-Thought | Tool Use & Agents

Year: 2026

Venue: N/A