This paper introduces Visual-Seeker, a visual-native multimodal deep search agent that improves factual grounding in complex, open-world scenarios through active visual reasoning. By dynamically attending to fine-grained visual details and harvesting visual evidence during the search process, Visual-Seeker overcomes limitations of existing methods that rely on simple, static images and text-only evidence. Extensive experiments show that it achieves state-of-the-art performance across five multimodal search benchmarks, outperforming several proprietary models and demonstrating robust reasoning and search in real-world web environments.
Visual-Seeker outperforms several proprietary models on five multimodal search benchmarks by actively attending to fine-grained visual details during search.
Multimodal large language models (MLLMs) have demonstrated impressive capabilities in many visual tasks, but they often struggle with factual grounding when confronted with complex, open-world scenarios. While recent multimodal deep search agents attempt to address this issue by utilizing external tools, the visual-native search paradigm remains underexplored. Existing methods primarily rely on simple images with explicit semantics and text-only evidence trajectories, limiting the agent's ability to perform multi-hop, cross-modal reasoning and search. To address these limitations, we propose Visual-Seeker, a visual-native multimodal deep search agent driven by active visual reasoning. Rather than treating vision as a static input, our agent actively attends to fine-grained visual details and dynamically harvests visual evidence throughout the search process. To unlock its visual-native potential, we design an active visual reasoning data pipeline and synthesize 5K high-quality multimodal trajectories for model training. Extensive experiments demonstrate state-of-the-art performance across five challenging multimodal search benchmarks, even surpassing several proprietary models, validating robust visual-native reasoning and search in real-world web environments. The code and data can be accessed at: https://github.com/ZhengboZhang/Visual-Seeker.