UNC | Mar 12, 2026 | arXiv:2603.12180

Strategic Navigation or Stochastic Search? How Agents and Humans Reason Over Document Collections

Łukasz Borchmann, Jordy Van Landeghem, Michał Turski, Shreyansh Padarha, Ryan Othniel Kearns, Adam Mahdi, Niels Rogge, Clémentine Fourrier, Siwei Han, Huaxiu Yao, Artemis Llabrés, Yiming Xu, Dimosthenis Karatzas, Hao Zhang, Anupam Datta

AI Summary

The paper introduces MADQA, a benchmark of 2,250 questions grounded in 800 PDF documents, designed to evaluate strategic reasoning in multimodal agents navigating document collections. The authors propose an evaluation protocol that measures the accuracy-effort trade-off to distinguish strategic navigation from brute-force search. Experiments show that while the best current agents can match human searchers in raw accuracy, they rely heavily on inefficient search strategies and remain nearly 20% short of oracle performance, indicating a lack of genuine strategic planning.

Key Contribution

Current multimodal agents navigating document collections achieve human-level accuracy through brute-force search, highlighting a critical gap in strategic reasoning and planning.

Abstract

Multimodal agents offer a promising path to automating complex document-intensive workflows. Yet, a critical question remains: do these agents demonstrate genuine strategic reasoning, or merely stochastic trial-and-error search? To address this, we introduce MADQA, a benchmark of 2,250 human-authored questions grounded in 800 heterogeneous PDF documents. Guided by Classical Test Theory, we design it to maximize discriminative power across varying levels of agentic abilities. To evaluate agentic behaviour, we introduce a novel evaluation protocol measuring the accuracy-effort trade-off. Using this framework, we show that while the best agents can match human searchers in raw accuracy, they succeed on largely different questions and rely on brute-force search to compensate for weak strategic planning. They fail to close the nearly 20% gap to oracle performance, persisting in unproductive loops. We release the dataset and evaluation harness to help facilitate the transition from brute-force retrieval to calibrated, efficient reasoning.
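The abstract names two measurement ideas without detailing them: item selection guided by Classical Test Theory, and an accuracy-effort trade-off. The sketch below illustrates both under stated assumptions and is not the paper's protocol. `item_statistics` uses the textbook Classical Test Theory definitions (difficulty as the fraction of correct answers, discrimination as the point-biserial correlation with the rest-of-test score). `accuracy_at_budget` is a hypothetical way to relate accuracy to effort by counting an answer only if it was reached within a step budget. All function names and the data layout are assumptions made for illustration.

```python
from statistics import mean, pstdev


def item_statistics(responses):
    """Per-question (difficulty, discrimination) pairs.

    responses[a][q] is 1 if agent a answered question q correctly, else 0.
    Difficulty is the fraction of correct answers. Discrimination is the
    point-biserial correlation between the item score and the score on all
    other items (textbook Classical Test Theory, not the paper's code).
    """
    n_items = len(responses[0])
    stats = []
    for q in range(n_items):
        item = [row[q] for row in responses]
        rest = [sum(row) - row[q] for row in responses]
        difficulty = mean(item)
        sd_item, sd_rest = pstdev(item), pstdev(rest)
        if sd_item == 0 or sd_rest == 0:
            # No variance: the item cannot separate strong from weak agents.
            discrimination = 0.0
        else:
            cov = mean(i * r for i, r in zip(item, rest)) - difficulty * mean(rest)
            discrimination = cov / (sd_item * sd_rest)
        stats.append((difficulty, discrimination))
    return stats


def accuracy_at_budget(runs, budget):
    """Fraction of questions answered correctly within `budget` steps.

    runs is a list of (correct, steps) pairs, one per question. This is a
    hypothetical effort-aware accuracy, assumed here for illustration.
    """
    within = sum(1 for correct, steps in runs if correct and steps <= budget)
    return within / len(runs)


if __name__ == "__main__":
    # Four hypothetical agents, three questions.
    responses = [[1, 1, 1], [1, 1, 0], [1, 0, 0], [0, 0, 0]]
    for q, (p, r) in enumerate(item_statistics(responses)):
        print(f"question {q}: difficulty={p:.2f} discrimination={r:.2f}")

    # One hypothetical agent: (answered correctly, tool calls used).
    runs = [(True, 3), (True, 12), (False, 5), (True, 8)]
    for budget in (2, 10, 20):
        print(f"budget={budget}: accuracy={accuracy_at_budget(runs, budget):.2f}")
```

Sweeping `budget` traces an accuracy-effort curve. Under this assumed metric, an agent that only reaches high accuracy at large budgets is leaning on exhaustive search, while a strategic one plateaus early.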

Eval Frameworks & Benchmarks | Recommendation & Information Retrieval | Tool Use & Agents

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
