Melbourne | UWA | Mar 9, 2026 | arXiv:2603.07931

BRIDGE: Benchmark for multi-hop Reasoning In long multimodal Documents with Grounded Evidence

Biao Xiang, Soyeon Caren Han, Yihao Ding

AI Summary

The paper introduces BRIDGE, a new benchmark for multi-hop question answering over long scientific papers containing text, tables, and figures. BRIDGE includes explicit multi-hop reasoning annotations, enabling step-level evaluation of reasoning beyond just answer accuracy. Experiments using BRIDGE reveal that current LLMs and multimodal RAG systems struggle with evidence aggregation and grounding in long, multimodal contexts, even when they achieve high answer accuracy.

Key Contribution

Current LLM benchmarks hide critical reasoning failures in long, multimodal documents, which BRIDGE exposes through step-level evaluation.

Abstract

Multi-hop question answering (QA) is widely used to evaluate the reasoning capabilities of large language models, yet most benchmarks focus on final answer correctness and overlook intermediate reasoning, especially in long multimodal documents. We introduce BRIDGE, a benchmark for multi-hop reasoning over long scientific papers that require integrating evidence across text, tables, and figures. The dataset supports both chain-like and fan-out structures and provides explicit multi-hop reasoning annotations for step-level evaluation beyond answer accuracy. Experiments with state-of-the-art LLMs and multimodal retrieval-augmented generation (RAG) systems reveal systematic deficiencies in evidence aggregation and grounding that remain hidden under conventional answer-only evaluation. BRIDGE provides a targeted testbed for diagnosing reasoning failures in long multimodal documents.
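To make the contrast between answer-only and step-level evaluation concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not BRIDGE's actual annotation schema or metric: the evidence identifiers (`sec_3`, `table_2`, `fig_4`), the function names, and the use of simple evidence recall as the step-level score are all hypothetical.

```python
def answer_accuracy(predicted: str, gold: str) -> float:
    """Answer-only scoring: 1.0 on a case-insensitive exact match, else 0.0."""
    return float(predicted.strip().lower() == gold.strip().lower())


def step_recall(cited_evidence: set[str], gold_hops: list[str]) -> float:
    """Step-level scoring (assumed here to be evidence recall):
    the fraction of annotated hops whose evidence the model cited."""
    if not gold_hops:
        return 0.0
    hits = sum(1 for hop in gold_hops if hop in cited_evidence)
    return hits / len(gold_hops)


# Hypothetical question whose gold reasoning spans text, a table, and a figure.
gold_hops = ["sec_3", "table_2", "fig_4"]
gold_answer = "42%"

# Hypothetical model output: the right answer, grounded in one hop only.
model_answer = " 42% "
model_evidence = {"sec_3"}

acc = answer_accuracy(model_answer, gold_answer)
recall = step_recall(model_evidence, gold_hops)
print(f"answer accuracy: {acc:.2f}, step recall: {recall:.2f}")
```

In this toy case the answer is scored as fully correct while only one of three annotated hops is grounded, which is the kind of gap that answer-only evaluation cannot surface.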

Eval Frameworks & Benchmarks | Multimodal Models | Reasoning & Chain-of-Thought

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
