The paper introduces BRIDGE, a new benchmark for multi-hop question answering over long scientific papers containing text, tables, and figures. BRIDGE includes explicit multi-hop reasoning annotations, enabling step-level evaluation of reasoning beyond just answer accuracy. Experiments using BRIDGE reveal that current LLMs and multimodal RAG systems struggle with evidence aggregation and grounding in long, multimodal contexts, even when they achieve high answer accuracy.
Current LLM benchmarks hide critical reasoning failures in long, multimodal documents, which BRIDGE exposes through step-level evaluation.
Multi-hop question answering (QA) is widely used to evaluate the reasoning capabilities of large language models, yet most benchmarks focus on final answer correctness and overlook intermediate reasoning, especially in long multimodal documents. We introduce BRIDGE, a benchmark for multi-hop reasoning over long scientific papers, where answering requires integrating evidence across text, tables, and figures. The dataset supports both chain-like and fan-out reasoning structures and provides explicit multi-hop reasoning annotations for step-level evaluation beyond answer accuracy. Experiments with state-of-the-art LLMs and multimodal retrieval-augmented generation (RAG) systems reveal systematic deficiencies in evidence aggregation and grounding that remain hidden under conventional answer-only evaluation. BRIDGE provides a targeted testbed for diagnosing reasoning failures in long multimodal documents.
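To make the contrast between answer-only and step-level evaluation concrete, here is a minimal sketch in Python. It is not BRIDGE's actual schema or metric: the `GoldStep` type, the `answer_correct` and `step_recall` helpers, the evidence identifiers, and the toy values are all hypothetical, chosen only to show how a prediction can match the final answer while citing little of the annotated evidence.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GoldStep:
    """One annotated reasoning step (hypothetical schema, not BRIDGE's)."""
    evidence_id: str  # invented identifier for a text span, table, or figure
    modality: str     # "text", "table", or "figure"


def answer_correct(predicted: str, gold: str) -> bool:
    """Answer-only evaluation: case-insensitive exact match on the final answer."""
    return predicted.strip().lower() == gold.strip().lower()


def step_recall(cited_evidence: set[str], gold_steps: list[GoldStep]) -> float:
    """Step-level evaluation: fraction of annotated steps whose evidence was cited."""
    if not gold_steps:
        return 0.0
    hits = sum(1 for step in gold_steps if step.evidence_id in cited_evidence)
    return hits / len(gold_steps)


# Toy example with invented values: three annotated steps across modalities.
gold_steps = [
    GoldStep("sec3_para2", "text"),
    GoldStep("table_2", "table"),
    GoldStep("fig_4", "figure"),
]
gold_answer = "12.5%"

# A hypothetical system output that gets the answer right but grounds only one step.
prediction = {"answer": "12.5%", "cited": {"sec3_para2"}}

is_correct = answer_correct(prediction["answer"], gold_answer)
recall = step_recall(prediction["cited"], gold_steps)

print(f"answer correct: {is_correct}")
print(f"step recall:    {recall:.2f}")
```

In this toy case the answer-only check passes while only one of the three annotated steps is grounded, which is the kind of gap the abstract says answer-only evaluation hides. A real step-level metric would likely also need to handle step ordering and partial evidence matches, which this sketch ignores.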