Barnard College, Columbia, NYU | Jun 9, 2026 | arXiv:2606.10460

LakeQA: An Exploratory QA Benchmark over a Million-Scale Data Lake

Haonan Wang, Jiaxiang Liu, Yurong Liu, Austin Senna Wijaya, Tianle Zhou, Eden Wu, Yijia Chen, Wanting You, Reya Vir, Daniela Pinto, Grace Fan, Yusen Zhang, Juliana Freire, Eugene Wu

AI Summary

LakeQA is a benchmark for search-centric question answering (QA) over a data lake of approximately 9.5 TB of text drawn from Wikipedia and open-source government data. Each task requires both finding the relevant documents and performing multi-hop reasoning over them, mirroring real-world information retrieval. In experiments on seven frontier LLMs, even GPT-5.2 reaches an exact-match score of only 18.37%, which shows how difficult it is to combine search and reasoning.

Key Contribution

Even a frontier LLM such as GPT-5.2 reaches an exact-match score of only 18.37% on LakeQA, a benchmark that demands both search and multi-hop reasoning.

Abstract

Recent large language models (LLMs) have shown rapid progress in reading-based question answering (QA), where evidence is explicitly provided or can be trivially retrieved. In contrast, real-world questions are often not paired with accurate evidence documents. The useful evidence resides in massive data lakes, making search a prerequisite for answering. However, there is a lack of comprehensive benchmarks that require both searching and reasoning over large data lakes. To this end, we introduce LakeQA, a comprehensive benchmark for search-centric question answering over data lakes that jointly emphasizes searching and reasoning capabilities. LakeQA is built on a heterogeneous collection of approximately 9.5 TB of text resources from Wikipedia and open-source government data, spanning structured and unstructured data. To ensure task quality, each sample is annotated by at least one Ph.D.-level expert. Each task requires long-horizon multi-hop reasoning with implicit intermediate steps: agents need to discover the correct documents and then compose evidence across sources to produce the answer. Experimental results on seven frontier LLMs demonstrate that LakeQA is challenging. For instance, GPT-5.2 achieves only an exact-match score of 18.37% on LakeQA. Overall, LakeQA provides a realistic testbed for developing LLM agents that can both find and analyze data in modern data lakes.
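The abstract reports results as an exact-match score but does not say how answers are normalized before comparison. As an illustration only, the sketch below assumes a common SQuAD-style normalization (lowercasing, stripping punctuation and English articles, collapsing whitespace). The function names `normalize`, `exact_match`, and `exact_match_score` are hypothetical and are not taken from the LakeQA paper or its code.

```python
import re
import string


def normalize(text: str) -> str:
    """Assumed normalization: lowercase, drop punctuation and articles, collapse spaces."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, gold: str) -> bool:
    """True when the normalized prediction equals the normalized gold answer."""
    return normalize(prediction) == normalize(gold)


def exact_match_score(predictions: list[str], golds: list[str]) -> float:
    """Percentage of predictions that exactly match their gold answer."""
    if len(predictions) != len(golds):
        raise ValueError("predictions and golds must have the same length")
    if not golds:
        return 0.0
    hits = sum(exact_match(p, g) for p, g in zip(predictions, golds))
    return 100.0 * hits / len(golds)
```

Under this assumed normalization, a prediction of "The Hudson River." would count as matching a gold answer of "hudson river". The benchmark's actual scoring rules may be stricter or looser.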

Eval Frameworks & Benchmarks | Recommendation & Information Retrieval


Year: 2026

Venue: N/A
