EPFL · OPPO · Sheffield · Feb 24, 2026 · arXiv:2602.21143

A Benchmark for Deep Information Synthesis

Debjit Paul, Daniel Murphy, Milan Gritta, Ronald Cardenas, Victor Prokhorov, Lena Sophia Bolliger, Aysim Toker, Roy Miles, Andreea-Maria Oncescu, Jasivan Alex Sivakumar, Philipp Borchert, Ismail Elezi, Meiru Zhang, Ka Yiu Lee, Guchun Zhang, Jun Wang, Gerasimos Lampouras

AI Summary

The paper introduces DEEPSYNTH, a benchmark for evaluating LLM-based agents on complex, real-world tasks that require synthesizing information from multiple sources and reasoning over it in a structured way. It comprises 120 tasks across 7 domains, with data sources covering 67 countries, and is built through a multi-stage data collection pipeline in which annotators collect official data sources, create hypotheses, perform manual analysis, and design tasks with verifiable answers. Across 11 state-of-the-art LLMs and deep research agents, the best scores are an F1 of 8.97 and an LLM-judge score of 17.5, which underscores the benchmark's difficulty. The analysis finds that current agents struggle with hallucinations and with reasoning over large information spaces.

Key Contribution

Current LLM agents perform poorly at synthesizing information from multiple sources to solve realistic problems: the best of 11 evaluated systems reaches a maximum F1 score of 8.97 on the new DEEPSYNTH benchmark.

Abstract

Large language model (LLM)-based agents are increasingly used to solve complex tasks involving tool use, such as web browsing, code execution, and data analysis. However, current evaluation benchmarks do not adequately assess their ability to solve real-world tasks that require synthesizing information from multiple sources and inferring insights beyond simple fact retrieval. To address this, we introduce DEEPSYNTH, a novel benchmark designed to evaluate agents on realistic, time-consuming problems that combine information gathering, synthesis, and structured reasoning to produce insights. DEEPSYNTH contains 120 tasks collected across 7 domains and data sources covering 67 countries. DEEPSYNTH is constructed using a multi-stage data collection pipeline that requires annotators to collect official data sources, create hypotheses, perform manual analysis, and design tasks with verifiable answers. When evaluated on DEEPSYNTH, 11 state-of-the-art LLMs and deep research agents achieve a maximum F1 score of 8.97 and 17.5 on the LLM-judge metric, underscoring the difficulty of the benchmark. Our analysis reveals that current agents struggle with hallucinations and reasoning over large information spaces, highlighting DEEPSYNTH as a crucial benchmark for guiding future research.
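The abstract reports an F1 score over tasks with verifiable answers but does not say how that score is computed. As a minimal sketch only, assuming answers can be represented as key-value mappings and that a pair counts as correct on exact match, an F1 of this kind could look like the following. The function name `pair_f1`, the dictionary answer format, and the example values are all hypothetical and are not taken from the paper.

```python
def pair_f1(predicted: dict, gold: dict) -> float:
    """F1 over exactly matching (key, value) pairs.

    Assumed scoring rule for illustration; not the benchmark's official metric.
    """
    pred_pairs = set(predicted.items())
    gold_pairs = set(gold.items())
    # An empty prediction or an empty gold answer yields no credit.
    if not pred_pairs or not gold_pairs:
        return 0.0
    true_positives = len(pred_pairs & gold_pairs)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(pred_pairs)
    recall = true_positives / len(gold_pairs)
    return 2 * precision * recall / (precision + recall)


# Hypothetical task: the gold answer has three entries, the agent returns two,
# only one of which matches exactly.
gold = {"A": 1.2, "B": 3.4, "C": 5.6}
predicted = {"A": 1.2, "B": 9.9}
score = pair_f1(predicted, gold)  # precision 1/2, recall 1/3
```

Under this assumed rule, an answer earns credit only for entries it gets exactly right, so a partially correct multi-part answer scores well below 1. That is one plausible reason aggregate F1 on multi-part synthesis tasks can stay low even when agents retrieve some of the right facts.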

Eval Frameworks & Benchmarks · Reasoning & Chain-of-Thought · Tool Use & Agents

Citation Metrics

Citations: 0

Influential citations: 0

References: 28

Year: 2026

Venue: N/A
