The paper introduces DREAM, a framework for evaluating deep research agents that addresses the limitations of static evaluators in assessing temporal validity and factual correctness. DREAM achieves this through an agentic evaluation protocol that combines query-agnostic metrics with adaptive metrics generated by a tool-calling agent. Experiments demonstrate that DREAM is more sensitive to factual and temporal decay than existing benchmarks, providing a scalable, reference-free evaluation approach.
Static benchmarks can be fooled by fluent text and aligned citations; DREAM makes evaluation itself agentic, closing the capability mismatch that prevents static evaluators from verifying the temporal validity and factual correctness of research agents' reports.
Deep Research Agents generate analyst-grade reports, yet evaluating them remains challenging due to the absence of a single ground truth and the multidimensional nature of research quality. Recent benchmarks propose distinct methodologies, but they suffer from the Mirage of Synthesis, where strong surface-level fluency and citation alignment can obscure underlying factual and reasoning defects. We characterize this gap by introducing a taxonomy across four verticals that exposes a critical capability mismatch: static evaluators inherently lack the tool-use capabilities required to assess temporal validity and factual correctness. To address this, we propose DREAM (Deep Research Evaluation with Agentic Metrics), a framework that instantiates the principle of capability parity by making evaluation itself agentic. DREAM structures assessment through an evaluation protocol combining query-agnostic metrics with adaptive metrics generated by a tool-calling agent, enabling temporally aware coverage, grounded verification, and systematic reasoning probes. Controlled evaluations demonstrate that DREAM is significantly more sensitive to factual and temporal decay than existing benchmarks, offering a scalable, reference-free evaluation paradigm.
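As a rough illustration of the two-tier protocol, the Python sketch below wires a fixed, query-agnostic rubric to per-query checks planned and verified by a tool-calling agent. The abstract does not publish DREAM's interfaces, so every name here (ToolCallingAgent, plan_checks, verify, and the stub metrics) is an assumption, not the authors' implementation.

```python
from dataclasses import dataclass
from typing import Callable

# Illustrative sketch only: the abstract does not specify DREAM's interfaces,
# so every name below (ToolCallingAgent, plan_checks, verify, ...) is hypothetical.

@dataclass
class Check:
    """One adaptive check the agent plans for a specific query."""
    name: str
    claim: str

class ToolCallingAgent:
    """Stand-in for the tool-calling evaluator. A real agent would issue
    search/retrieval calls; here verification is a trivial stub."""

    def plan_checks(self, query: str) -> list[Check]:
        # Derive per-query verification targets (temporal and factual).
        return [
            Check("temporal_validity", f"facts about '{query}' are current"),
            Check("grounded_verification", f"claims about '{query}' match sources"),
        ]

    def verify(self, report: str, check: Check) -> float:
        # Placeholder verdict; a real agent would ground this in tool results.
        topic = check.claim.split("'")[1]
        return 1.0 if topic in report else 0.0

def query_agnostic_metrics() -> dict[str, Callable[[str], float]]:
    """Fixed rubric applied to every report regardless of the query."""
    return {
        "non_empty": lambda r: 1.0 if r.strip() else 0.0,
        "has_citations": lambda r: 1.0 if "[" in r and "]" in r else 0.0,
    }

def evaluate(query: str, report: str, agent: ToolCallingAgent) -> dict[str, float]:
    """Combine the static rubric with agent-generated adaptive metrics."""
    scores = {name: fn(report) for name, fn in query_agnostic_metrics().items()}
    for check in agent.plan_checks(query):
        scores[check.name] = agent.verify(report, check)
    return scores

print(evaluate("solid-state batteries",
               "Recent progress in solid-state batteries [1] suggests ...",
               ToolCallingAgent()))
```

The structural point this sketch tries to capture is capability parity: the adaptive checks run with the same kind of tool access the research agent had, so temporal and factual claims can be actively probed rather than pattern-matched against static references.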