BAAI | Apr 28, 2026 | arXiv:2604.25256

AutoResearchBench: Benchmarking AI Agents on Complex Scientific Literature Discovery

Lei Xiong, Kun Luo, Ziyi Xia, Wenbo Zhang, Jin-Ge Yao, Zheng Liu, Jing Shao, Jingying Shao, Jianlyu Chen, Hongjin Qian, Xi Yang, X. Yang, Qian Yu, Hao Li, C. Yue, Chen Yue, Xia'an Du, Yuyang Wang, Yesheng Liu, Haiyu Xu, Zhicheng Dou

AI Summary

The paper introduces AutoResearchBench, a new benchmark designed to evaluate AI agents' ability to autonomously discover relevant scientific literature. It features two tasks: Deep Research (finding a specific paper through multi-step probing) and Wide Research (comprehensively collecting papers satisfying given conditions). Even the strongest LLMs struggle, reaching only 9.39% accuracy on Deep Research and 9.31% IoU on Wide Research, which highlights the benchmark's difficulty and the gap in current AI agent capabilities for research-oriented tasks.

Key Contribution

LLMs that ace general web browsing still fail miserably at autonomous scientific literature discovery, revealing a critical gap in research-oriented AI agent capabilities.

Abstract

Autonomous scientific research has advanced significantly thanks to the development of AI agents. One key step in this process is finding the right scientific literature, whether to explore existing knowledge for a research problem or to acquire evidence for verifying assumptions and supporting claims. To assess AI agents' capability in driving this process, we present AutoResearchBench, a dedicated benchmark for autonomous scientific literature discovery. AutoResearchBench consists of two complementary task types: (1) Deep Research, which requires tracking down a specific target paper through a progressive, multi-step probing process, and (2) Wide Research, which requires comprehensively collecting a set of papers satisfying given conditions. Compared to previous benchmarks on agentic web browsing, AutoResearchBench is distinguished along three dimensions: it is research-oriented, calling for in-depth comprehension of scientific concepts; literature-focused, demanding fine-grained utilization of detailed information; and open-ended, involving an unknown number of qualified papers and thus requiring deliberate reasoning and search throughout. These properties make AutoResearchBench uniquely suited for evaluating autonomous research capabilities, and extraordinarily challenging. Even the most powerful LLMs, despite having largely conquered general agentic web-browsing benchmarks such as BrowseComp, achieve only 9.39% accuracy on Deep Research and 9.31% IoU on Wide Research, while many other strong baselines fall below 5%. We publicly release the dataset, evaluation pipeline, and code at https://github.com/CherYou/AutoResearchBench to facilitate future research in this direction.
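The two headline metrics can be illustrated with a minimal sketch. It assumes that Deep Research accuracy is an exact match on a single target paper identifier and that Wide Research IoU is the standard intersection-over-union (Jaccard index) between the predicted and ground-truth paper sets; the abstract does not spell out the scoring rules, so the released evaluation pipeline may differ. The function names and paper identifiers below are hypothetical.

```python
from typing import Iterable, Sequence


def deep_research_accuracy(predicted: Sequence[str], targets: Sequence[str]) -> float:
    """Fraction of questions whose predicted paper ID equals the target ID.

    Assumption: one predicted ID and one target ID per question.
    """
    if len(predicted) != len(targets):
        raise ValueError("predicted and targets must have the same length")
    if not targets:
        return 0.0
    hits = sum(p == t for p, t in zip(predicted, targets))
    return hits / len(targets)


def wide_research_iou(predicted: Iterable[str], relevant: Iterable[str]) -> float:
    """Intersection-over-union between predicted and ground-truth paper sets.

    Assumption: papers are compared by identifier; duplicates are ignored.
    """
    pred_set, rel_set = set(predicted), set(relevant)
    union = pred_set | rel_set
    if not union:
        # Both sets empty: treated as a perfect match in this sketch.
        return 1.0
    return len(pred_set & rel_set) / len(union)


# Hypothetical identifiers, for illustration only.
acc = deep_research_accuracy(["p1", "p2", "x9"], ["p1", "p2", "p3"])
iou = wide_research_iou({"a", "b", "c"}, {"b", "c", "d"})
print(f"accuracy={acc:.3f}, IoU={iou:.3f}")
```

Under this definition, IoU penalizes both missed papers and spurious ones, which matches the open-ended setting where the number of qualifying papers is not known in advance.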

Eval Frameworks & Benchmarks, Scientific Discovery & Drug Design, Tool Use & Agents

Citation Metrics

Citations: 0

Influential citations: 0

References: 40

Year: 2026

Venue: N/A
