Tsinghua AI, SCUT, Shanghai AI Lab, SJTU, TJU | May 28, 2026 | arXiv:2606.07591

ResearchClawBench: A Benchmark for End-to-End Autonomous Scientific Research

Wanghan Xu, Shuo Li, Tianlin Ye, Qinglong Cao, Yixin Chen, Hengjian Gao, Yiheng Wang, Qi Li, Kun Li, Sheng Xu, Shengdu Chai, Zhangrui Zhao, Weijie Ma, Haoyu Zhou, Haoxiang Yin, Lixue Cheng, Chaofan Hu, Lu Mi, Xuxuan Xie, Ruizhe Chen, Zhiwang Zhou, Xingjian Guo, Yuhao Zhou, Xuming He, Shengyuan Xu, Xinyu Gu, Jiamin Wu, Mianxin Liu, Chunfeng Song, Dongzhan Zhou, Shixiang Tang, Yuqiang Li, Mao Su, Siqi Sun, Bin Wang, Xue Yang, Zhenfei Yin, Tianfan Fu, Guangtao Zhai, Wanli Ouyang

AI Summary

This paper introduces ResearchClawBench, a benchmark for evaluating the end-to-end autonomous scientific research capabilities of AI coding agents across 40 tasks in 10 scientific domains. Each task is grounded in a real published paper, and expert-curated multimodal rubrics score re-discovery of the target paper's results while leaving room for new discovery. The evaluation shows that current systems, including the auto-research agent Claude Code and the LLM Claude-Opus-4.7 run through ResearchHarness, fall short of reliable re-discovery, with failures concentrated in experimental protocol mismatch, evidence mismatch, and missing scientific core.

Key Contribution

Current AI agents cannot yet reliably re-discover published scientific results: the strongest autonomous agent, Claude Code, averages only 21.5, and the strongest ResearchHarness LLM, Claude-Opus-4.7, averages 20.7.

Abstract

AI coding agents are increasingly used for scientific work, but their end-to-end autonomous research capability remains difficult to verify. We present ResearchClawBench, a benchmark for evaluating autonomous scientific research across 40 tasks from 10 scientific domains. Each task is grounded in a real published paper, provides related literature and raw data, and hides the target paper during evaluation. Expert-curated multimodal rubrics decompose the target scientific artifacts into weighted criteria, enabling evaluation of target-paper-level re-discovery while leaving room for new discovery. We evaluate seven autonomous research (auto-research) agents under a unified protocol and seventeen native LLMs through the lightweight ResearchHarness. Current systems remain far from reliable re-discovery: the strongest autonomous agent, Claude Code, averages 21.5, and the strongest ResearchHarness LLM, Claude-Opus-4.7, averages 20.7, with an LLM frontier mean of only 26.5. Error analysis shows that failures concentrate in experimental protocol mismatch, evidence mismatch, and missing scientific core. ResearchClawBench provides a reproducible evaluation frontier for measuring progress toward autonomous scientific research.
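
The abstract says rubrics decompose each target artifact into weighted criteria but does not give the aggregation formula. As a rough illustration only, the sketch below assumes a task score is the weight-normalized average of per-criterion scores on a 0 to 100 scale. The `Criterion` class, the `weighted_rubric_score` function, the criterion names, and all numbers are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Criterion:
    """One rubric item (hypothetical structure, not the paper's schema)."""
    name: str      # short description of what the grader checks
    weight: float  # relative importance assigned by the rubric author
    score: float   # grader-assigned score, assumed to lie in [0, 100]


def weighted_rubric_score(criteria):
    """Return the weight-normalized average score over all criteria.

    Assumption: the task score is sum(w_i * s_i) / sum(w_i). The paper
    may aggregate differently.
    """
    total_weight = sum(c.weight for c in criteria)
    if total_weight <= 0:
        raise ValueError("rubric needs a positive total weight")
    return sum(c.weight * c.score for c in criteria) / total_weight


# Made-up rubric for a single task, for illustration only.
example = [
    Criterion("follows the target experimental protocol", 3.0, 20.0),
    Criterion("key figure shows the reported trend", 2.0, 50.0),
    Criterion("states the core scientific finding", 1.0, 0.0),
]
task_score = weighted_rubric_score(example)
print(f"task score: {task_score:.1f}")
```

Under this assumption, a heavily weighted criterion that scores poorly pulls the task score down more than a lightly weighted one, which is one plausible reading of how a protocol mismatch could dominate a low overall result.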

Eval Frameworks & Benchmarks | Scientific Discovery & Drug Design

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
