Department of Radiology & Biomedical

Jun 23, 2026

arXiv:2606.24530

NatureBench: Can Coding Agents Match the Published SOTA of Nature-Family Papers?

Yuru Wang, Lejun Cheng, Yuxin Zuo, Sihang Zeng, Bingxiang He, Che Jiang, Junlin Yang, Yuchong Wang, Kaikai Zhao, Weifeng Huang, Kai Tian, Zhenzhao Yuan, Jincheng Zhong, Weizhi Wang, Ning Ding, Bowen Zhou, Kaiyan Zhang

AI Summary

This paper introduces NatureBench, a benchmark of 90 tasks derived from Nature-family publications that assesses whether AI coding agents can move beyond reproduction toward scientific discovery. The study evaluates ten frontier agent configurations under a strict protocol with web search disabled and finds that the strongest model surpasses the published SOTA on only 17.8% of tasks under the g>0.1 criterion. Where agents succeed, they do so mainly by translating scientific problems into familiar supervised prediction problems rather than through novel scientific reasoning. Failures stem mostly from wrong method choice and insufficient compute budget, not from misunderstanding the task.

Key Contribution

AI coding agents can translate scientific tasks into familiar supervised prediction problems but rarely go further: the strongest configuration surpasses the published SOTA on only 17.8% of the 90 tasks.

Abstract

We introduce NatureBench, a cross-discipline benchmark of 90 tasks distilled from peer-reviewed Nature-family publications, designed to evaluate whether AI coding agents can move beyond reproduction toward discovery on real scientific problems. NatureBench is built on NatureGym, an automated pipeline that constructs a standardized, per-task containerized environment from a source paper, addressing the environment-fragmentation problem that has limited the credibility of prior agent-on-research benchmarks. Evaluating ten frontier agent configurations under a strict web-search-disabled protocol, we find that the strongest model surpasses SOTA on only 17.8% of tasks under the g>0.1 criterion. Analysis of method pathways reveals that agents succeed primarily through methodological translation, converting scientific tasks into familiar supervised prediction problems, rather than through genuine scientific invention. Failures are dominated by wrong method choice and insufficient compute budget, not by task misunderstanding. We release the benchmark, the NatureGym pipeline, and a public leaderboard with maintainer-side reproduction. Code: https://github.com/FrontisAI/NatureBench
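As a rough illustration of how a headline figure such as "17.8% of tasks under the g>0.1 criterion" could be tallied, the sketch below counts the tasks whose score exceeds a threshold. The abstract does not define g, so the code simply treats it as a per-task gain score relative to the published SOTA; that reading, the function name `surpass_rate`, and the example scores are all assumptions made for illustration, not the benchmark's actual scoring code.

```python
from typing import Sequence


def surpass_rate(gains: Sequence[float], threshold: float = 0.1) -> float:
    """Return the fraction of tasks whose gain score strictly exceeds the threshold.

    `gains` holds one hypothetical per-task score g; how g is defined is an
    assumption here, since the abstract only names the g>0.1 criterion.
    """
    if not gains:
        raise ValueError("need at least one task score")
    return sum(g > threshold for g in gains) / len(gains)


# Made-up scores for 90 tasks: 16 above the threshold, 74 at or below it.
example_gains = [0.25] * 16 + [0.0] * 74
rate = surpass_rate(example_gains)
print(f"{rate:.1%}")  # 17.8%
```

With 90 tasks, 16 tasks above the threshold gives 16/90 ≈ 17.8%, which is consistent with the reported figure; the exact task count is inferred from the percentage rather than stated in the abstract.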

Eval Frameworks & Benchmarks

Scientific Discovery & Drug Design

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
