QFNU | Syracuse | Mar 29, 2026 | arXiv:2603.27646

PRBench: End-to-end Paper Reproduction in Physics Research

Shi Qiu, Junyi Deng, Jun Deng, Yiwei Deng, Haoran Dong, Jieyu Fu, Mao Li, Zeyu Li, Zhaolong Zhang, Huiwen Zheng, Lei Bao, Leidong Bao, Anqiang Lv, Anqi Lv, Zihan Mo, Yadi Niu, Yiyang Peng, Yu Tian, Yilin Wang, Yili Wang, Ziyu Wang, Jiashen Wei, Liuheng Wu, Aoran Xue, Ao Xue, Leyi Yang, Guanglu Yuan, Xiarui Zhan, Xia Zhan, Jingjun Zhang, Zifan Zheng, Pengfei Liu, Li Zhen, Linrui Zhen, Kaiyang Li, Qichang Li, Ziheng Zhou, Guokui Nian, Guo-En Nian, Yunwei Xiao, Qing-Hong Cao, Linjie Dai, Xu Feng, Pengyu Gao, Peng Gao, Ying Gu, Chang Liu, Jia Liu, M. Luo, Ming-xing Luo, Yan-Qing Ma, Liang-You Peng, Liangchen Peng, Huichao Song, Shufeng Wang, Chenxu Wang, Tao Wang, Yi-Nan Wang, Chengyin Wu, C. Wu, Pengwei Zhao, Hua Xing Zhu, H. Zhu

AI Summary

PRBench, a new benchmark, assesses LLM agents' ability to reproduce physics research papers end-to-end, requiring them to understand methodology, implement algorithms, and match published results. Evaluating coding agents on 30 tasks across 11 physics subfields, the best agent (GPT-5.3-Codex) achieved only a 34% mean score and 0% end-to-end callback success, revealing critical weaknesses in data accuracy, code correctness, and debugging. The benchmark highlights systematic failures in formula implementation, simulation debugging, and data fabrication, underscoring the gap between current AI capabilities and autonomous scientific research.

Key Contribution

Current LLM agents cannot yet reproduce published physics papers end-to-end: the best agent scores only 34% on a new benchmark built for this purpose, and none achieves a full end-to-end success.

Abstract

AI agents powered by large language models exhibit strong reasoning and problem-solving capabilities, enabling them to assist scientific research tasks such as formula derivation and code generation. However, whether these agents can reliably perform end-to-end reproduction from real scientific papers remains an open question. We introduce PRBench, a benchmark of 30 expert-curated tasks spanning 11 subfields of physics. Each task requires an agent to comprehend the methodology of a published paper, implement the corresponding algorithms from scratch, and produce quantitative results matching the original publication. Agents are provided only with the task instruction and paper content, and operate in a sandboxed execution environment. All tasks are contributed by domain experts from over 20 research groups at the School of Physics, Peking University, each grounded in a real published paper and validated through end-to-end reproduction with verified ground-truth results and detailed scoring rubrics. Using an agentified assessment pipeline, we evaluate a set of coding agents on PRBench and analyze their capabilities across key dimensions of scientific reasoning and execution. The best-performing agent, OpenAI Codex powered by GPT-5.3-Codex, achieves a mean overall score of 34%. All agents exhibit a zero end-to-end callback success rate, with particularly poor performance in data accuracy and code correctness. We further identify systematic failure modes, including errors in formula implementation, inability to debug numerical simulations, and fabrication of output data. Overall, PRBench provides a rigorous benchmark for evaluating progress toward autonomous scientific research.
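The abstract reports two headline numbers that can look contradictory: a 34% mean overall score and a zero end-to-end success rate. The sketch below illustrates how both can hold at once. It is not the paper's scoring pipeline. The dimension names, the equal weighting, the 0.9 threshold, and the sample scores are all assumptions made for illustration.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class TaskResult:
    """Hypothetical per-task rubric scores, each in [0, 1]."""
    name: str
    methodology: float
    code_correctness: float
    data_accuracy: float

    def dimensions(self) -> list[float]:
        return [self.methodology, self.code_correctness, self.data_accuracy]

    def overall(self) -> float:
        # Assumption: equal weights; the real rubric weights are not given here.
        return mean(self.dimensions())

    def end_to_end_success(self, threshold: float = 0.9) -> bool:
        # Assumption: a task succeeds only if every dimension clears the threshold.
        return min(self.dimensions()) >= threshold


def summarize(results: list[TaskResult]) -> tuple[float, float]:
    """Return (mean overall score, end-to-end success rate)."""
    mean_score = mean(r.overall() for r in results)
    success_rate = sum(r.end_to_end_success() for r in results) / len(results)
    return mean_score, success_rate


# Made-up scores: partial credit on every task, but no task passes all dimensions.
results = [
    TaskResult("task_a", 0.75, 0.25, 0.0),
    TaskResult("task_b", 0.5, 0.5, 0.25),
    TaskResult("task_c", 1.0, 0.5, 0.0),
]
mean_score, success_rate = summarize(results)
print(f"mean overall score: {mean_score:.2f}")
print(f"end-to-end success rate: {success_rate:.2f}")
```

Under these assumptions, partial credit for understanding the method raises the mean score even when no task reproduces the published numbers, so the success rate stays at zero.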

Eval Frameworks & Benchmarks | Reasoning & Chain-of-Thought | Scientific Discovery & Drug Design

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
