UW, Northeastern, Northwestern | Feb 24, 2026 | arXiv:2602.20459

PreScience: A Benchmark for Forecasting Scientific Contributions

Anirudh Ajith, Amanpreet Singh, Jay DeYoung, Nadav Kunievsky, Austin C. Kozlowski, Oyvind Tafjord, James Evans, Daniel S. Weld, Tom Hope, Doug Downey

AI Summary

The paper introduces PreScience, a benchmark for evaluating how well AI systems forecast scientific advances. It decomposes the research process into four tasks: collaborator prediction, prior work selection, contribution generation, and impact prediction. The authors curated a dataset of 98K AI-related papers with metadata, plus a graph of author publication histories and citations spanning 502K total papers. In contribution generation, evaluated with the new LACERScore metric, frontier LLMs achieve only moderate similarity to the ground truth, and a 12-month end-to-end simulation yields a synthetic corpus that is less diverse and less novel than human-authored research from the same period.

Key Contribution

Even the most advanced LLMs fall short in simulating scientific progress: the synthetic research corpora they produce are systematically less diverse and less novel than human-authored work from the same period.

Abstract

Can AI systems trained on the scientific record up to a fixed point in time forecast the scientific advances that follow? Such a capability could help researchers identify collaborators and impactful research directions, and anticipate which problems and methods will become central next. We introduce PreScience -- a scientific forecasting benchmark that decomposes the research process into four interdependent generative tasks: collaborator prediction, prior work selection, contribution generation, and impact prediction. PreScience is a carefully curated dataset of 98K recent AI-related research papers, featuring disambiguated author identities, temporally aligned scholarly metadata, and a structured graph of companion author publication histories and citations spanning 502K total papers. We develop baselines and evaluations for each task, including LACERScore, a novel LLM-based measure of contribution similarity that outperforms previous metrics and approximates inter-annotator agreement. We find substantial headroom remains in each task -- e.g. in contribution generation, frontier LLMs achieve only moderate similarity to the ground truth (GPT-5 averages 5.6 on a 1-10 scale). When composed into a 12-month end-to-end simulation of scientific production, the resulting synthetic corpus is systematically less diverse and less novel than human-authored research from the same period.
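The forecasting setup in the abstract (a record known up to a fixed point in time, followed by a 12-month window to predict) can be illustrated with a minimal temporal-split sketch. This is not the authors' code: the `Paper` record, its field names, the `temporal_split` helper, and the example dates are all hypothetical, and the actual dataset schema and cutoff are not given in this text.

```python
import calendar
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Paper:
    # Hypothetical record; the real PreScience schema is not shown here.
    paper_id: str
    published: date


def temporal_split(papers, cutoff, horizon_months=12):
    """Return (history, targets): papers published on or before `cutoff`,
    and papers published after it but within `horizon_months` months."""
    # Advance the cutoff by whole months to find the end of the window.
    total = cutoff.year * 12 + (cutoff.month - 1) + horizon_months
    end_year, end_month_index = divmod(total, 12)
    end_month = end_month_index + 1
    # Clamp the day so the end date is valid (e.g. Jan 31 + 1 month).
    last_day = calendar.monthrange(end_year, end_month)[1]
    window_end = date(end_year, end_month, min(cutoff.day, last_day))

    history = [p for p in papers if p.published <= cutoff]
    targets = [p for p in papers if cutoff < p.published <= window_end]
    return history, targets


# Illustrative data only; IDs and dates are made up.
papers = [
    Paper("a", date(2024, 6, 1)),
    Paper("b", date(2025, 3, 15)),
    Paper("c", date(2026, 2, 1)),
]
history, targets = temporal_split(papers, cutoff=date(2024, 12, 31))
```

Under these assumptions, a model would see only `history` when making predictions, and its outputs would be compared against `targets`; papers published after the window are excluded from both.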

Topics: Eval Frameworks & Benchmarks, Natural Language Processing, Scientific Discovery & Drug Design


Year: 2026

Venue: N/A
