The paper introduces PreScience, a benchmark for evaluating how well AI systems can forecast scientific advances. It decomposes the research process into four tasks: collaborator prediction, prior work selection, contribution generation, and impact prediction. The authors curated a dataset of 98K AI papers with temporally aligned metadata, plus a graph of author publication histories and citations spanning 502K papers in total. On contribution generation, evaluated with the authors' LACERScore metric, frontier LLMs reach only moderate similarity to the ground truth (GPT-5 averages 5.6 on a 1-10 scale). When the four tasks are composed into a 12-month end-to-end simulation, the resulting synthetic corpus is less diverse and less novel than human-authored research from the same period.
Even the most advanced LLMs fall short in simulating scientific progress, producing synthetic research corpora that lack the diversity and novelty of human-authored work.
Can AI systems trained on the scientific record up to a fixed point in time forecast the scientific advances that follow? Such a capability could help researchers identify collaborators and impactful research directions, and anticipate which problems and methods will become central next. We introduce PreScience, a scientific forecasting benchmark that decomposes the research process into four interdependent generative tasks: collaborator prediction, prior work selection, contribution generation, and impact prediction. PreScience is a carefully curated dataset of 98K recent AI-related research papers, featuring disambiguated author identities, temporally aligned scholarly metadata, and a structured graph of companion author publication histories and citations spanning 502K total papers. We develop baselines and evaluations for each task, including LACERScore, a novel LLM-based measure of contribution similarity that outperforms previous metrics and approximates inter-annotator agreement. We find that substantial headroom remains in each task; e.g., in contribution generation, frontier LLMs achieve only moderate similarity to the ground truth (GPT-5 averages 5.6 on a 1-10 scale). When composed into a 12-month end-to-end simulation of scientific production, the resulting synthetic corpus is systematically less diverse and less novel than human-authored research from the same period.