This paper introduces ForeSci, a benchmark that assesses whether LLM agents can make forward-looking research judgments from historical evidence. Its 500 tasks span four AI domains and four decision families, and each task is paired with a knowledge base that excludes post-cutoff evidence. The study evaluates native LLMs, Hybrid RAG, and three research-agent adaptations, and finds that explicit evidence organization improves traceability and factual support, though the gains vary strongly by decision family. Diagnostics also show a recurring disconnect: agents often cite relevant evidence yet forecast the wrong research object.
Agents can cite relevant evidence yet still forecast the wrong research direction, so retrieving the right evidence does not guarantee the right decision.
AI research often requires decisions before future evidence exists: which bottleneck to attack, which direction to pursue, or where a project should be positioned. We introduce ForeSci, a temporally controlled benchmark for evaluating whether LLM agents can make such forward-looking research judgements from historical evidence. ForeSci contains 500 tasks across four fast-moving AI domains and four decision families. Each task is paired with a cutoff-aligned offline knowledge base; post-cutoff papers are hidden during generation and used only for validation. To avoid random future-event prediction, tasks are derived from pre-cutoff taxonomy branches and evidence signals, and answer-generation backbones are selected to precede the task cutoffs. We evaluate native LLMs, Hybrid RAG, and three research-agent adaptations across four backbones. Results show that explicit evidence organization improves traceability and factual support, but gains depend strongly on the decision family. Diagnostics reveal a recurring evidence-decision decoupling: agents may cite relevant evidence while forecasting the wrong research object. ForeSci turns forward-looking AI research judgement into a controlled benchmark for evaluating research agents as decision-making systems.
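The temporal control described above can be illustrated with a minimal sketch: papers dated on or before a task's cutoff form the knowledge base an agent may see, while later papers are held out for validation. The names `Paper` and `split_by_cutoff`, the sample records, and the choice to treat the cutoff date as inclusive are all assumptions made for illustration; they are not taken from the ForeSci implementation.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Paper:
    # Hypothetical record type; real benchmark entries likely carry more fields.
    title: str
    published: date


def split_by_cutoff(papers, cutoff):
    """Split papers into (visible, held_out) around a cutoff date.

    Assumption: the cutoff is inclusive, so a paper published on the
    cutoff date counts as visible historical evidence.
    """
    visible = [p for p in papers if p.published <= cutoff]
    held_out = [p for p in papers if p.published > cutoff]
    return visible, held_out


# Made-up sample data, for illustration only.
papers = [
    Paper("A", date(2023, 3, 1)),
    Paper("B", date(2023, 11, 15)),
    Paper("C", date(2024, 6, 30)),
]

visible, held_out = split_by_cutoff(papers, date(2023, 12, 31))
```

In this sketch an agent would retrieve only from `visible` when generating an answer, and `held_out` would be consulted afterwards to check the forecast.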