The paper introduces a "Deep Research" pipeline that processes the full query paper and expands retrieved results breadth-first along their bibliographies, raising recall on the RollingEval-Jun25 benchmark from below 20% to above 80% relative to vanilla API-only search. It also critically examines human reference lists as ground truth for evaluation, finding that an LLM judge rates only 51% of human citations as moderately relevant or higher, compared to 86-88% for the strongest AI-based re-rankers. The authors advocate a multi-faceted evaluation that jointly reports recall, topical relevance, ranked-list diversity, and co-authorship distance.
Human-generated citation lists, long considered the gold standard for evaluating literature search, are surprisingly unreliable: an LLM judge rates only 51% of human citations as moderately relevant or higher.
We study large-scale literature search from two complementary angles: improving the retrieval pipeline, and stress-testing the human reference list as an evaluation target. First, we implement a Deep Research pipeline that processes the full query paper and expands the retrieved results breadth-first along their bibliographies, and show that it substantially outperforms vanilla API-only search, raising recall on RollingEval-Jun25 (a 250-paper literature-search benchmark) from below 20% to above 80%. Second, we use a neutral LLM-as-a-judge to determine if human references are sound ground truth for the task. We find significant limitations: only 51% of human citations are judged moderately relevant or higher, against 86-88% for the strongest AI-based re-rankers. We study this gap on the OpenAlex co-authorship graph, finding that humans are 2.5x more likely than the best AI re-rankers to cite a direct collaborator. Together, our results argue against single-axis literature-search evaluation: recall, topical-relevance scoring, ranked-list diversity, and a co-authorship-distance diagnostic each measure complementary properties of citation quality and should be reported jointly.
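
The breadth-first expansion and the recall measure described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `expand_breadth_first` and `recall`, the `max_depth` parameter, and the toy citation graph are all assumptions made for illustration, and the numbers it produces are not results from the paper.

```python
from collections import deque


def expand_breadth_first(seeds, bibliography, max_depth=2):
    """Collect every paper reachable from `seeds` within `max_depth` citation hops.

    `bibliography` maps a paper ID to the list of paper IDs it cites
    (a hypothetical in-memory stand-in for a real bibliographic API).
    """
    seen = set(seeds)
    queue = deque((paper, 0) for paper in seeds)
    while queue:
        paper, depth = queue.popleft()
        if depth == max_depth:
            continue  # do not expand beyond the hop limit
        for cited in bibliography.get(paper, []):
            if cited not in seen:
                seen.add(cited)
                queue.append((cited, depth + 1))
    return seen


def recall(retrieved, references):
    """Fraction of the reference list that appears in the retrieved set."""
    references = set(references)
    if not references:
        return 0.0
    return len(references & set(retrieved)) / len(references)


# Toy data, invented for illustration only.
bibliography = {
    "A": ["C", "D"],
    "B": ["D", "E"],
    "C": ["F"],
    "E": ["G"],
}
seeds = ["A", "B"]                        # stand-in for initial API search hits
human_refs = ["C", "E", "F", "G", "H"]    # stand-in for the query paper's reference list

baseline_recall = recall(seeds, human_refs)
one_hop_recall = recall(expand_breadth_first(seeds, bibliography, max_depth=1), human_refs)
two_hop_recall = recall(expand_breadth_first(seeds, bibliography, max_depth=2), human_refs)
print(baseline_recall, one_hop_recall, two_hop_recall)
```

On this toy graph, recall rises with each additional hop because references that the initial search misses are cited by papers it does find. Paper "H" is never reached, which shows why recall alone cannot say whether a missing reference was relevant in the first place.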