The paper introduces IdeaBench, a benchmark for evaluating LLMs on research idea generation, built from 2,374 influential papers across eight domains and their 29,408 referenced works. LLMs are profiled as domain-specific researchers and grounded in the same contextual constraints as human researchers, so that their pre-trained knowledge can be leveraged for idea generation. The paper also proposes a reference-based metric, aligned with human judgment, to quantify idea quality; evaluation with this metric shows that LLMs excel at novelty but struggle with feasibility.
LLMs are great at dreaming up research ideas, but IdeaBench reveals they often lack a reality check, struggling with feasibility.
Large Language Models (LLMs) have revolutionized interactions between humans and artificial intelligence (AI) systems, demonstrating state-of-the-art performance across various domains, including scientific discovery and hypothesis generation. However, the absence of a comprehensive and systematic evaluation framework for LLM-driven research idea generation hinders a rigorous understanding of their strengths and limitations. To address this gap, we propose IdeaBench, a benchmark system that provides a structured dataset and evaluation framework for standardizing the assessment of research idea generation by LLMs. Our dataset comprises titles and abstracts from 2,374 influential papers across eight research domains, along with their 29,408 referenced works, creating a context-rich environment that mirrors human researchers' ideation processes. By profiling LLMs as domain-specific researchers and grounding them in similar contextual constraints, we directly leverage the models' knowledge learned during the pre-training stage to generate new research ideas. To systematically evaluate LLMs' research ideation capability and approximate human assessment, we propose a reference-based metric that aligns with human judgment to quantify idea quality with the assistance of LLMs. Through this evaluation, we find that while LLMs excel at generating novel ideas, they may struggle with generating feasible ones. IdeaBench serves as a critical resource for benchmarking and comparing LLMs, ultimately advancing research on AI's role in automating scientific discovery.
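To make the described pipeline concrete, below is a minimal Python sketch of how one might profile an LLM as a domain-specific researcher, ground it in a target paper's references, and then score the generated idea with an LLM judge against the held-out abstract. This is an illustrative reconstruction, not the authors' implementation: the `PaperRecord` structure, prompt wording, 1-5 scoring scale, and the `LLMClient` stand-in for a chat-style API are all assumptions.

```python
from dataclasses import dataclass
from typing import Callable, List

# Stand-in for any chat-style LLM API (hosted or local):
# it takes a prompt string and returns the model's text completion.
LLMClient = Callable[[str], str]


@dataclass
class PaperRecord:
    """One IdeaBench-style entry: a target paper plus its referenced works."""
    domain: str
    title: str
    abstract: str
    references: List[str]  # titles + abstracts of referenced papers


def build_ideation_prompt(record: PaperRecord, max_refs: int = 20) -> str:
    """Profile the LLM as a domain researcher and ground it in the references."""
    refs = "\n\n".join(f"[{i + 1}] {r}" for i, r in enumerate(record.references[:max_refs]))
    return (
        f"You are an experienced researcher in {record.domain}.\n"
        f"Below are titles and abstracts of recent works you have read:\n\n{refs}\n\n"
        "Based only on this background, propose a novel and feasible research idea. "
        "State the idea in 3-5 sentences."
    )


def build_judge_prompt(idea: str, record: PaperRecord) -> str:
    """Ask an LLM judge to score the idea against the held-out target abstract."""
    return (
        "You are reviewing a proposed research idea.\n"
        f"Proposed idea:\n{idea}\n\n"
        f"Reference paper (ground truth) abstract:\n{record.abstract}\n\n"
        "Rate the idea's novelty and feasibility relative to the reference paper, "
        "each on a 1-5 scale. Answer as: novelty=<n>, feasibility=<m>."
    )


def evaluate_record(generator: LLMClient, judge: LLMClient, record: PaperRecord) -> str:
    """Generate an idea from the references, then score it with the judge model."""
    idea = generator(build_ideation_prompt(record))
    return judge(build_judge_prompt(idea, record))
```

In this sketch, the target paper's own abstract is withheld from the generator and used only by the judge as the reference point, which mirrors the benchmark's reference-based evaluation setup at a high level.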