The paper introduces IdeaBench, a benchmark for evaluating LLMs on research idea generation, built from 2,374 influential papers across eight domains and their 29,408 referenced works. LLMs are profiled as domain-specific researchers and grounded in the same contextual constraints as human researchers, so that their pre-trained knowledge can be leveraged for idea generation. The paper also proposes a reference-based metric, aligned with human judgment, to quantify idea quality; evaluation with this metric shows that LLMs excel at novelty but struggle with feasibility.
LLMs are great at dreaming up research ideas, but IdeaBench reveals they often lack a reality check, struggling with feasibility.
Large Language Models (LLMs) have revolutionized interactions between humans and artificial intelligence (AI) systems, demonstrating state-of-the-art performance across various domains, including scientific discovery and hypothesis generation. However, the absence of a comprehensive and systematic evaluation framework for LLM-driven research idea generation hinders a rigorous understanding of their strengths and limitations. To address this gap, we propose IdeaBench, a benchmark system that provides a structured dataset and evaluation framework for standardizing the assessment of research idea generation by LLMs. Our dataset comprises titles and abstracts from 2,374 influential papers across eight research domains, along with their 29,408 referenced works, creating a context-rich environment that mirrors human researchers' ideation processes. By profiling LLMs as domain-specific researchers and grounding them in similar contextual constraints, we directly leverage the models' knowledge learned during the pre-training stage to generate new research ideas. To systematically evaluate LLMs' research ideation capability and approximate human assessment, we propose a reference-based metric that aligns with human judgment to quantify idea quality with the assistance of LLMs. Through this evaluation, we find that while LLMs excel at generating novel ideas, they may struggle with generating feasible ones. IdeaBench serves as a critical resource for benchmarking and comparing LLMs, ultimately advancing research on AI's role in automating scientific discovery.
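To make the described pipeline concrete, below is a minimal Python sketch of how one might profile an LLM as a domain-specific researcher, ground it in a target paper's references, and then score the generated idea with an LLM judge against the held-out abstract. This is an illustrative reconstruction, not the authors' implementation: the `PaperRecord` structure, prompt wording, 1-5 scoring scale, and the `LLMClient` stand-in for a chat-style API are all assumptions.

```python
from dataclasses import dataclass
from typing import Callable, List

# Stand-in for any chat-style LLM API (hosted or local):
# it takes a prompt string and returns the model's text completion.
LLMClient = Callable[[str], str]


@dataclass
class PaperRecord:
    """One IdeaBench-style entry: a target paper plus its referenced works."""
    domain: str
    title: str
    abstract: str
    references: List[str]  # titles + abstracts of referenced papers


def build_ideation_prompt(record: PaperRecord, max_refs: int = 20) -> str:
    """Profile the LLM as a domain researcher and ground it in the references."""
    refs = "\n\n".join(f"[{i + 1}] {r}" for i, r in enumerate(record.references[:max_refs]))
    return (
        f"You are an experienced researcher in {record.domain}.\n"
        f"Below are titles and abstracts of recent works you have read:\n\n{refs}\n\n"
        "Based only on this background, propose a novel and feasible research idea. "
        "State the idea in 3-5 sentences."
    )


def build_judge_prompt(idea: str, record: PaperRecord) -> str:
    """Ask an LLM judge to score the idea against the held-out target abstract."""
    return (
        "You are reviewing a proposed research idea.\n"
        f"Proposed idea:\n{idea}\n\n"
        f"Reference paper (ground truth) abstract:\n{record.abstract}\n\n"
        "Rate the idea's novelty and feasibility relative to the reference paper, "
        "each on a 1-5 scale. Answer as: novelty=<n>, feasibility=<m>."
    )


def evaluate_record(generator: LLMClient, judge: LLMClient, record: PaperRecord) -> str:
    """Generate an idea from the references, then score it with the judge model."""
    idea = generator(build_ideation_prompt(record))
    return judge(build_judge_prompt(idea, record))
```

In this sketch, the target paper's own abstract is withheld from the generator and used only by the judge as the reference point, which mirrors the benchmark's reference-based evaluation setup at a high level.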