The paper introduces SurGE, a new benchmark for scientific survey generation in computer science, comprising test instances with topic descriptions, expert-written surveys, cited references, and a large-scale academic corpus. It proposes an automated evaluation framework assessing comprehensiveness, citation accuracy, structural organization, and content quality of generated surveys. Experiments using SurGE reveal a substantial performance gap across LLM-based methods, including agentic frameworks, underscoring the difficulty of automated survey generation.
LLMs still struggle to synthesize coherent scientific surveys, as evidenced by a new benchmark revealing significant performance gaps even with advanced agentic frameworks.
The rapid growth of academic literature makes the manual creation of scientific surveys increasingly infeasible. While large language models show promise for automating this process, progress in this area is hindered by the absence of standardized benchmarks and evaluation protocols. To bridge this critical gap, we introduce SurGE (Survey Generation Evaluation), a new benchmark for scientific survey generation in computer science. SurGE consists of (1) a collection of test instances, each including a topic description, an expert-written survey, and its full set of cited references, and (2) a large-scale academic corpus of over one million papers. In addition, we propose an automated evaluation framework that measures the quality of generated surveys across four dimensions: comprehensiveness, citation accuracy, structural organization, and content quality. Our evaluation of diverse LLM-based methods demonstrates a significant performance gap, revealing that even advanced agentic frameworks struggle with the complexities of survey generation and highlighting the need for future research in this area. We have open-sourced all the code, data, and models at: https://github.com/oneal2000/SurGE
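For intuition, the sketch below shows one way a test instance and a simple citation-overlap score could be represented. The field names and the precision/recall scoring are illustrative assumptions, not SurGE's actual schema or evaluation metrics; see the repository above for the real framework.

```python
from __future__ import annotations
from dataclasses import dataclass, field

@dataclass
class SurveyInstance:
    """Hypothetical shape of a SurGE-style test instance.
    Field names are illustrative, not the benchmark's actual schema."""
    topic: str                                             # topic description
    expert_survey: str                                      # expert-written survey text
    reference_ids: set[str] = field(default_factory=set)    # IDs of cited references

def citation_scores(generated_refs: set[str], gold_refs: set[str]) -> dict[str, float]:
    """Toy citation precision/recall/F1 against the expert reference set."""
    if not generated_refs or not gold_refs:
        return {"precision": 0.0, "recall": 0.0, "f1": 0.0}
    hits = len(generated_refs & gold_refs)
    precision = hits / len(generated_refs)
    recall = hits / len(gold_refs)
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}

# Example usage with made-up reference IDs
instance = SurveyInstance(
    topic="Retrieval-augmented generation",
    expert_survey="...",
    reference_ids={"ref-001", "ref-002", "ref-003"},
)
print(citation_scores({"ref-001", "ref-999"}, instance.reference_ids))
# {'precision': 0.5, 'recall': 0.333..., 'f1': 0.4}
```

A set-overlap score like this only captures one of the four dimensions the paper evaluates (citation accuracy); comprehensiveness, structural organization, and content quality require the automated framework described in the paper.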