Mar 15, 2026 · arXiv:2603.14257

Automatic Inter-document Multi-hop Scientific QA Generation

Seungmin Lee, Dongha Kim, Yuni Jeon, Junyoung Koh, Min Song

AI Summary

The authors introduce AIM-SciQA, an automated framework that uses LLMs and embedding-based semantic alignment to generate multi-document, multi-hop scientific question-answering datasets. The framework extracts single-hop QAs with LLM-based machine reading comprehension, then constructs cross-document relations, selectively incorporating citation information. Applied to 8,211 PubMed Central papers, AIM-SciQA produced 411,409 single-hop and 13,672 multi-hop QAs, which form the IM-SciQA dataset. Human and automatic validation found high factual consistency, and experiments indicate that the dataset is effective for benchmarking retrieval-augmented scientific reasoning.

Key Contribution

AIM-SciQA automatically generates a realistic, interpretable multi-hop QA dataset from PubMed Central papers, giving retrieval-augmented scientific reasoning a benchmark that requires inter-document rather than single-document reasoning.

Abstract

Existing automatic scientific question generation studies mainly focus on single-document factoid QA, overlooking the inter-document reasoning crucial for scientific understanding. We present AIM-SciQA, an automated framework for generating multi-document, multi-hop scientific QA datasets. AIM-SciQA extracts single-hop QAs using large language models (LLMs) with machine reading comprehension and constructs cross-document relations based on embedding-based semantic alignment while selectively leveraging citation information. Applied to 8,211 PubMed Central papers, it produced 411,409 single-hop and 13,672 multi-hop QAs, forming the IM-SciQA dataset. Human and automatic validation confirmed high factual consistency, and experimental results demonstrate that IM-SciQA effectively differentiates reasoning capabilities across retrieval and QA stages, providing a realistic and interpretable benchmark for retrieval-augmented scientific reasoning. We further extend this framework to construct CIM-SciQA, a citation-guided variant achieving comparable performance to the Oracle setting, reinforcing the dataset's validity and generality.
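The cross-document linking step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the bag-of-words `embed` function stands in for whatever sentence encoder the paper uses, and the `SingleHopQA` type, the `link_across_documents` function, the `threshold` value, and the treatment of citations as a hard filter on document pairs are all assumptions made for illustration. The sketch pairs a single-hop QA from one document with a single-hop QA from another document when the first answer is semantically close to the second question, which yields a candidate bridge for a two-hop question.

```python
import math
import re
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class SingleHopQA:
    """Hypothetical record for one single-hop QA extracted from a paper."""
    doc_id: str
    question: str
    answer: str


def embed(text: str) -> Counter:
    # Toy bag-of-words vector. Assumption: a real pipeline would call a
    # dense sentence-embedding model here instead.
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    # Cosine similarity between two sparse term-count vectors.
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def link_across_documents(qas, threshold=0.3, citations=None):
    """Return (first, second, score) triples where the answer of `first`
    aligns with the question of `second` and the two come from different
    documents. If `citations` is given (a set of (citing, cited) doc-id
    pairs), only those document pairs are considered. Both the threshold
    and the citation filter are illustrative assumptions."""
    pairs = []
    for first in qas:
        for second in qas:
            if first.doc_id == second.doc_id:
                continue  # inter-document links only
            if citations is not None and (first.doc_id, second.doc_id) not in citations:
                continue
            score = cosine(embed(first.answer), embed(second.question))
            if score >= threshold:
                pairs.append((first, second, score))
    # Highest-scoring bridges first.
    return sorted(pairs, key=lambda p: p[2], reverse=True)
```

Given two hypothetical QAs such as ("Which protein does drug X inhibit?", "kinase ABC1") from one paper and ("What pathway is regulated by kinase ABC1?", "the mTOR pathway") from another, the function links them through the shared entity. A later generation step, not shown here, would merge the pair into a single multi-hop question.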

Data Curation & Synthetic Data · Eval Frameworks & Benchmarks · Reasoning & Chain-of-Thought

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
