The authors introduce AirQA, a new question answering dataset comprising 13,948 AI papers and 1,246 questions designed for multi-task, multi-modal, and instance-level evaluation of LLMs on scientific documents. They also propose ExTrActor, an automated framework leveraging LLMs to synthesize instruction data for training interactive agents. Experiments demonstrate that existing models struggle on AirQA, and that ExTrActor can improve the multi-turn tool-use capabilities of smaller models.
LLMs still struggle to answer questions about AI research papers, as evidenced by a challenging new dataset, AirQA, which is accompanied by an automated method for synthesizing training data to improve performance.
The growing volume of academic papers has made it increasingly difficult for researchers to efficiently extract key information. While agents based on large language models (LLMs) can automate question answering (QA) workflows for scientific papers, a comprehensive and realistic benchmark for evaluating their capabilities is still lacking. Moreover, training an interactive agent for this task is hindered by the shortage of high-quality interaction trajectories. In this work, we propose AirQA, a human-annotated, comprehensive paper-QA dataset in the field of artificial intelligence (AI) comprising 13,948 papers and 1,246 questions, which supports multi-task, multi-modal, and instance-level evaluation. Furthermore, we propose ExTrActor, an automated framework for instruction-data synthesis. With three LLM-based agents, ExTrActor performs example generation and trajectory collection without human intervention. Evaluations of multiple open-source and proprietary models show that most underperform on AirQA, demonstrating the difficulty of our dataset. Extensive experiments confirm that ExTrActor consistently improves the multi-turn tool-use capability of small models, enabling them to achieve performance comparable to larger ones.
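As a rough illustration of what "instance-level evaluation" means here, each question is scored independently against its gold answer and the results are averaged. The record schema and the lenient exact-match metric below are assumptions for illustration only, not AirQA's actual format or scoring protocol:

```python
# Hypothetical sketch of instance-level paper-QA scoring.
# The "question_id"/"answer" schema is assumed, not AirQA's real schema.

def normalize(ans: str) -> str:
    """Lowercase and collapse whitespace for a lenient exact match."""
    return " ".join(ans.lower().split())

def instance_level_accuracy(examples, predictions):
    """Score each question instance independently, then average."""
    if not examples:
        return 0.0
    correct = 0
    for ex in examples:
        pred = predictions.get(ex["question_id"], "")
        if normalize(pred) == normalize(ex["answer"]):
            correct += 1
    return correct / len(examples)

examples = [
    {"question_id": "q1", "answer": "13948"},
    {"question_id": "q2", "answer": "ExTrActor"},
]
predictions = {"q1": "13948", "q2": "extractor"}
print(instance_level_accuracy(examples, predictions))  # → 1.0
```

In practice a benchmark like this would pair the per-instance check with task-specific metrics (e.g. for multi-modal questions), but the independent per-question scoring loop is the core idea.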