The paper introduces LiteraryQA, a high-quality subset of the NarrativeQA dataset, designed to improve the evaluation of long-document narrative question answering systems by focusing on literary works and mitigating noisy data. The authors employ a human- and LLM-validated pipeline to refine QA pairs and source documents, resulting in a cleaner benchmark. Their meta-evaluation of automatic metrics demonstrates the inadequacy of n-gram-based metrics and the potential of LLM-as-a-Judge evaluations for assessing system performance on LiteraryQA.
LiteraryQA, a meticulously curated subset of NarrativeQA, the go-to benchmark for long-document QA, reveals that LLM-as-a-Judge metrics align with human judgment far better than traditional n-gram methods.
Question Answering (QA) on narrative text poses a unique challenge to current systems, requiring a deep understanding of long, complex documents. However, the reliability of NarrativeQA, the most widely used benchmark in this domain, is hindered by noisy documents and flawed QA pairs. In this work, we introduce LiteraryQA, a high-quality subset of NarrativeQA focused on literary works. Using a human- and LLM-validated pipeline, we identify and correct low-quality QA samples while removing extraneous text from source documents. We then carry out a meta-evaluation of automatic metrics to clarify how systems should be evaluated on LiteraryQA. This analysis reveals that all n-gram-based metrics have a low system-level correlation to human judgment, while LLM-as-a-Judge evaluations, even with small open-weight models, can strongly agree with the ranking identified by humans. Finally, we benchmark a set of long-context LLMs on LiteraryQA. We release our code and data at https://github.com/SapienzaNLP/LiteraryQA.
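The system-level correlation mentioned in the abstract can be illustrated with a short sketch: each metric scores every system's answers, systems are ranked by their mean score, and that ranking is compared against the ranking induced by human judgments, e.g. with Spearman's rho. The metric names and all scores below are invented for illustration, not taken from the paper.

```python
# Toy sketch of system-level correlation between an automatic metric
# and human judgment. All per-system scores below are made up.

def rank(values):
    """Assign ranks (1 = lowest value), averaging ranks over ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # extend j over any run of tied values
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average rank for the tied group
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(xs, ys):
    """Spearman's rho: Pearson correlation of the two rank vectors."""
    rx, ry = rank(xs), rank(ys)
    n = len(xs)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

# Hypothetical per-system mean scores for four QA systems.
human = [0.82, 0.74, 0.61, 0.55]  # human ratings
ngram = [0.31, 0.35, 0.30, 0.33]  # an n-gram metric, e.g. ROUGE-like
judge = [0.80, 0.71, 0.63, 0.52]  # LLM-as-a-Judge scores

print(spearman(human, ngram))  # low rank agreement in this toy setup
print(spearman(human, judge))  # perfect rank agreement in this toy setup
```

Here the n-gram metric's scores barely separate the systems and shuffle their order, while the judge scores preserve the human ranking exactly, which is the pattern the paper's meta-evaluation reports at the system level.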