The paper introduces LiteraryQA, a high-quality subset of the NarrativeQA dataset, designed to improve the evaluation of long-document narrative question answering systems by focusing on literary works and mitigating noisy data. The authors employ a human- and LLM-validated pipeline to refine QA pairs and source documents, resulting in a cleaner benchmark. Their meta-evaluation of automatic metrics demonstrates the inadequacy of n-gram-based metrics and the potential of LLM-as-a-Judge evaluations for assessing system performance on LiteraryQA.
LiteraryQA, a meticulously curated subset of NarrativeQA, the go-to benchmark for long-document QA, reveals that LLM-as-a-Judge metrics align with human judgment far better than traditional n-gram methods.
Question Answering (QA) on narrative text poses a unique challenge to current systems, requiring a deep understanding of long, complex documents. However, the reliability of NarrativeQA, the most widely used benchmark in this domain, is hindered by noisy documents and flawed QA pairs. In this work, we introduce LiteraryQA, a high-quality subset of NarrativeQA focused on literary works. Using a human- and LLM-validated pipeline, we identify and correct low-quality QA samples while removing extraneous text from source documents. We then carry out a meta-evaluation of automatic metrics to clarify how systems should be evaluated on LiteraryQA. This analysis reveals that all n-gram-based metrics have a low system-level correlation to human judgment, while LLM-as-a-Judge evaluations, even with small open-weight models, can strongly agree with the ranking identified by humans. Finally, we benchmark a set of long-context LLMs on LiteraryQA. We release our code and data at https://github.com/SapienzaNLP/LiteraryQA.
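The system-level correlation mentioned in the abstract can be illustrated with a short sketch: each metric scores every system's answers, systems are ranked by their mean score, and that ranking is compared against the ranking induced by human judgments, e.g. with Spearman's rho. The metric names and all scores below are invented for illustration, not taken from the paper.

```python
# Toy sketch of system-level correlation between an automatic metric
# and human judgment. All per-system scores below are made up.

def rank(values):
    """Assign ranks (1 = lowest value), averaging ranks over ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # extend j over any run of tied values
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average rank for the tied group
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(xs, ys):
    """Spearman's rho: Pearson correlation of the two rank vectors."""
    rx, ry = rank(xs), rank(ys)
    n = len(xs)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

# Hypothetical per-system mean scores for four QA systems.
human = [0.82, 0.74, 0.61, 0.55]  # human ratings
ngram = [0.31, 0.35, 0.30, 0.33]  # an n-gram metric, e.g. ROUGE-like
judge = [0.80, 0.71, 0.63, 0.52]  # LLM-as-a-Judge scores

print(spearman(human, ngram))  # low rank agreement in this toy setup
print(spearman(human, judge))  # perfect rank agreement in this toy setup
```

Here the n-gram metric's scores barely separate the systems and shuffle their order, while the judge scores preserve the human ranking exactly, which is the pattern the paper's meta-evaluation reports at the system level.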