This paper investigates the ability of long-context language models to understand literary fiction by introducing a literary evidence retrieval task based on the RELiC dataset, in which models must generate a missing quotation from a primary source text given surrounding literary criticism. The authors curate a high-quality subset of 292 examples and benchmark several models, finding that Gemini Pro 2.5 surpasses human expert performance, while open-weight models lag significantly behind. The analysis reveals that even the strongest models struggle with nuanced literary signals and overgeneration, indicating areas for improvement in applying LLMs to literary analysis.
Gemini Pro 2.5 can beat human experts at literary evidence retrieval, but even it struggles with nuanced literary signals, suggesting LLMs still have a long way to go in literary analysis.
How well do modern long-context language models understand literary fiction? We explore this question via the task of literary evidence retrieval, repurposing the RELiC dataset of Thai et al. (2022) to construct a benchmark where the entire text of a primary source (e.g., The Great Gatsby) is provided to an LLM alongside literary criticism with a missing quotation from that work. This setting, in which the model must generate the missing quotation, mirrors the human process of literary analysis by requiring models to perform both global narrative reasoning and close textual examination. We curate a high-quality subset of 292 examples through extensive filtering and human verification. Our experiments show that recent reasoning models, such as Gemini Pro 2.5, can exceed human expert performance (62.5% vs. 50% accuracy). In contrast, the best open-weight model achieves only 29.1% accuracy, highlighting a wide gap in interpretive reasoning between open- and closed-weight models. Despite their speed and apparent accuracy, even the strongest models struggle with nuanced literary signals and overgeneration, signaling open challenges for applying LLMs to literary analysis. We release our dataset and evaluation code to encourage future work in this direction.