Stanford HAI | Apr 24, 2026 | arXiv:2604.22294

Contexts are Never Long Enough: Structured Reasoning for Scalable Question Answering over Long Document Sets

Harshit Joshi, Priyank Shethia, Jadelynn Dao, Monica S. Lam

AI Summary

The paper introduces SLIDERS, a framework for question answering over long document collections that addresses the context window limitations of LLMs by extracting salient information into a relational database. SLIDERS uses SQL for scalable reasoning over this structured state and adds a data reconciliation stage that makes the extracted records globally coherent by detecting and repairing duplicated, inconsistent, and incomplete entries. On three existing long-context benchmarks, SLIDERS outperforms all baselines and exceeds GPT-4.1 by 6.6 points on average; on two new benchmarks at 3.9M and 36M tokens, it improves over the next best baseline by ~19 and ~32 points, respectively.

Key Contribution

SLIDERS beats GPT-4.1 on long-context QA by moving aggregation out of the context window and into SQL over an extracted relational database.

Abstract

Real-world document question answering is challenging. Analysts must synthesize evidence across multiple documents and different parts of each document. However, any fixed LLM context window can be exceeded as document collections grow. A common workaround is to decompose documents into chunks and assemble answers from chunk-level outputs, but this introduces an aggregation bottleneck: as the number of chunks grows, systems must still combine and reason over an increasingly large body of extracted evidence. We present SLIDERS, a framework for question answering over long document collections through structured reasoning. SLIDERS extracts salient information into a relational database, enabling scalable reasoning over persistent structured state via SQL rather than concatenated text. To make this locally extracted representation globally coherent, SLIDERS introduces a data reconciliation stage that leverages provenance, extraction rationales, and metadata to detect and repair duplicated, inconsistent, and incomplete records. SLIDERS outperforms all baselines on three existing long-context benchmarks, despite all of them fitting within the context window of strong base LLMs, exceeding GPT-4.1 by 6.6 points on average. It also improves over the next best baseline by ~19 and ~32 points on two new benchmarks at 3.9M and 36M tokens, respectively.
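The extract, reconcile, then query flow described in the abstract can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the `revenue` table, its columns, the sample rows, and the function names are hypothetical, and the exact-match deduplication is a simple stand-in for the paper's reconciliation stage, which relies on provenance, extraction rationales, and metadata. This is not the authors' implementation.

```python
import sqlite3

# Hypothetical chunk-level extractions, each carrying provenance (doc_id, chunk_id).
# Schema and values are invented for illustration only.
extracted = [
    ("10k_2022.pdf", 3, "Acme Corp", 2022, 120.0),
    ("10k_2022.pdf", 17, "Acme Corp", 2022, 120.0),  # same fact seen in a second chunk
    ("10k_2023.pdf", 5, "Acme Corp", 2023, 150.0),
    ("10k_2023.pdf", 9, "Beta Inc", 2023, 80.0),
]


def build_db(rows):
    """Load locally extracted records into an in-memory relational table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE revenue ("
        "doc_id TEXT, chunk_id INTEGER, company TEXT, year INTEGER, revenue_musd REAL)"
    )
    conn.executemany("INSERT INTO revenue VALUES (?, ?, ?, ?, ?)", rows)
    return conn


def reconcile(conn):
    """Stand-in reconciliation: keep one row per (company, year, revenue_musd).

    Only exact duplicates are removed; conflicting or incomplete records
    would need the richer signals the abstract mentions.
    """
    conn.execute(
        "DELETE FROM revenue WHERE rowid NOT IN ("
        "SELECT MIN(rowid) FROM revenue GROUP BY company, year, revenue_musd)"
    )


def total_revenue_by_company(conn):
    """Answer an aggregation question with SQL instead of concatenated text."""
    return conn.execute(
        "SELECT company, SUM(revenue_musd) FROM revenue "
        "GROUP BY company ORDER BY company"
    ).fetchall()


conn = build_db(extracted)
reconcile(conn)
result = total_revenue_by_company(conn)
print(result)
```

Without the reconciliation step, the duplicated Acme Corp record for 2022 would be counted twice in the sum, which is the kind of aggregation error that grows with the number of chunks.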

Natural Language Processing; Reasoning & Chain-of-Thought

Citation Metrics

Citations: 0

Influential citations: 0

References: 114

Year: 2026

Venue: N/A
