CAS | Chongqing | State Key Laboratory of AI Safety | University of California | UQ | Jun 17, 2026 | arXiv:2606.18781

Lost in a Single Vector: Improving Long-Document Retrieval with Chunk Evidence Aggregation

Shanshan Lyu, Yiwei Wang, Yujun Cai, Jiafeng Guo, Shenghua Liu

AI Summary

This paper addresses a limitation of dense retrieval on long documents: a short but decisive span can be weakened when the whole document is compressed into one vector. The authors introduce the Evidence Dilution Index (EDI) to measure how far a document-level representation falls below the strongest chunk-level evidence in the same gold document. They propose DICE (Document Inference via Chunk Evidence), a training-free method that encodes document chunks independently with a frozen model and aggregates them into a single vector. On LongEmbed, DICE improves retrieval across four backbones, with the largest gains on slices beyond 4k tokens, and it yields lower EDI than the single-vector baseline in 92.8% of 12,779 filtered samples.

Key Contribution

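DICE is a training-free, document-side strategy that keeps chunk-level evidence from being diluted while still producing one vector per document. Its largest gains are on slices beyond 4k tokens: for Dream, Passkey >4k rises from 30.0 to 90.0 (a 60-point gain) and Needle >4k from 23.3 to 74.0.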

Abstract

Dense retrieval ranks one query vector against one document vector. On long documents, this interface can fail when a short but decisive span is weakened during document encoding before ranking. We study this failure mode as document-side early compression and introduce the Evidence Dilution Index (EDI) to measure how far a document-level representation falls below the strongest chunk-level evidence within the same gold document. Guided by this view, we propose DICE (Document Inference via Chunk Evidence), a training-free document-side strategy that splits documents into chunks, encodes them independently with a frozen model, and aggregates them back into a single vector while preserving the standard one-query-one-document interface. On LongEmbed, DICE improves retrieval across four backbones, with the largest gains on slices beyond 4k tokens: for Dream, Passkey >4k rises from 30.0 to 90.0 and Needle >4k from 23.3 to 74.0. Across 12,779 filtered samples, DICE yields lower EDI than the single-vector baseline in 92.8% of cases. These results establish document-level encoding as a practical and underexplored lever for long-document retrieval.
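The pipeline the abstract describes (split, encode chunks independently, aggregate back into one vector) and the idea behind EDI can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the abstract does not say which aggregation DICE uses or give the exact EDI formula, so `aggregate_chunks` uses mean pooling as a placeholder and `evidence_gap` is a simple proxy (strongest chunk score minus document score). `toy_encode` is a hypothetical stand-in for a frozen embedding model, and all function names are invented for this sketch.

```python
import math
import zlib
from typing import Callable, List, Sequence, Tuple

Vector = List[float]


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def normalize(v: Sequence[float]) -> Vector:
    """Scale a vector to unit L2 norm (zero vectors are returned unchanged)."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v] if norm > 0 else list(v)


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    return dot(normalize(a), normalize(b))


def split_into_chunks(tokens: Sequence, chunk_size: int) -> List[list]:
    """Cut a token sequence into consecutive, non-overlapping chunks."""
    return [list(tokens[i:i + chunk_size]) for i in range(0, len(tokens), chunk_size)]


def toy_encode(tokens: Sequence[str], dim: int = 8) -> Vector:
    """Hypothetical stand-in for a frozen encoder: hashed bag of words."""
    vec = [0.0] * dim
    for tok in tokens:
        vec[zlib.crc32(tok.encode("utf-8")) % dim] += 1.0
    return vec


def aggregate_chunks(chunk_vecs: Sequence[Sequence[float]]) -> Vector:
    """ASSUMPTION: mean-pool unit chunk vectors, then renormalize.
    The aggregation actually used by DICE is not given in the abstract."""
    if not chunk_vecs:
        raise ValueError("need at least one chunk vector")
    unit = [normalize(v) for v in chunk_vecs]
    dim = len(unit[0])
    mean = [sum(v[i] for v in unit) / len(unit) for i in range(dim)]
    return normalize(mean)


def encode_document(
    tokens: Sequence[str],
    encode: Callable[[Sequence[str]], Vector],
    chunk_size: int = 256,
) -> Tuple[Vector, List[Vector]]:
    """Encode each chunk independently, then fold the chunks into one vector."""
    chunk_vecs = [normalize(encode(c)) for c in split_into_chunks(tokens, chunk_size)]
    return aggregate_chunks(chunk_vecs), chunk_vecs


def evidence_gap(query_vec, doc_vec, chunk_vecs) -> float:
    """ASSUMPTION: an EDI-style proxy, not the paper's definition.
    Positive values mean the document vector scores below its best chunk."""
    best_chunk = max(cosine(query_vec, c) for c in chunk_vecs)
    return best_chunk - cosine(query_vec, doc_vec)


# Hand-made example: the query matches the first chunk exactly.
chunks = [[1.0, 0.0], [0.0, 1.0]]
doc = aggregate_chunks(chunks)
gap = evidence_gap([1.0, 0.0], doc, chunks)  # about 0.293
```

In the hand-made example the best chunk scores 1.0 against the query while the pooled document vector scores about 0.707, so the proxy reports a gap of about 0.293. The retrieval interface is unchanged: the index still stores one vector per document, and only the way that vector is built differs.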

Natural Language Processing | Recommendation & Information Retrieval

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
