The paper investigates query-agnostic compression techniques for multi-vector document representations used in late-interaction information retrieval across modalities (text, image, video). It addresses the linear scaling of computation and storage costs with document length in multi-vector retrieval by proposing and evaluating four compression methods. The key result is that the novel Attention-Guided Clustering (AGC) method, which uses attention to identify salient regions for cluster centroids and token aggregation, consistently outperforms other parameterized compression techniques and achieves competitive or improved retrieval performance compared to uncompressed indexes across BEIR, ViDoRe, MSR-VTT, and MultiVENT 2.0 datasets.
Attention-guided clustering cuts the storage cost of multi-vector document representations for retrieval across text, images, and video, while matching, and in some cases *improving* on, the performance of uncompressed indexes.
We study efficient multi-vector retrieval for late interaction in any modality. Late interaction has emerged as a dominant paradigm for information retrieval in text, images, visual documents, and videos, but its computation and storage costs grow linearly with document length, making it costly for image-, video-, and audio-rich corpora. To address this limitation, we explore query-agnostic methods for compressing multi-vector document representations under a constant vector budget. We introduce four approaches for index compression: sequence resizing, memory tokens, hierarchical pooling, and a novel attention-guided clustering (AGC). AGC uses an attention-guided mechanism to identify the most semantically salient regions of a document as cluster centroids and to weight token aggregation. Evaluating these methods on retrieval tasks spanning text (BEIR), visual-document (ViDoRe), and video (MSR-VTT, MultiVENT 2.0), we show that attention-guided clustering consistently outperforms other parameterized compression methods (sequence resizing and memory tokens), provides greater flexibility in index size than non-parametric hierarchical clustering, and achieves competitive or improved performance compared to a full, uncompressed index. The source code is available at: github.com/hanxiangqin/omni-col-press.
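To make the idea concrete, the following is a minimal sketch, not the authors' implementation, of late-interaction (MaxSim) scoring and of compressing a document's token vectors to a fixed budget using saliency-guided clustering. Several things here are assumptions made for illustration: the function names are hypothetical, the per-token saliency scores are taken as a given non-negative array (for example, derived from attention weights; how the paper computes them is not described above), centroids are simply the most salient tokens, and tokens are assigned to their most similar centroid by cosine similarity.

```python
import numpy as np


def l2_normalize(x, eps=1e-9):
    """Scale each row to (approximately) unit length."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)


def maxsim_score(query_vecs, doc_vecs):
    """Late-interaction score: for every query vector, take its best
    cosine match among the document vectors, then sum over the query."""
    sims = l2_normalize(query_vecs) @ l2_normalize(doc_vecs).T  # (n_query, n_doc)
    return float(sims.max(axis=1).sum())


def attention_guided_compress(doc_vecs, saliency, budget):
    """Compress (n_tokens, dim) document vectors to at most (budget, dim).

    Assumption: `saliency` holds one non-negative score per token. The exact
    scoring and clustering rules of AGC may differ from this simplification.
    """
    doc_vecs = l2_normalize(np.asarray(doc_vecs, dtype=float))
    saliency = np.asarray(saliency, dtype=float)
    budget = min(budget, len(doc_vecs))

    # Assumed rule: the most salient tokens act as cluster centroids.
    centroid_idx = np.argsort(-saliency)[:budget]
    centroids = doc_vecs[centroid_idx]

    # Assign every token to its most similar centroid (cosine similarity).
    assign = (doc_vecs @ centroids.T).argmax(axis=1)
    # Pin each centroid token to its own cluster so no cluster is empty.
    assign[centroid_idx] = np.arange(budget)

    compressed = np.zeros((budget, doc_vecs.shape[1]))
    for c in range(budget):
        members = assign == c
        weights = saliency[members] + 1e-9  # saliency-weighted aggregation
        compressed[c] = (weights[:, None] * doc_vecs[members]).sum(axis=0) / weights.sum()
    return l2_normalize(compressed)
```

With a sketch like this, every document is stored as at most `budget` vectors regardless of its length, and `maxsim_score(query_vecs, compressed)` is evaluated against the compressed index in the same way as against the full set of token vectors.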