This paper introduces Prune-then-Merge, a two-stage framework for efficient multi-vector visual document retrieval that combines adaptive pruning to remove low-information patches with hierarchical merging to compress the remaining embeddings. By first pruning noisy features before merging, the method avoids feature dilution and achieves better compression rates without sacrificing retrieval performance. Experiments across 29 VDR datasets demonstrate that Prune-then-Merge outperforms existing methods, achieving a better trade-off between compression rate and feature fidelity.
Multi-vector visual document retrieval gets a speed boost without sacrificing accuracy thanks to a novel "Prune-then-Merge" approach that intelligently compresses visual features.
Visual Document Retrieval (VDR), which aims to retrieve relevant pages from large corpora of visually rich documents, is important to current multimodal retrieval applications. The state-of-the-art multi-vector paradigm excels in performance but suffers from prohibitive overhead. Current efficiency methods such as pruning and merging address this problem only imperfectly, forcing a difficult trade-off between compression rate and feature fidelity. To overcome this dilemma, we introduce Prune-then-Merge, a novel two-stage framework that combines these complementary approaches. Our method first employs an adaptive pruning stage to filter out low-information patches, creating a refined, high-signal set of embeddings. A hierarchical merging stage then compresses this pre-filtered set, summarizing semantic content without the noise-induced feature dilution seen in single-stage methods. Extensive experiments on 29 VDR datasets demonstrate that our framework consistently outperforms existing methods, significantly extending the near-lossless compression range and providing robust performance at high compression ratios.
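The two-stage idea described above can be illustrated with a minimal sketch. The abstract does not specify the pruning criterion or the merging algorithm, so everything below is an assumption made for illustration: the per-patch importance `scores` are taken as given, the adaptive threshold is a hypothetical "mean plus `k` standard deviations" rule, and the merging stage is a plain greedy agglomerative scheme that averages the most cosine-similar pair of clusters until `target` vectors remain. The function names `prune`, `merge`, and `prune_then_merge` are likewise hypothetical and not taken from the paper.

```python
import math


def _cosine(a, b):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def prune(embeddings, scores, k=0.0):
    """Stage 1 (assumed rule): keep patches scoring at least mean + k * std.

    The threshold adapts to each page's own score distribution, so the
    number of patches kept varies from page to page.
    """
    mean = sum(scores) / len(scores)
    std = math.sqrt(sum((s - mean) ** 2 for s in scores) / len(scores))
    threshold = mean + k * std
    kept = [e for e, s in zip(embeddings, scores) if s >= threshold]
    if not kept:
        # Never return an empty page: fall back to the best-scoring patch.
        best = max(range(len(scores)), key=scores.__getitem__)
        kept = [embeddings[best]]
    return kept


def merge(embeddings, target):
    """Stage 2 (assumed scheme): greedy agglomerative merging.

    Repeatedly replaces the most similar pair of clusters with their
    size-weighted mean until at most `target` vectors remain.
    """
    clusters = [(list(e), 1) for e in embeddings]  # (centroid, size)
    while len(clusters) > max(target, 1):
        best_pair, best_sim = None, -2.0
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                sim = _cosine(clusters[i][0], clusters[j][0])
                if sim > best_sim:
                    best_pair, best_sim = (i, j), sim
        i, j = best_pair
        (ci, ni), (cj, nj) = clusters[i], clusters[j]
        centroid = [(x * ni + y * nj) / (ni + nj) for x, y in zip(ci, cj)]
        clusters[i] = (centroid, ni + nj)
        del clusters[j]  # j > i, so index i is unaffected
    return [c for c, _ in clusters]


def prune_then_merge(embeddings, scores, target, k=0.0):
    """Prune low-scoring patches first, then merge only the survivors."""
    return merge(prune(embeddings, scores, k), target)
```

Because pruning runs first, a low-scoring patch never contributes to a merged centroid. This is the dilution effect the abstract attributes to single-stage merging, shown here only in toy form. The pairwise search is quadratic per merge step, which is fine for a sketch but would need a more efficient clustering routine at realistic patch counts.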