HKUST, Huawei | Feb 23, 2026 | arXiv:2602.19549

Sculpting the Vector Space: Towards Efficient Multi-Vector Visual Document Retrieval via Prune-then-Merge Framework

Yibo Yan, Mingdong Ou, Yiqun Cao, Yi Cao, Xin Zou, Jiahao Huo, Shuliang Liu, James T. Kwok, Xuming Hu

AI Summary

This paper introduces Prune-then-Merge, a two-stage framework for efficient multi-vector visual document retrieval. An adaptive pruning stage first removes low-information patches, and a hierarchical merging stage then compresses the remaining embeddings. Because noisy features are pruned before merging, the merged vectors avoid the feature dilution seen in single-stage methods. Experiments across 29 VDR datasets show that Prune-then-Merge outperforms existing methods, extending the near-lossless compression range and remaining robust at high compression ratios.

Key Contribution

A two-stage "Prune-then-Merge" framework that reduces the overhead of multi-vector visual document retrieval with near-lossless accuracy: it first prunes low-information patch embeddings, then hierarchically merges the ones that remain.

Abstract

Visual Document Retrieval (VDR), which aims to retrieve relevant pages within vast corpora of visually-rich documents, is of significance in current multimodal retrieval applications. The state-of-the-art multi-vector paradigm excels in performance but suffers from prohibitive overhead, a problem that current efficiency methods like pruning and merging address imperfectly, creating a difficult trade-off between compression rate and feature fidelity. To overcome this dilemma, we introduce Prune-then-Merge, a novel two-stage framework that synergizes these complementary approaches. Our method first employs an adaptive pruning stage to filter out low-information patches, creating a refined, high-signal set of embeddings. Subsequently, a hierarchical merging stage compresses this pre-filtered set, effectively summarizing semantic content without the noise-induced feature dilution seen in single-stage methods. Extensive experiments on 29 VDR datasets demonstrate that our framework consistently outperforms existing methods, significantly extending the near-lossless compression range and providing robust performance at high compression ratios.
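The two stages described in the abstract can be sketched as follows. This is only a minimal illustration under stated assumptions, not the authors' implementation: the abstract does not specify the pruning criterion or the merging algorithm, so the per-patch `scores` input, the mean-plus-`k`-standard-deviations threshold, the greedy cosine-similarity merging with mean pooling, and the `merge_factor` parameter are all hypothetical choices made here for illustration.

```python
import numpy as np


def adaptive_prune(emb, scores, k=0.0):
    """Stage 1 (assumed rule): keep patches scoring above mean + k * std.

    The threshold is computed per page, so the number of kept patches
    adapts to each page's score distribution.
    """
    threshold = scores.mean() + k * scores.std()
    keep = scores > threshold
    if not keep.any():
        # Never return an empty page: fall back to the best-scoring patch.
        keep[np.argmax(scores)] = True
    return emb[keep]


def hierarchical_merge(emb, target):
    """Stage 2 (assumed rule): greedy agglomerative merging.

    Repeatedly merges the two most cosine-similar clusters and represents
    each cluster by the mean of its member embeddings.
    """
    clusters = [[i] for i in range(len(emb))]
    centroids = [emb[i].copy() for i in range(len(emb))]
    while len(clusters) > target:
        c = np.stack(centroids)
        c = c / (np.linalg.norm(c, axis=1, keepdims=True) + 1e-12)
        sim = c @ c.T
        np.fill_diagonal(sim, -np.inf)  # ignore self-similarity
        i, j = np.unravel_index(np.argmax(sim), sim.shape)
        i, j = int(min(i, j)), int(max(i, j))
        clusters[i] = clusters[i] + clusters[j]
        centroids[i] = emb[clusters[i]].mean(axis=0)
        del clusters[j], centroids[j]
    return np.stack(centroids)


def prune_then_merge(emb, scores, k=0.0, merge_factor=4):
    """Prune low-score patches, then merge the survivors by merge_factor."""
    pruned = adaptive_prune(emb, scores, k)
    target = max(1, -(-len(pruned) // merge_factor))  # ceiling division
    return hierarchical_merge(pruned, target)


# Usage with random stand-in data: 64 patch embeddings of dimension 8.
rng = np.random.default_rng(0)
page_emb = rng.normal(size=(64, 8))
page_scores = rng.random(64)  # hypothetical per-patch importance scores
compressed = prune_then_merge(page_emb, page_scores)
```

Pruning before merging means the mean-pooled centroids are computed only from patches that passed the filter, which is the intuition behind the abstract's claim about avoiding noise-induced feature dilution.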

Inference & Quantization | Multimodal Models | Recommendation & Information Retrieval

Citation Metrics

Citations: 0

Influential citations: 0

References: 102

Year: 2026

Venue: N/A
