The paper introduces ColParse, an approach to visual document retrieval (VDR) that addresses the storage bottleneck of multi-vector architectures by parsing documents into layout-informed sub-image embeddings. These sub-image embeddings are fused with a global page-level vector to create a compact, structurally aware multi-vector representation. Experiments show ColParse reduces storage by over 95% while improving performance across various benchmarks and base models, making fine-grained VDR more practical.
A layout-aware parsing strategy cuts visual document retrieval storage by over 95% while improving retrieval performance.
Harnessing the full potential of visually-rich documents requires retrieval systems that understand not just text, but intricate layouts, a core challenge in Visual Document Retrieval (VDR). The prevailing multi-vector architectures, while powerful, face a crucial storage bottleneck that current optimization strategies, such as embedding merging, pruning, or using abstract tokens, fail to resolve without compromising performance or ignoring vital layout cues. To address this, we introduce ColParse, a novel paradigm that leverages a document parsing model to generate a small set of layout-informed sub-image embeddings, which are then fused with a global page-level vector to create a compact and structurally-aware multi-vector representation. Extensive experiments demonstrate that our method reduces storage requirements by over 95% while simultaneously yielding significant performance gains across numerous benchmarks and base models. ColParse thus bridges the critical gap between the fine-grained accuracy of multi-vector retrieval and the practical demands of large-scale deployment, offering a new path towards efficient and interpretable multimodal information systems.
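The pipeline described in the abstract can be illustrated with a minimal sketch. The abstract does not say how the sub-image embeddings and the page-level vector are fused or how pages are scored, so everything below is an assumption made for illustration: `fuse` uses a weighted sum followed by L2 normalization, `maxsim` uses ColBERT-style late-interaction scoring, and the `alpha` parameter and the vector counts in the usage note are hypothetical rather than taken from the paper.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def fuse(region_vecs, page_vec, alpha=0.5):
    """Mix each layout-region embedding with the global page embedding.

    ASSUMPTION: a weighted sum plus normalization stands in for the fusion
    step; the source does not specify the actual operator.
    """
    fused = []
    for region in region_vecs:
        mixed = [alpha * r + (1.0 - alpha) * p for r, p in zip(region, page_vec)]
        fused.append(l2_normalize(mixed))
    return fused


def maxsim(query_vecs, doc_vecs):
    """Late-interaction score: for each query vector, take its best-matching
    document vector (by dot product) and sum those maxima.

    ASSUMPTION: ColBERT-style MaxSim scoring, common in multi-vector retrieval.
    """
    return sum(
        max(sum(q * d for q, d in zip(qv, dv)) for dv in doc_vecs)
        for qv in query_vecs
    )


def storage_reduction(n_baseline_vecs, n_compact_vecs):
    """Fraction of per-page vectors saved relative to a baseline."""
    return 1.0 - n_compact_vecs / n_baseline_vecs
```

As a usage example with hypothetical numbers, a page stored as 1024 patch-level vectors that is instead represented by 8 fused region vectors gives `storage_reduction(1024, 8)`, a saving of roughly 99% in vector count. Scoring then runs `maxsim` over the 8 fused vectors rather than over every patch vector.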