Mar 13, 2026arXiv:2603.12824

NanoVDR: Distilling a 2B Vision-Language Retriever into a 70M Text-Only Encoder for Visual Document Retrieval

AI Summary

NanoVDR addresses the inefficiency of using large VLMs for visual document retrieval by decoupling document indexing and query encoding. A 2B parameter VLM indexes documents offline, while a distilled text-only student model (as small as 69M parameters) encodes queries at inference time. Through systematic comparison of distillation objectives, pointwise cosine alignment on query text is found to be the most effective, and cross-lingual transfer is improved via machine-translated query augmentation.

Key Contribution

Shrinking a 2B vision-language retriever to a 70M text-only model achieves 95% of the original quality and outperforms a 2B baseline, while slashing query latency by 50x.

Abstract

Vision-Language Model (VLM) based retrievers have advanced visual document retrieval (VDR) to impressive quality. They require the same multi-billion parameter encoder for both document indexing and query encoding, incurring high latency and GPU dependence even for plain-text queries. We observe that this design is unnecessarily symmetric: documents are visually complex and demand strong visual understanding, whereas queries are just short text strings. NanoVDR exploits this query--document asymmetry by decoupling the two encoding paths: a frozen 2B VLM teacher indexes documents offline, while a distilled text-only student as small as 69M parameters encodes queries at inference. The key design choice is the distillation objective. Through systematic comparison of six objectives across three backbones and 22 ViDoRe benchmark datasets, we find that pointwise cosine alignment on query text consistently outperforms ranking-based and contrastive alternatives, while requiring only pre-cached teacher query embeddings and no document processing during training. Furthermore, we identify cross-lingual transfer as the primary performance bottleneck, and resolve it cheaply by augmenting training data with machine-translated queries. The resulting NanoVDR-S-Multi (DistilBERT, 69M) retains 95.1\% of teacher quality and outperforms DSE-Qwen2 (2B) on v2 and v3 with 32times fewer parameters and 50times lower CPU query latency, at a total training cost under 13 GPU-hours.

Inference & Quantization Multimodal Models Recommendation & Information Retrieval

Citation Metrics

Citations0

Influential citations0

References0

Year2026

VenueN/A

Related Papers

Finding related papers...

Search

NanoVDR: Distilling a 2B Vision-Language Retriever into a 70M Text-Only Encoder for Visual Document Retrieval

Related Papers