Apr 14, 2026 | arXiv:2604.13268

Indexing Multimodal Language Models for Large-scale Image Retrieval

Bahey Tharwat, Giorgos Kordopatis-Zilos, Pavel Šuma, Ian Reid, Giorgos Tolias

AI Summary

This paper explores the use of Multimodal Large Language Models (MLLMs) as zero-shot similarity estimators for image retrieval: the model is prompted with an image pair, and its next-token probabilities are converted into a similarity score. The authors demonstrate that MLLMs can re-rank image retrieval candidates without task-specific training, outperforming specialized re-rankers in out-of-domain scenarios and exhibiting robustness to visual noise. To scale to large retrieval pipelines, the approach combines MLLMs with efficient indexing techniques.

Key Contribution

Without any training, MLLMs can outperform specialized re-rankers at image retrieval, especially when the target domain differs from the training data of those specialized models.

Abstract

Multimodal Large Language Models (MLLMs) have demonstrated strong cross-modal reasoning capabilities, yet their potential for vision-only tasks remains underexplored. We investigate MLLMs as training-free similarity estimators for instance-level image-to-image retrieval. Our approach prompts the model with paired images and converts next-token probabilities into similarity scores, enabling zero-shot re-ranking within large-scale retrieval pipelines. This design avoids specialized architectures and fine-tuning, leveraging the rich visual discrimination learned during multimodal pre-training. We address scalability by combining MLLMs with memory-efficient indexing and top-$k$ candidate re-ranking. Experiments across diverse benchmarks show that MLLMs outperform task-specific re-rankers outside their native domains and exhibit superior robustness to clutter, occlusion, and small objects. Despite strong results, we identify failure modes under severe appearance changes, highlighting opportunities for future research. Our findings position MLLMs as a promising alternative for open-world large-scale image retrieval.
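
The scoring and re-ranking steps described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the prompt or which next tokens are read, so the choice of a "Yes"/"No" answer pair is an assumption, and the names `yes_probability`, `rerank_top_k`, `pair_logits`, and `fake_logits` are hypothetical. The MLLM call itself is replaced by a stub that returns made-up logits.

```python
import math
from typing import Callable, List, Sequence, Tuple


def yes_probability(logit_yes: float, logit_no: float) -> float:
    """Softmax restricted to two answer tokens (assumed here to be "Yes"/"No").

    This equals sigmoid(logit_yes - logit_no), computed in a numerically
    stable way for both signs of the difference.
    """
    d = logit_yes - logit_no
    if d >= 0:
        return 1.0 / (1.0 + math.exp(-d))
    e = math.exp(d)
    return e / (1.0 + e)


def rerank_top_k(
    ranked_ids: Sequence[str],
    pair_logits: Callable[[str], Tuple[float, float]],
    k: int,
) -> List[str]:
    """Re-score only the first k candidates; the tail keeps its original order.

    pair_logits is a hypothetical stand-in for an MLLM call that takes a
    candidate id (paired with the query image) and returns the next-token
    logits for the two answer tokens.
    """
    head, tail = list(ranked_ids[:k]), list(ranked_ids[k:])
    scores = {cid: yes_probability(*pair_logits(cid)) for cid in head}
    head.sort(key=lambda cid: scores[cid], reverse=True)
    return head + tail


# Made-up logits for illustration only; a real pipeline would query the model.
fake_logits = {"a": (0.2, 1.5), "b": (3.0, -1.0), "c": (1.0, 1.0), "d": (9.0, 0.0)}
reranked = rerank_top_k(["a", "b", "c", "d"], lambda cid: fake_logits[cid], k=3)
print(reranked)
```

In this sketch only the top three candidates from the first-stage ranking are re-scored, so candidate "d" stays last even though its stub logits would give it a high score. That is the trade-off of top-$k$ re-ranking: the expensive model is called k times per query, and anything the first stage ranks below k cannot be recovered.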

Computer Vision, Multimodal Models, Recommendation & Information Retrieval

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
