This paper explores using Multimodal Large Language Models (MLLMs) as zero-shot similarity estimators for image retrieval: the model is prompted with an image pair, and its next-token probabilities are converted into a similarity score. The authors show that MLLMs can re-rank image retrieval candidates without task-specific training, outperforming specialized re-rankers in out-of-domain scenarios and remaining robust to clutter, occlusion, and small objects. To scale to large retrieval pipelines, the approach pairs the MLLM with memory-efficient indexing and applies it only to the top-k candidates.
Without any task-specific training, MLLMs can beat specialized re-rankers at image retrieval, especially when the target domain differs from the data those re-rankers were trained on.
Multimodal Large Language Models (MLLMs) have demonstrated strong cross-modal reasoning capabilities, yet their potential for vision-only tasks remains underexplored. We investigate MLLMs as training-free similarity estimators for instance-level image-to-image retrieval. Our approach prompts the model with paired images and converts next-token probabilities into similarity scores, enabling zero-shot re-ranking within large-scale retrieval pipelines. This design avoids specialized architectures and fine-tuning, leveraging the rich visual discrimination learned during multimodal pre-training. We address scalability by combining MLLMs with memory-efficient indexing and top-$k$ candidate re-ranking. Experiments across diverse benchmarks show that MLLMs outperform task-specific re-rankers outside their native domains and exhibit superior robustness to clutter, occlusion, and small objects. Despite strong results, we identify failure modes under severe appearance changes, highlighting opportunities for future research. Our findings position MLLMs as a promising alternative for open-world large-scale image retrieval.
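The scoring and re-ranking steps described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes the prompt asks a yes/no question about whether two images show the same instance, and that the similarity is the probability of "Yes" renormalized against "No". The function names `similarity_from_logits` and `rerank_top_k`, and the `pair_score` callable that stands in for an MLLM call, are hypothetical.

```python
import math
from typing import Callable, List, Sequence


def similarity_from_logits(yes_logit: float, no_logit: float) -> float:
    """Two-way softmax over the next-token logits of 'Yes' and 'No'.

    Assumption: the MLLM was asked a yes/no question about an image pair,
    and these are the logits it assigned to the two answer tokens.
    """
    m = max(yes_logit, no_logit)  # subtract the max for numerical stability
    e_yes = math.exp(yes_logit - m)
    e_no = math.exp(no_logit - m)
    return e_yes / (e_yes + e_no)


def rerank_top_k(
    candidates: Sequence[str],
    pair_score: Callable[[str], float],
    k: int,
) -> List[str]:
    """Re-order only the first k candidates by a pairwise similarity score.

    `candidates` is the ranked list from a first-stage index search.
    `pair_score` is a hypothetical stand-in for scoring (query, candidate)
    with the MLLM. Candidates beyond position k keep their original order.
    """
    head = list(candidates[:k])
    tail = list(candidates[k:])
    head.sort(key=pair_score, reverse=True)  # stable sort, highest score first
    return head + tail
```

Restricting the expensive pairwise scoring to the top-k candidates keeps the number of MLLM calls per query fixed at k, regardless of database size.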