The paper introduces SLQ, a method that adapts frozen Multimodal Large Language Models (MLLMs) for retrieval using a small set of learnable "Shared Latent Queries" appended to both text and image token sequences. These queries act as global aggregation interfaces, producing compact embeddings in a unified space without modifying the MLLM's parameters. SLQ outperforms full fine-tuning and LoRA on COCO and Flickr30K, is competitive on MMEB, and yields substantial gains on a new knowledge-aware reasoning retrieval benchmark (KARR-Bench), which suggests that it better preserves pre-trained knowledge.
Freezing your MLLM and training only a handful of shared latent queries can beat full fine-tuning for multimodal retrieval, especially when reasoning is involved.
Multimodal Large Language Models (MLLMs) exhibit strong reasoning and world knowledge, yet adapting them for retrieval remains challenging. Existing approaches rely on invasive parameter updates, such as full fine-tuning and LoRA, which may disrupt the pre-trained semantic space and impair the structured knowledge essential for reasoning. In this work, we argue that adapting MLLMs for retrieval should focus on eliciting pre-trained representations rather than overwriting them. To this end, we propose SLQ, an effective and efficient framework that adapts a frozen MLLM into a retriever through a small set of Shared Latent Queries. Appended to the end of both text and image token sequences, these queries leverage the model's native causal attention to serve as global aggregation interfaces, producing compact embeddings in a unified space while keeping the backbone unchanged. Furthermore, to better evaluate retrieval beyond superficial pattern matching, we construct KARR-Bench, a benchmark designed for knowledge-aware reasoning retrieval. Extensive experiments show that SLQ outperforms full fine-tuning and LoRA on COCO and Flickr30K, while achieving competitive performance on MMEB and yielding substantial gains on KARR-Bench. The results demonstrate that SLQ, which preserves pre-trained representations, provides an effective and efficient framework for adapting MLLMs to retrieval.
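To make the mechanism concrete, here is a minimal NumPy sketch of the idea described in the abstract, not the authors' implementation. It assumes a single causal self-attention layer with identity projections as a stand-in for the frozen backbone, random vectors in place of real text and image token states, and mean pooling over the query outputs followed by L2 normalization (the abstract does not specify the pooling step). The names `causal_self_attention`, `embed_with_queries`, and `shared_queries` are hypothetical.

```python
import numpy as np

def causal_self_attention(x):
    """Single-head causal self-attention with identity projections.

    Assumption: a toy stand-in for one frozen backbone layer; it has no
    trainable weights of its own.
    """
    n, d = x.shape
    scores = x @ x.T / np.sqrt(d)
    # Mask future positions: position i may attend only to positions <= i.
    future = np.triu(np.ones((n, n), dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)
    # Numerically stable softmax over each row.
    scores = scores - scores.max(axis=1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=1, keepdims=True)
    return weights @ x

def embed_with_queries(tokens, latent_queries):
    """Append the latent queries to the END of the sequence and pool their outputs.

    Because attention is causal, every query position can see all input
    tokens, while the input tokens never see the queries.
    """
    seq = np.concatenate([tokens, latent_queries], axis=0)
    out = causal_self_attention(seq)
    query_out = out[-len(latent_queries):]
    emb = query_out.mean(axis=0)        # assumption: mean pooling
    return emb / np.linalg.norm(emb)    # unit norm, so dot product = cosine

rng = np.random.default_rng(0)
dim, num_queries = 16, 4

# The only parameters that would be trained in this sketch.
shared_queries = rng.normal(size=(num_queries, dim))

# Random placeholders for text and image token states of different lengths.
text_tokens = rng.normal(size=(7, dim))
image_tokens = rng.normal(size=(20, dim))

# The SAME queries are used for both modalities, giving one shared space.
text_emb = embed_with_queries(text_tokens, shared_queries)
image_emb = embed_with_queries(image_tokens, shared_queries)
similarity = float(text_emb @ image_emb)
```

Placing the queries last matters in this sketch: under the causal mask, the outputs at the original token positions are identical with or without the appended queries, so the backbone's representations of the input are left untouched and only the query positions aggregate information.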