This paper introduces GRIP, a feedback-guided retrieval framework designed to enhance Multimodal In-Context Learning (M-ICL) by identifying the most beneficial examples for Large Multimodal Models (LMMs). Through a systematic analysis, the authors show that traditional similarity-based approaches often fail to select the most effective in-context examples, leading to suboptimal performance. GRIP uses contrastive training on LMM feedback to refine retrieval beyond pure similarity, and it improves over similarity-based retrieval on classification, captioning, and visual question answering, with its strongest gains in classification with the Idefics2-8B model.
Retrieving the right prompts can boost LMM performance by up to 30%, challenging the assumption that similarity guarantees effectiveness in in-context learning.
In-Context Learning (ICL) has become a powerful mechanism for adapting Large Language Models (LLMs) to new tasks without fine-tuning. Extending this concept to Large Multimodal Models (LMMs), Multimodal In-Context Learning (M-ICL) relies on retrieving relevant examples, such as images, captions, or question-answer pairs, to guide predictions across tasks like classification, captioning, and visual question answering (VQA). Most existing approaches select in-context examples based on feature-space similarity, assuming that semantically similar samples provide the most useful context. However, our systematic analysis reveals that this assumption does not always hold: visually similar examples are not necessarily those that most effectively enhance in-context learning performance. To address this, we propose the Guided Retrieval of In-context Prompts (GRIP), a learnable vision-only retrieval framework that leverages feedback from LMMs to identify examples that truly improve model predictions. GRIP learns to distinguish beneficial from detrimental in-context examples through contrastive training, refining retrieval beyond pure similarity. Across three multimodal tasks, namely classification, captioning, and VQA, GRIP improves consistently over similarity-based retrieval on Qwen2.5-VL-7B, with its strongest gains in classification on Idefics2-8B. Moreover, we demonstrate that retrievers trained with feedback from one open LMM can be transferred to other models without retraining, including closed-source GPT-4o and Gemini, enabling scalable and cost-efficient deployment of M-ICL. Code will be published upon acceptance.
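The abstract does not spell out the training objective, so the following is only a minimal sketch of what feedback-guided contrastive training could look like, not the authors' implementation. It assumes an InfoNCE-style loss over cosine similarities, and a hypothetical labeling rule in which a candidate counts as beneficial when the LMM's feedback score with that example in context exceeds a baseline score. The names `split_by_feedback`, `contrastive_loss`, `feedback_scores`, `baseline`, and `temperature` are illustrative and do not come from the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def split_by_feedback(candidates, feedback_scores, baseline):
    """Partition candidate embeddings using LMM feedback.

    Assumption (not from the paper): a candidate is beneficial when the
    LMM's score with that example in context is above `baseline`.
    """
    beneficial, detrimental = [], []
    for emb, score in zip(candidates, feedback_scores):
        (beneficial if score > baseline else detrimental).append(emb)
    return beneficial, detrimental


def contrastive_loss(query, beneficial, detrimental, temperature=0.1):
    """InfoNCE-style loss: low when the query embedding is close to
    beneficial examples and far from detrimental ones.

    Requires at least one beneficial example.
    """
    pos = sum(math.exp(cosine(query, p) / temperature) for p in beneficial)
    neg = sum(math.exp(cosine(query, n) / temperature) for n in detrimental)
    return -math.log(pos / (pos + neg))


if __name__ == "__main__":
    # Toy 2-D embeddings standing in for vision-encoder features.
    query = [1.0, 0.0]
    candidates = [[1.0, 0.0], [0.0, 1.0]]
    feedback_scores = [0.9, 0.2]  # hypothetical LMM feedback per candidate
    good, bad = split_by_feedback(candidates, feedback_scores, baseline=0.5)
    print(contrastive_loss(query, good, bad))
```

Minimizing a loss of this shape with respect to the retriever's embeddings would pull queries toward examples the LMM found helpful and push them away from harmful ones, which is one way retrieval could move beyond pure feature-space similarity. In the toy case similarity and feedback happen to agree; the setting the abstract highlights is the one where they disagree.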