This paper introduces ELVA, a novel framework that addresses grain blindness in Universal Multimodal Retrieval (UMR) by treating negative samples differently based on their similarity to positive samples. By extending Reinforcement Learning with Verifiable Rewards (RLVR) to retrieval tasks and utilizing rule-based rewards, ELVA optimizes the ranking of negative samples and enhances the model's ability to capture grain-level information. The framework achieves state-of-the-art performance on standard retrieval benchmarks and a significant 13.1% improvement on the newly proposed MRBench, highlighting its effectiveness in complex query scenarios.
Treating negative samples differently according to their similarity to the positive yields a 13.1% improvement on MRBench, underscoring the role of grain-level information in retrieval.
Leveraging Multimodal Large Language Models (MLLMs) via contrastive learning has become a mainstream paradigm for improving the performance of Universal Multimodal Retrieval (UMR). However, previous works have overlooked grain blindness when adapting the contrastive paradigm to retrieval tasks. Grain blindness refers to the tendency of the model to overlook grain-level information contained in the query, which is crucial for effectively handling complex queries. It stems from contrastive learning treating samples as a binary classification (positive/negative) while ignoring the different information carried by each negative sample. To address this, we argue that negatives should be treated differently according to their similarity to the positive sample, enabling the model to learn distinct grain information from each negative. In this paper, we introduce ELVA, a simple but effective rule-based RL framework that mitigates grain blindness through ranking-driven MLLMs. 1) Instead of relying on reward models, we extend Reinforcement Learning with Verifiable Rewards (RLVR) to retrieval tasks, allowing the model to explore new ranking behaviors without explicit ranking labels. 2) By utilizing rule-based rewards, our approach jointly optimizes the ranking of negative samples while enlarging the similarity gap between positives and negatives. To more precisely measure grain blindness, we further introduce MRBench, a new benchmark specifically designed for multi-grain query scenarios. ELVA achieves state-of-the-art results across standard retrieval benchmarks, and its notable 13.1% improvement on MRBench further demonstrates its effectiveness in alleviating grain blindness.
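The abstract does not spell out the reward rules, so the following is only a minimal sketch of what a rule-based, verifiable ranking reward could look like. It is not ELVA's actual reward. The function name `rule_based_ranking_reward`, the `positive_weight` parameter, and the precomputed `negative_similarity` scores are all assumptions made for illustration. The sketch scores a predicted ranking on two checks: whether the positive is ranked first, and whether negatives are ordered by their similarity to the positive. It works on ranks only, so it does not model the similarity-gap term mentioned in the abstract.

```python
from itertools import combinations


def rule_based_ranking_reward(predicted_order, positive_id,
                              negative_similarity, positive_weight=0.5):
    """Hypothetical verifiable reward for a predicted candidate ranking.

    predicted_order: candidate ids, best first, as produced by the model.
    positive_id: id of the single positive candidate.
    negative_similarity: maps each negative id to its similarity to the
        positive (assumed precomputed; higher means a harder negative).
    positive_weight: assumed mixing weight between the two reward terms.
    """
    rank = {cid: i for i, cid in enumerate(predicted_order)}

    # Term 1 (binary view): the positive must be ranked above every negative.
    positive_reward = 1.0 if rank[positive_id] == 0 else 0.0

    # Term 2 (graded view): fraction of negative pairs whose predicted order
    # agrees with their similarity to the positive.
    concordant = 0
    comparable = 0
    for a, b in combinations(list(negative_similarity), 2):
        if negative_similarity[a] == negative_similarity[b]:
            continue  # ties carry no ordering information
        comparable += 1
        if negative_similarity[a] > negative_similarity[b]:
            closer, farther = a, b
        else:
            closer, farther = b, a
        if rank[closer] < rank[farther]:
            concordant += 1
    ordering_reward = concordant / comparable if comparable else 0.0

    return (positive_weight * positive_reward
            + (1.0 - positive_weight) * ordering_reward)


# Illustrative data: one positive "p" and three negatives of varying hardness.
similarity = {"n1": 0.9, "n2": 0.5, "n3": 0.1}
graded = rule_based_ranking_reward(["p", "n1", "n2", "n3"], "p", similarity)
binary_only = rule_based_ranking_reward(["p", "n3", "n2", "n1"], "p", similarity)
```

In this toy setup both rankings put the positive first, so a purely binary positive/negative objective could not tell them apart. The graded term separates them, which is the intuition behind treating each negative according to its similarity to the positive.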