Xiaomi Inc | XJTU | Jun 18, 2026 | arXiv:2606.20280

ELVA: Exploring Ranking-Driven Universal Multimodal Retrieval

Yuhan Liu, Pei Fu, Hang Li, Yukun Qi, Chao Jiang, Jingwen Fu, Zhen Liu, Bin Qin, Zhenbo Luo, Jian Luan, Jingmin Xin

AI Summary

This paper introduces ELVA, a novel framework that addresses grain blindness in Universal Multimodal Retrieval (UMR) by treating negative samples differently based on their similarity to positive samples. By extending Reinforcement Learning with Verifiable Rewards (RLVR) to retrieval tasks and utilizing rule-based rewards, ELVA optimizes the ranking of negative samples and enhances the model's ability to capture grain-level information. The framework achieves state-of-the-art performance on standard retrieval benchmarks and a significant 13.1% improvement on the newly proposed MRBench, highlighting its effectiveness in complex query scenarios.

Key Contribution

Treating negative samples differently according to their similarity to the positive sample yields a 13.1% improvement on MRBench, revealing the critical role of grain-level information.

Abstract

Leveraging Multimodal Large Language Models (MLLMs) via contrastive learning has become a mainstream paradigm for improving Universal Multimodal Retrieval (UMR). However, previous works have overlooked grain blindness when adapting the contrastive paradigm to retrieval tasks. Grain blindness refers to the model's tendency to overlook grain-level information contained in the query, which is crucial for handling complex queries effectively. It stems from contrastive learning treating samples as a binary classification (positive/negative) while ignoring the different information carried by each negative sample. To address this, we argue that negatives should be treated differently according to their similarity to the positive sample, enabling the model to learn distinct grain information from each negative. In this paper, we introduce ELVA, a simple but effective rule-based RL framework that mitigates grain blindness through ranking-driven MLLMs. 1) Instead of relying on reward models, we extend Reinforcement Learning with Verifiable Rewards (RLVR) to retrieval tasks, allowing the model to explore new ranking behaviors without explicit ranking labels. 2) By utilizing rule-based rewards, our approach jointly optimizes the ranking of negative samples while enlarging the similarity gap between positives and negatives. To measure grain blindness more precisely, we further introduce MRBench, a new benchmark specifically designed for multi-grain query scenarios. ELVA achieves state-of-the-art results across standard retrieval benchmarks, and its notable 13.1% improvement on MRBench further demonstrates its effectiveness in alleviating grain blindness.
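The contrast drawn in the abstract can be illustrated with a small sketch. The abstract does not give ELVA's actual reward rules, so everything below is an assumption made for illustration: `info_nce` is the standard contrastive loss for a single query, and `ranking_reward` is a hypothetical rule-based reward that (a) checks whether the model orders negatives consistently with their similarity to the positive and (b) checks whether the positive outscores every negative by a margin. The function names, the `margin` and `temperature` parameters, and the toy scores are not taken from the paper.

```python
import math


def info_nce(pos_score, neg_scores, temperature=0.05):
    """Standard InfoNCE loss for one query: -log softmax of the positive."""
    logits = [pos_score / temperature] + [s / temperature for s in neg_scores]
    m = max(logits)  # subtract the max for numerical stability
    log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_denom - logits[0]


def ranking_reward(neg_scores, neg_sim_to_pos, pos_score, margin=0.0):
    """Hypothetical rule-based reward (an assumption, not the paper's rule).

    order_term: fraction of negative pairs whose model scores are ordered
                the same way as their similarity to the positive.
    gap_term:   1.0 if the positive beats the best negative by `margin`.
    """
    n = len(neg_scores)
    pairs = [
        (i, j)
        for i in range(n)
        for j in range(i + 1, n)
        if neg_sim_to_pos[i] != neg_sim_to_pos[j]
    ]
    agree = sum(
        1
        for i, j in pairs
        if (neg_scores[i] - neg_scores[j])
        * (neg_sim_to_pos[i] - neg_sim_to_pos[j]) > 0
    )
    order_term = agree / len(pairs) if pairs else 0.0
    gap_term = 1.0 if pos_score - max(neg_scores) > margin else 0.0
    return order_term + gap_term


# Toy example: three negatives, from most to least similar to the positive.
neg_sim_to_pos = [0.8, 0.5, 0.1]
pos_score = 0.9
scores_ordered = [0.6, 0.4, 0.2]   # harder negatives scored higher
scores_reversed = [0.2, 0.4, 0.6]  # same scores, opposite assignment

print(info_nce(pos_score, scores_ordered), info_nce(pos_score, scores_reversed))
print(ranking_reward(scores_ordered, neg_sim_to_pos, pos_score),
      ranking_reward(scores_reversed, neg_sim_to_pos, pos_score))
```

In this toy case the InfoNCE loss is the same for both score assignments, because it depends only on the multiset of negative scores and not on which negative received which score. The hypothetical ranking reward, by contrast, distinguishes the two assignments, which is the kind of signal a ranking-driven objective could exploit.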

Multimodal Models | Recommendation & Information Retrieval

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
