This paper introduces Multi-modal Evidence Grounding (MEG), a novel metric that quantifies the semantic contribution of retrieved multimodal evidence in Retrieval-Augmented Generation (RAG) systems by focusing on high-IDF tokens. Based on MEG, the authors propose MEG-RAG, a framework that trains a multimodal reranker to align retrieved evidence with the semantic anchors of the ground truth. Experiments on the M$^2$RAG benchmark demonstrate that MEG-RAG outperforms strong baselines by prioritizing high-value content according to semantic grounding, improving both accuracy and multimodal consistency.
Semantic grounding, not token probability, is the key to better multimodal RAG.
Multimodal Retrieval-Augmented Generation (MRAG) addresses key limitations of Multimodal Large Language Models (MLLMs), such as hallucination and outdated knowledge. However, current MRAG systems struggle to distinguish whether retrieved multimodal data truly supports the semantic core of an answer or merely provides superficial relevance. Existing metrics often rely on heuristic position-based confidence, which fails to capture the informational density of multimodal entities. To address this, we propose Multi-modal Evidence Grounding (MEG), a semantic-aware metric that quantifies the contribution of retrieved evidence. Unlike standard confidence measures, MEG utilizes Semantic Certainty Anchoring, focusing on high-IDF information-bearing tokens that better capture the semantic core of the answer. Building on MEG, we introduce MEG-RAG, a framework that trains a multimodal reranker to align retrieved evidence with the semantic anchors of the ground truth. By prioritizing high-value content based on semantic grounding rather than token probability distributions, MEG-RAG improves the accuracy and multimodal consistency of generated outputs. Extensive experiments on the M$^2$RAG benchmark show that MEG-RAG consistently outperforms strong baselines and demonstrates robust generalization across different teacher models.
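The core idea of Semantic Certainty Anchoring, weighting an answer's rare, information-bearing tokens more heavily than common ones, can be illustrated with a minimal sketch. The anchor-selection rule (a quantile cutoff over IDF values) and the coverage-based score below are assumptions for illustration, not the paper's exact formulation of MEG:

```python
import math
from collections import Counter

def idf_table(corpus):
    """Inverse document frequency for each token in a corpus of
    tokenized documents (a list of token lists)."""
    n = len(corpus)
    df = Counter()
    for doc in corpus:
        df.update(set(doc))  # count each token once per document
    return {t: math.log(n / df[t]) for t in df}

def meg_score(evidence_tokens, answer_tokens, idf, anchor_quantile=0.5):
    """Illustrative MEG-style score: the fraction of the answer's IDF
    mass, restricted to high-IDF 'anchor' tokens, that is covered by
    the retrieved evidence. Returns a value in [0, 1]."""
    answer_vocab = set(answer_tokens)
    weights = sorted(idf.get(t, 0.0) for t in answer_vocab)
    if not weights:
        return 0.0
    # Hypothetical anchoring rule: keep tokens at or above the
    # chosen IDF quantile as semantic anchors.
    cutoff = weights[min(int(len(weights) * anchor_quantile), len(weights) - 1)]
    anchors = {t for t in answer_vocab if idf.get(t, 0.0) >= cutoff}
    total = sum(idf.get(t, 0.0) for t in anchors)
    if total == 0.0:
        return 0.0
    evidence_vocab = set(evidence_tokens)
    covered = sum(idf.get(t, 0.0) for t in anchors if t in evidence_vocab)
    return covered / total
```

Under this sketch, evidence that shares the answer's rare anchor tokens scores high even if it misses frequent function words, matching the abstract's claim that informational density, not surface overlap, should drive reranking.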