Macquarie, Meituan, UNSW, ZJU | Apr 30, 2026 | arXiv:2604.27600

Purifying Multimodal Retrieval: Fragment-Level Evidence Selection for RAG

Xihang Wang, Zihan Wang, Chengkai Huang, Cao Liu, Ke Zeng, Quan Z. Sheng, Lina Yao

AI Summary

This paper introduces Fragment-level Evidence Selection for RAG (FES-RAG), a novel framework that addresses the limitations of existing Multimodal Retrieval-Augmented Generation (MRAG) methods by selecting atomic multimodal fragments instead of entire documents as grounding evidence. FES-RAG decomposes retrieved documents into sentence-level textual and region-level visual fragments and uses Fragment Information Gain (FIG) to measure the marginal contribution of each fragment to the MLLM's generation confidence. Experiments on the M2RAG benchmark demonstrate that FES-RAG outperforms state-of-the-art document-level MRAG methods, achieving up to a 27% relative improvement in CIDEr while reducing context length.

Key Contribution

Stop drowning your MLLMs in irrelevant context: FES-RAG shows that selecting multimodal fragments instead of whole documents yields up to a 27% relative improvement in CIDEr on the M2RAG benchmark while shortening the context.

Abstract

Multimodal Retrieval-Augmented Generation (MRAG) is widely adopted to ground Multimodal Large Language Models (MLLMs) in external evidence and reduce hallucinations. Despite its success, most existing MRAG frameworks treat retrieved evidence as indivisible documents, implicitly assuming that all content within a document is equally informative. In practice, however, sometimes only a small fraction of a document is relevant to a given query, while the remaining content introduces substantial noise that may degrade performance. We address this fundamental limitation by reframing MRAG as a fine-grained evidence selection problem. We propose Fragment-level Evidence Selection for RAG (FES-RAG), a framework that selects atomic multimodal fragments rather than entire documents as grounding evidence. FES-RAG decomposes retrieved multimodal documents into sentence-level textual fragments and region-level visual fragments, enabling precise identification of evidence that directly supports generation. To guide fragment selection, we introduce Fragment Information Gain (FIG), a principled metric that measures the marginal contribution of each fragment to the MLLM's generation confidence. Based on FIG, we distill fragment-level utility judgments from a high-capacity MLLM into a lightweight selector, achieving accurate evidence selection with low inference overhead. Experiments on the M2RAG benchmark show that FES-RAG consistently outperforms state-of-the-art document-level MRAG methods, achieving up to a 27% relative improvement in CIDEr. By selecting fewer yet more informative fragments, our approach substantially reduces context length while improving factual accuracy and generation coherence.
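
The abstract describes FIG only as the marginal contribution of a fragment to the MLLM's generation confidence and gives no formula. The Python sketch below illustrates one plausible reading, under stated assumptions: the gain of a fragment is taken as the confidence with that fragment alone minus the confidence with no evidence, and fragments with positive gain are ranked and kept. The names `fragment_gains`, `select_fragments`, and `toy_confidence` are hypothetical, the word-overlap scorer is a stand-in for a real MLLM confidence, and only textual fragments are modeled. This is not the authors' implementation.

```python
from typing import Callable, List, Sequence

# Assumed interface: confidence(query, context_fragments) -> float.
# In a real system this might be the mean log-probability an MLLM assigns
# to a reference answer; here it is left abstract.
ConfidenceFn = Callable[[str, Sequence[str]], float]


def fragment_gains(query: str, fragments: Sequence[str],
                   confidence: ConfidenceFn) -> List[float]:
    """Gain of each fragment: confidence with it minus confidence without evidence."""
    base = confidence(query, [])
    return [confidence(query, [frag]) - base for frag in fragments]


def select_fragments(query: str, fragments: Sequence[str],
                     confidence: ConfidenceFn, top_k: int = 2) -> List[str]:
    """Keep up to top_k fragments with strictly positive gain, in document order."""
    gains = fragment_gains(query, fragments, confidence)
    # Rank by descending gain; break ties by original position.
    ranked = sorted(range(len(fragments)), key=lambda i: (-gains[i], i))
    keep = sorted(i for i in ranked[:top_k] if gains[i] > 0)
    return [fragments[i] for i in keep]


def toy_confidence(reference: str) -> ConfidenceFn:
    """Stand-in scorer: fraction of reference-answer words present in query + context."""
    ref_words = set(reference.lower().split())

    def score(query: str, context: Sequence[str]) -> float:
        seen = set(" ".join([query, *context]).lower().split())
        return len(ref_words & seen) / len(ref_words)

    return score


if __name__ == "__main__":
    conf = toy_confidence("the eiffel tower is in paris")
    frags = [
        "the tower stands in paris",
        "tickets are sold online",
        "paris hosts many museums",
    ]
    for frag, gain in zip(frags, fragment_gains("where is the eiffel tower", frags, conf)):
        print(f"{gain:+.3f}  {frag}")
    print(select_fragments("where is the eiffel tower", frags, conf, top_k=2))
```

In this toy run the fragment about ticket sales adds nothing to the scorer's confidence and is dropped, which mirrors the idea of passing fewer but more informative fragments to the generator. The lightweight selector mentioned in the abstract would be trained to predict such gains without calling the large model at inference time; that distillation step is not sketched here.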

Multimodal Models | Natural Language Processing | Recommendation & Information Retrieval

