BaiduApr 28, 2026arXiv:2604.25273

Combating Visual Neglect and Semantic Drift in Large Multimodal Models for Enhanced Cross-Modal Retrieval

Guosheng Zhang, Linkai Liu, Keyao Wang, Haixiao Yue, Zhiwen Tan, Xiao Tan

AI Summary

This paper introduces Salient Subject-Aware Multimodal Embedding (SSA-ME) to address visual neglect and semantic drift in large multimodal models (LMMs) for cross-modal retrieval. SSA-ME identifies salient visual concepts using LMMs and visual experts, then aligns cross-modal attention with these regions using a saliency-guided objective. Experiments on the MMEB benchmark demonstrate state-of-the-art performance, indicating that subject-level modeling significantly improves multimodal retrieval accuracy and interpretability.

Key Contribution

LMMs struggle to ground text queries in the right parts of images, but explicitly modeling salient visual subjects can dramatically improve cross-modal retrieval.

Abstract

Despite significant progress in Unified Multimodal Retrieval (UMR) powered by Large Multimodal Models (LMMs), existing embedding methods primarily focus on sample-level objectives via contrastive learning while overlooking the crucial subject-level semantics. This limitation hinders the model's ability to group semantically coherent subjects in complex multimodal queries, manifesting as semantic alignment deviation--where models fail to accurately localize salient text-referred regions in visual content. Moreover, without explicit guidance to model salient visual subjects, LMMs tend to over-rely on textual cues, resulting in visual modality neglect and suboptimal utilization of visual knowledge. To this end, we propose Salient Subject-Aware Multimodal Embedding (SSA-ME), a novel framework designed to enhance fine-grained representation learning through saliency-aware modeling. SSA-ME leverages LMMs and visual experts to identify and emphasize salient visual concepts in image-text pairs, and introduces a saliency-guided objective to better align cross-modal attention with semantically meaningful regions. Additionally, a feature regeneration module recalibrates visual features based on the derived saliency maps, ensuring a balanced and semantically coherent integration across modalities. Extensive experiments show that our method achieves state-of-the-art performance on the MMEB benchmark, demonstrating that incorporating subject-level modeling substantially improves multimodal retrieval. Comprehensive qualitative analyses further illustrate the interpretability and effectiveness of our approach.

Computer Vision Multimodal Models Recommendation & Information Retrieval

Citation Metrics

Citations0

Influential citations0

References38

Year2026

VenueN/A

Related Papers

Finding related papers...

Search

Combating Visual Neglect and Semantic Drift in Large Multimodal Models for Enhanced Cross-Modal Retrieval

Related Papers