This paper reviews Visual Word Sense Disambiguation (VWSD), a multimodal approach to resolving lexical ambiguity by incorporating visual cues. It surveys VWSD techniques from 2016-2025, covering feature-based, graph-based, and contrastive embedding methods, including recent advances using CLIP, diffusion models, and LLMs. The review highlights that fine-tuned CLIP-based models and LLM-enhanced VWSD systems achieve 6-8% MRR gains over zero-shot baselines, while also identifying challenges like contextual limitations, biases, and the need for multilingual datasets.
Visual cues can significantly improve word sense disambiguation: fine-tuned CLIP-based and LLM-enhanced systems gain up to 6-8% MRR over zero-shot baselines, but model bias and the scarcity of multilingual data remain significant hurdles.
This paper offers a mini review of Visual Word Sense Disambiguation (VWSD), a multimodal extension of traditional Word Sense Disambiguation (WSD) that tackles lexical ambiguity in vision-language tasks. While conventional WSD depends only on text and lexical resources, VWSD uses visual cues to identify the intended meaning of an ambiguous word from minimal textual context. The review traces developments from early multimodal fusion methods to recent frameworks built on contrastive models such as CLIP, diffusion-based text-to-image generation, and large language model (LLM) support. Studies from 2016 to 2025 are examined to chart the evolution of VWSD through feature-based, graph-based, and contrastive embedding techniques, with particular attention to prompt engineering, fine-tuning, and adaptation to multiple languages. Quantitative results show that fine-tuned CLIP-based models and LLM-enhanced VWSD systems consistently outperform zero-shot baselines, achieving gains of up to 6-8% in Mean Reciprocal Rank (MRR). However, challenges remain, including limited contextual information, model bias toward frequent senses, a lack of multilingual datasets, and the need for better evaluation frameworks. The analysis highlights the growing convergence of CLIP alignment, diffusion generation, and LLM reasoning as the path toward robust, context-aware, and multilingual disambiguation systems.
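To make the reported numbers concrete: in VWSD, a system ranks a set of candidate images for each ambiguous word-plus-context query, and MRR averages the reciprocal rank of the correct image across queries. Below is a minimal sketch of that computation; the function name and the toy image IDs are illustrative, not taken from the paper or any benchmark.

```python
def mean_reciprocal_rank(rankings, gold):
    """Compute MRR for a VWSD-style ranking task.

    rankings: one ranked list of candidate image IDs per query.
    gold: the correct image ID for each query, in the same order.
    """
    total = 0.0
    for ranking, answer in zip(rankings, gold):
        rank = ranking.index(answer) + 1  # ranks are 1-based
        total += 1.0 / rank               # correct at rank 1 scores 1.0
    return total / len(gold)

# Three toy queries, three candidate images each (hypothetical IDs).
rankings = [
    ["img_a", "img_b", "img_c"],  # gold ranked 1st -> 1.0
    ["img_b", "img_a", "img_c"],  # gold ranked 2nd -> 0.5
    ["img_c", "img_b", "img_a"],  # gold ranked 3rd -> 1/3
]
gold = ["img_a", "img_a", "img_a"]
print(round(mean_reciprocal_rank(rankings, gold), 3))  # -> 0.611
```

Under this metric, a 6-8% MRR gain roughly means the correct image moves noticeably closer to the top of the ranking on average, e.g. more queries resolved at rank 1 instead of rank 2.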