CAS, PKU | Feb 24, 2026 | arXiv:2602.21035

Not Just What's There: Enabling CLIP to Comprehend Negated Visual Descriptions Without Fine-tuning

Junhao Xiao, Jun Xiao, Zhiyu Wu, Hao Lin, Yahui Liu, Xiaoran Zhao, Zixu Wang, Zejiang He

AI Summary

The paper addresses the failure of Vision-Language Models (VLMs) such as CLIP to interpret negated visual descriptions accurately. It introduces CLIPGlasses, a plug-and-play framework with a Lens module that disentangles negated semantics from text embeddings and a Frame module that predicts context-aware repulsion strength. Integrating that repulsion strength into a modified similarity computation reduces false positive matches. The paper reports competitive in-domain performance and stronger cross-domain generalization, particularly in low-resource scenarios, all without fine-tuning CLIP.

Key Contribution

CLIP can now understand "no dog" without any fine-tuning, thanks to a plug-and-play module that disentangles negated semantics and penalizes false positive matches.

Abstract

Vision-Language Models (VLMs) like CLIP struggle to understand negation, often embedding affirmatives and negatives similarly (e.g., matching "no dog" with dog images). Existing methods refine negation understanding via fine-tuning CLIP's text encoder, risking overfitting. In this work, we propose CLIPGlasses, a plug-and-play framework that enhances CLIP's ability to comprehend negated visual descriptions. CLIPGlasses adopts a dual-stage design: a Lens module disentangles negated semantics from text embeddings, and a Frame module predicts context-aware repulsion strength, which is integrated into a modified similarity computation to penalize alignment with negated semantics, thereby reducing false positive matches. Experiments show that CLIP equipped with CLIPGlasses achieves competitive in-domain performance and outperforms state-of-the-art methods in cross-domain generalization. Its superiority is especially evident under low-resource conditions, indicating stronger robustness across domains.
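The abstract does not give the exact form of the modified similarity, so the following is only a minimal sketch of the general idea, under stated assumptions: the base image-text cosine similarity is reduced by a penalty proportional to how well the image aligns with the negated concept. The function `repulsed_similarity`, the subtraction-based scoring rule, the toy 2-D embeddings, and the fixed `repulsion` value are all hypothetical stand-ins. In the paper, the negated embedding would come from the Lens module and the repulsion strength from the Frame module.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def repulsed_similarity(image_emb, text_emb, negated_emb, repulsion):
    """Hypothetical scoring rule (an assumption, not the paper's formula):
    base similarity minus a weighted penalty for aligning with the
    negated concept. Only positive alignment is penalized."""
    base = cosine(image_emb, text_emb)
    penalty = repulsion * max(0.0, cosine(image_emb, negated_emb))
    return base - penalty


# Toy 2-D embeddings, made up for illustration only.
dog_image = [1.0, 0.0]
empty_park_image = [0.0, 1.0]
caption_emb = [0.8, 0.6]  # "a park with no dog": still close to "dog"
negated_emb = [1.0, 0.0]  # stand-in for a Lens-style output: the concept "dog"
repulsion = 0.9           # stand-in for a Frame-style predicted strength

plain_dog = cosine(dog_image, caption_emb)
plain_park = cosine(empty_park_image, caption_emb)
adj_dog = repulsed_similarity(dog_image, caption_emb, negated_emb, repulsion)
adj_park = repulsed_similarity(empty_park_image, caption_emb, negated_emb, repulsion)

print("plain scores:   ", plain_dog, plain_park)
print("adjusted scores:", adj_dog, adj_park)
```

In this toy setup, plain cosine similarity ranks the dog image above the empty park for the "no dog" caption, which is the false positive the abstract describes. The penalized score reverses that ranking while leaving the image that lacks the negated concept unchanged.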

Computer Vision, Multimodal Models, Natural Language Processing

Citation Metrics

Citations: 0

Influential citations: 0

References: 47

Year: 2026

Venue: N/A
