Mar 30, 2026 · arXiv:2603.28211

Explaining CLIP Zero-shot Predictions Through Concepts

Onat Ozdemir, Anders Christensen, Stephan Alaniz, Zeynep Akata, Emre Akbas

AI Summary

This paper introduces EZPC, a method for explaining CLIP's zero-shot image recognition predictions by projecting CLIP's joint image-text embeddings into a human-understandable concept space learned from language descriptions. EZPC uses alignment and reconstruction objectives to ensure concept activations preserve CLIP's semantic structure while remaining interpretable, without requiring additional supervision. Experiments across five datasets demonstrate that EZPC maintains CLIP's accuracy while providing meaningful concept-level explanations.

Key Contribution

EZPC explains why CLIP makes a given zero-shot image recognition prediction by grounding that prediction in human-understandable concepts, while maintaining CLIP's zero-shot accuracy.

Abstract

Large-scale vision-language models such as CLIP have achieved remarkable success in zero-shot image recognition, yet their predictions remain largely opaque to human understanding. In contrast, Concept Bottleneck Models provide interpretable intermediate representations by reasoning through human-defined concepts, but they rely on concept supervision and cannot generalize to unseen classes. We introduce EZPC, which bridges these two paradigms by explaining CLIP's zero-shot predictions through human-understandable concepts. Our method projects CLIP's joint image-text embeddings into a concept space learned from language descriptions, enabling faithful and transparent explanations without additional supervision. The model learns this projection via a combination of alignment and reconstruction objectives, ensuring that concept activations preserve CLIP's semantic structure while remaining interpretable. Extensive experiments on five benchmark datasets (CIFAR-100, CUB-200-2011, Places365, ImageNet-100, and ImageNet-1k) demonstrate that our approach maintains CLIP's strong zero-shot classification accuracy while providing meaningful concept-level explanations. By grounding open-vocabulary predictions in explicit semantic concepts, our method offers a principled step toward interpretable and trustworthy vision-language models. Code is available at https://github.com/oonat/ezpc.
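
To make the idea of scoring classes through a concept space concrete, here is a minimal toy sketch. It is not the authors' implementation: EZPC learns its projection with alignment and reconstruction objectives, whereas this sketch assumes fixed, hand-written concept vectors and uses plain cosine similarity as the "projection". All names, vectors, and concepts below are hypothetical, and the 3-dimensional embeddings stand in for real CLIP embeddings.

```python
import math

def normalize(v):
    # Scale a vector to unit length so dot products become cosine similarities.
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def concept_activations(embedding, concept_embs):
    # ASSUMPTION: activation = cosine similarity to each concept vector.
    # The paper instead learns this projection; this is a stand-in.
    e = normalize(embedding)
    return [dot(e, normalize(c)) for c in concept_embs]

def explain_prediction(image_emb, class_embs, concept_embs, concept_names, top_k=2):
    # Score each class by comparing image and class activations in concept space,
    # then report which concepts contributed most to the winning class.
    img_act = concept_activations(image_emb, concept_embs)
    scores, contributions = {}, {}
    for name, emb in class_embs.items():
        cls_act = concept_activations(emb, concept_embs)
        contrib = [a * b for a, b in zip(img_act, cls_act)]
        contributions[name] = contrib
        scores[name] = sum(contrib)
    predicted = max(scores, key=scores.get)
    ranked = sorted(zip(concept_names, contributions[predicted]),
                    key=lambda pair: pair[1], reverse=True)
    return predicted, ranked[:top_k]

# Hypothetical toy data (not taken from the paper).
concept_names = ["has wings", "has wheels", "has fur"]
concept_embs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
class_embs = {"bird": [0.9, 0.1, 0.3], "car": [0.1, 0.95, 0.0]}
image_emb = [0.8, 0.05, 0.4]

predicted, top = explain_prediction(image_emb, class_embs, concept_embs, concept_names)
print(predicted, [name for name, _ in top])
```

Because the class score is a sum of per-concept terms, each term can be read as that concept's contribution to the prediction, which is the kind of concept-level explanation the abstract describes.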

Computer Vision · Interpretability & Mechanistic Interp · Multimodal Models

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
