This paper introduces EZPC, a method for explaining CLIP's zero-shot image recognition predictions by projecting CLIP's joint image-text embeddings into a human-understandable concept space learned from language descriptions. EZPC uses alignment and reconstruction objectives to ensure concept activations preserve CLIP's semantic structure while remaining interpretable, without requiring additional supervision. Experiments across five datasets demonstrate that EZPC maintains CLIP's accuracy while providing meaningful concept-level explanations.
Unlock CLIP's black box: EZPC reveals the "why" behind zero-shot image recognition by grounding predictions in human-understandable concepts, without sacrificing accuracy.
Large-scale vision-language models such as CLIP have achieved remarkable success in zero-shot image recognition, yet their predictions remain largely opaque to human understanding. In contrast, Concept Bottleneck Models provide interpretable intermediate representations by reasoning through human-defined concepts, but they rely on concept supervision and cannot generalize to unseen classes. We introduce EZPC, which bridges these two paradigms by explaining CLIP's zero-shot predictions through human-understandable concepts. Our method projects CLIP's joint image-text embeddings into a concept space learned from language descriptions, enabling faithful and transparent explanations without additional supervision. The model learns this projection via a combination of alignment and reconstruction objectives, ensuring that concept activations preserve CLIP's semantic structure while remaining interpretable. Extensive experiments on five benchmark datasets (CIFAR-100, CUB-200-2011, Places365, ImageNet-100, and ImageNet-1k) demonstrate that our approach maintains CLIP's strong zero-shot classification accuracy while providing meaningful concept-level explanations. By grounding open-vocabulary predictions in explicit semantic concepts, our method offers a principled step toward interpretable and trustworthy vision-language models. Code is available at https://github.com/oonat/ezpc.
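The abstract does not give the exact form of the projection or of the two objectives, so the following is only a minimal sketch of the general idea, not the authors' implementation. It assumes a linear projection `W`, mean-squared-error forms for both the reconstruction and alignment terms, and random vectors standing in for CLIP embeddings. All function and variable names here are hypothetical.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-8):
    """Scale vectors to unit length along the given axis."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def project_to_concepts(emb, W):
    """Map embeddings (..., d) to concept activations (..., k) with a linear map W (d, k)."""
    return emb @ W

def reconstruction_loss(emb, W, concept_emb):
    """Assumed form: MSE between embeddings and their reconstruction from concept activations."""
    acts = project_to_concepts(emb, W)   # (n, k)
    recon = acts @ concept_emb           # (n, k) @ (k, d) -> (n, d)
    return float(np.mean((recon - emb) ** 2))

def alignment_loss(img_emb, txt_emb, W):
    """Assumed form: MSE between original similarities and concept-space cosine similarities."""
    clip_sim = img_emb @ txt_emb.T
    img_c = l2_normalize(project_to_concepts(img_emb, W))
    txt_c = l2_normalize(project_to_concepts(txt_emb, W))
    return float(np.mean((img_c @ txt_c.T - clip_sim) ** 2))

def explain(img_emb, class_emb, W, concept_names, top_k=3):
    """Predict a class in concept space and list the concepts contributing most to its score."""
    img_c = project_to_concepts(img_emb, W)    # (k,)
    cls_c = project_to_concepts(class_emb, W)  # (c, k)
    scores = cls_c @ img_c                     # (c,)
    pred = int(np.argmax(scores))
    contrib = img_c * cls_c[pred]              # per-concept share of the winning score
    top = np.argsort(contrib)[::-1][:top_k]
    return pred, [(concept_names[i], float(contrib[i])) for i in top]

# Toy setup: random unit vectors stand in for CLIP embeddings (assumption).
rng = np.random.default_rng(0)
d, k, n_classes = 16, 8, 4
concept_emb = l2_normalize(rng.normal(size=(k, d)))  # stand-in for concept text embeddings
W = np.linalg.pinv(concept_emb)                      # (d, k); untrained placeholder for the learned projection
images = l2_normalize(rng.normal(size=(5, d)))
classes = l2_normalize(rng.normal(size=(n_classes, d)))

total = reconstruction_loss(images, W, concept_emb) + alignment_loss(images, classes, W)
pred, top_concepts = explain(images[0], classes, W, [f"concept_{i}" for i in range(k)])
```

In this sketch the per-concept contributions sum to the score of the predicted class, which is what makes the explanation read as an additive breakdown. A trained projection would replace the pseudo-inverse placeholder, and the relative weighting of the two loss terms is left open because the source does not state it.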