Mar 3, 2026

arXiv:2603.02557

CAPT: Confusion-Aware Prompt Tuning for Reducing Vision-Language Misalignment

Maoyuan Shao, Yutong Gao, Xinyang Huang, Chuang Zhu, Lijuan Sun, Guoshun Nan

AI Summary

The paper introduces Confusion-Aware Prompt Tuning (CAPT) to mitigate systematic misclassifications in vision-language models arising from confusion between visually and semantically similar categories. CAPT constructs a Confusion Bank to model stable confusion relationships and employs Semantic and Sample Confusion Miners to capture inter-class confusion and retrieve representative misclassified instances, respectively. A Multi-Granularity Difference Expert module then unifies confusion information across semantic and sample levels, leading to improved discriminability and generalization.

Key Contribution

By explicitly modeling confusion patterns between similar categories, vision-language models can learn to correct their own systematic errors; the method resolves 50.72 percent of confusable sample pairs.

Abstract

Vision-language models like CLIP have achieved remarkable progress in cross-modal representation learning, yet suffer from systematic misclassifications among visually and semantically similar categories. We observe that such confusion patterns are not random but persistently occur between specific category pairs, revealing the model's intrinsic bias and limited fine-grained discriminative ability. To address this, we propose CAPT, a Confusion-Aware Prompt Tuning framework that enables models to learn from their own misalignment. Specifically, we construct a Confusion Bank to explicitly model stable confusion relationships across categories and misclassified samples. On this basis, we introduce a Semantic Confusion Miner (SEM) to capture global inter-class confusion through semantic difference and commonality prompts, and a Sample Confusion Miner (SAM) to retrieve representative misclassified instances from the bank and capture sample-level cues through a Diff-Manner Adapter that integrates global and local contexts. To further unify confusion information across different granularities, a Multi-Granularity Difference Expert (MGDE) module is designed to jointly leverage semantic- and sample-level experts for more robust confusion-aware reasoning. Extensive experiments on 11 benchmark datasets demonstrate that our method significantly reduces confusion-induced errors while enhancing the discriminability and generalization of both base and novel classes, successfully resolving 50.72 percent of confusable sample pairs. Code will be released at https://github.com/greatest-gourmet/CAPT.
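
The abstract does not spell out how the Confusion Bank is built, so the following is only a minimal sketch of one plausible reading: count how often each (true class, predicted class) pair is confused, keep the pairs that recur, and store the indices of the misclassified samples behind them. The function name `build_confusion_bank`, the `min_count` threshold, and the toy labels are all assumptions made for illustration, not the authors' implementation.

```python
from collections import Counter, defaultdict


def build_confusion_bank(labels, predictions, min_count=2):
    """Map each recurring (true, predicted) confusion pair to its sample indices.

    `min_count` is a hypothetical threshold for calling a pair "stable";
    the paper may define stability differently.
    """
    pair_counts = Counter()
    pair_samples = defaultdict(list)
    for index, (true_cls, pred_cls) in enumerate(zip(labels, predictions)):
        if true_cls != pred_cls:  # only misclassified samples enter the bank
            pair = (true_cls, pred_cls)
            pair_counts[pair] += 1
            pair_samples[pair].append(index)
    # Keep only the pairs that are confused often enough to count as stable.
    return {
        pair: pair_samples[pair]
        for pair, count in pair_counts.items()
        if count >= min_count
    }


# Toy data (invented): "cat" is confused with "lynx" twice, "dog" with "wolf" once.
labels = ["cat", "cat", "cat", "dog", "dog", "fox"]
predictions = ["lynx", "lynx", "cat", "dog", "wolf", "fox"]
bank = build_confusion_bank(labels, predictions)
print(bank)
```

With the default threshold, only the cat/lynx pair survives, because the dog/wolf confusion occurs once. A structure like this could then serve both roles the abstract mentions: the pair keys for class-level confusion and the stored indices for retrieving representative misclassified instances.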

Computer Vision, Multimodal Models

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
