KIT | Feb 18, 2026 | arXiv:2602.16590

A Contrastive Learning Framework Empowered by Attention-based Feature Adaptation for Street-View Image Classification

Qi You, Yitai Cheng, James Haworth, Zichao Zeng

AI Summary

The paper introduces CLIP-MHAdapter, a computationally efficient adaptation method for CLIP that enhances street-view image attribute classification by incorporating multi-head self-attention on patch tokens to model inter-patch dependencies. This approach addresses the limitation of existing methods that rely on global image embeddings, which are insufficient for capturing fine-grained attributes in complex street scenes. Results on the Global StreetScapes dataset demonstrate that CLIP-MHAdapter achieves state-of-the-art or competitive accuracy across eight attribute classification tasks with only 1.4 million trainable parameters.
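To make "multi-head self-attention on patch tokens" concrete, here is a minimal pure-Python sketch of the general mechanism. It is not the authors' implementation: the function name `multi_head_self_attention` and the toy patch vectors are hypothetical, and as a loudly stated simplification the queries, keys and values are the tokens themselves, whereas a real layer would first apply learned linear projections.

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of floats.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def multi_head_self_attention(tokens, num_heads):
    """Self-attention over patch tokens, without learned projections.

    tokens: list of N patch vectors, each a list of D floats.
    Assumption: queries, keys and values are the tokens themselves;
    a real layer would apply learned linear maps first.
    """
    dim = len(tokens[0])
    if dim % num_heads != 0:
        raise ValueError("token width must be divisible by num_heads")
    head_dim = dim // num_heads
    scale = 1.0 / math.sqrt(head_dim)
    output = [[] for _ in tokens]
    for h in range(num_heads):
        lo, hi = h * head_dim, (h + 1) * head_dim
        heads = [t[lo:hi] for t in tokens]  # this head's slice of every patch
        for i, query in enumerate(heads):
            # Scaled dot-product similarity of patch i to every patch.
            scores = [scale * sum(q * k for q, k in zip(query, key))
                      for key in heads]
            weights = softmax(scores)
            # Weighted average of all patches' values for this head.
            mixed = [sum(w * v[j] for w, v in zip(weights, heads))
                     for j in range(head_dim)]
            output[i].extend(mixed)  # concatenate heads back to width D
    return output

# Three toy "patch tokens" of width 4, mixed by two heads of width 2.
patches = [[1.0, 0.0, 0.0, 1.0], [0.0, 1.0, 1.0, 0.0], [1.0, 1.0, 0.0, 0.0]]
mixed = multi_head_self_attention(patches, num_heads=2)
```

Each output row is a weighted average of all patch vectors, which is how every patch's representation can come to depend on the other patches in the image rather than on a single global embedding.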

Key Contribution

You can now get state-of-the-art or competitive street-view image classification from CLIP with a small adapter of about 1.4M trainable parameters that attends over local image patches.

Abstract

Street-view image attribute classification is a vital downstream task of image classification, enabling applications such as autonomous driving, urban analytics, and high-definition map construction. It remains computationally demanding whether training from scratch, initialising from pre-trained weights, or fine-tuning large models. Although pre-trained vision-language models such as CLIP offer rich image representations, existing adaptation or fine-tuning methods often rely on their global image embeddings, limiting their ability to capture fine-grained, localised attributes essential in complex, cluttered street scenes. To address this, we propose CLIP-MHAdapter, a variant of the current lightweight CLIP adaptation paradigm that appends a bottleneck MLP equipped with multi-head self-attention operating on patch tokens to model inter-patch dependencies. With approximately 1.4 million trainable parameters, CLIP-MHAdapter achieves superior or competitive accuracy across eight attribute classification tasks on the Global StreetScapes dataset, attaining new state-of-the-art results while maintaining low computational cost. The code is available at https://github.com/SpaceTimeLab/CLIP-MHAdapter.
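As a rough illustration of why an adapter of this kind stays small, the sketch below counts the trainable parameters of a generic bottleneck adapter: a down-projection, a standard multi-head self-attention layer in the bottleneck, and an up-projection. The function name `adapter_param_count` and the sizes 768, 256 and 512 are hypothetical, chosen only for the arithmetic. They are not the paper's configuration and are not meant to reproduce its 1.4 million figure.

```python
def adapter_param_count(in_dim, bottleneck_dim, out_dim):
    """Count weights and biases of a generic bottleneck adapter.

    Assumed layout (hypothetical, not taken from the paper):
      linear down-projection -> multi-head self-attention -> linear up-projection
    """
    # Down-projection: weight matrix plus bias.
    down = in_dim * bottleneck_dim + bottleneck_dim
    # Standard self-attention has four square projections (query, key,
    # value, output), each with a bias. The number of heads only splits
    # these matrices, so it does not change the count.
    attention = 4 * (bottleneck_dim * bottleneck_dim + bottleneck_dim)
    # Up-projection: weight matrix plus bias.
    up = bottleneck_dim * out_dim + out_dim
    return down + attention + up

# Illustrative sizes only (assumption).
total = adapter_param_count(in_dim=768, bottleneck_dim=256, out_dim=512)
print(total)  # 591616
```

Under these assumed sizes the adapter has well under a million parameters, and the count scales with the bottleneck width rather than with the size of the frozen backbone.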

Computer Vision, Multimodal Models, Training Efficiency & Optimization

Citation Metrics

Citations: 0

Influential citations: 0

References: 31

Year: 2026

Venue: N/A
