Mar 3, 2026 · arXiv:2603.02959

Semi-Supervised Few-Shot Adaptation of Vision-Language Models

AI Summary

This paper addresses the challenge of class imbalance in few-shot adaptation of Vision-Language Models (VLMs) for medical image classification. The authors introduce a semi-supervised approach that uses unlabeled data to generate text-informed pseudo-labels, which then augment the limited labeled data during adaptation. The proposed method reduces labeling effort by more than 50% in low-shot regimes and improves performance in scenarios with significant class imbalance.

Key Contribution

Propagate text-informed pseudo-labels to your unlabeled medical images and cut labeling effort by more than 50% in low-shot regimes when adapting vision-language models.

Abstract

Vision-language models (VLMs) pre-trained on large, heterogeneous data sources are becoming increasingly popular, providing rich multi-modal embeddings that enable efficient transfer to new tasks. A particularly relevant application is few-shot adaptation, where only a handful of annotated examples are available to adapt the model through multi-modal linear probes. In medical imaging, specialized VLMs have shown promising performance in zero- and few-shot image classification, which is valuable for mitigating the high cost of expert annotations. However, challenges remain in extremely low-shot regimes: the inherent class imbalances in medical tasks often lead to underrepresented categories, penalizing overall model performance. To address this limitation, we propose leveraging unlabeled data by introducing an efficient semi-supervised solver that propagates text-informed pseudo-labels during few-shot adaptation. The proposed method enables lower-budget annotation pipelines for adapting VLMs, reducing labeling effort by >50% in low-shot regimes.
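The abstract does not spell out the solver, so the following is only a minimal sketch of the general idea, not the authors' method. It assumes image and class-text embeddings have already been extracted from a VLM, builds soft pseudo-labels for unlabeled images from their cosine similarity to the text embeddings, and then fits a linear probe (initialized from the text embeddings) on the few labeled shots plus the pseudo-labeled pool. All function names, the fixed pseudo-labels, the plain gradient-descent loop, and the `temperature`, `lr`, and `unlabeled_weight` values are illustrative assumptions.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length so dot products are cosine similarities."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def softmax(logits, axis=-1):
    """Numerically stable softmax."""
    z = logits - logits.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


def text_informed_pseudo_labels(unlabeled_feats, text_protos, temperature=0.01):
    """Soft pseudo-labels from similarity to class text embeddings (assumed scheme)."""
    sims = l2_normalize(unlabeled_feats) @ l2_normalize(text_protos).T
    return softmax(sims / temperature)


def adapt_linear_probe(labeled_feats, labels, unlabeled_feats, text_protos,
                       steps=200, lr=0.01, unlabeled_weight=0.5, temperature=0.01):
    """Fit class weights on labeled shots plus pseudo-labeled unlabeled data.

    Hypothetical stand-in for the paper's semi-supervised solver: the
    pseudo-labels are computed once from the text embeddings and kept fixed.
    """
    num_classes = text_protos.shape[0]
    x_l = l2_normalize(labeled_feats)
    x_u = l2_normalize(unlabeled_feats)
    y_l = np.eye(num_classes)[labels]  # one-hot targets for the few shots
    y_u = text_informed_pseudo_labels(x_u, text_protos, temperature)

    # Start from the zero-shot classifier defined by the text embeddings.
    W = l2_normalize(text_protos).copy()
    for _ in range(steps):
        p_l = softmax(x_l @ W.T / temperature)
        p_u = softmax(x_u @ W.T / temperature)
        # Cross-entropy gradient w.r.t. W; the 1/temperature factor is folded into lr.
        grad = (p_l - y_l).T @ x_l / len(x_l)
        grad += unlabeled_weight * (p_u - y_u).T @ x_u / len(x_u)
        W -= lr * grad
    return W
```

Predictions for new images would then be `(l2_normalize(feats) @ W.T).argmax(axis=1)`. Keeping the unlabeled term weighted below the labeled one is a design choice made here so that noisy pseudo-labels cannot dominate the few trusted annotations.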

Computer Vision · Multimodal Models · Training Efficiency & Optimization

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
