Mar 5, 2026 · arXiv:2603.05235

Reclaiming Lost Text Layers for Source-Free Cross-Domain Few-Shot Learning

Zhenyu Zhang, Guangyao Chen, Yixiong Zou, Yuhua Li, Ruixuan Li

AI Summary

The paper investigates why removing certain middle layers of CLIP's text encoder improves performance in Source-Free Cross-Domain Few-Shot Learning (SF-CDFSL). The authors find that these "lost layers" contain beneficial information that goes underutilized because of visual gaps. To address this, they propose a method that re-utilizes the information in these layers by guiding the re-learning of the visual branch under domain shifts, improving performance across various datasets and backbones.

Key Contribution

CLIP's "lost" text encoder layers actually contain valuable information for cross-domain few-shot learning, and a method that teaches the model to re-utilize them, rather than remove them, improves performance.

Abstract

Source-Free Cross-Domain Few-Shot Learning (SF-CDFSL) focuses on fine-tuning with limited training data from target domains (e.g., medical or satellite images), where CLIP has recently shown promising results due to its generalizability to downstream tasks. Current works indicate that CLIP's text encoder is more suitable for cross-domain tasks; however, we find that removing certain middle layers of the text encoder, which we call the Lost Layers, can effectively improve performance in SF-CDFSL. In this paper, we delve into this phenomenon for a deeper understanding. We discover that, rather than being harmful for the SF-CDFSL task, the information in these layers is actually beneficial, but visual gaps prevent it from being fully utilized, making these layers seem redundant. Based on this understanding, unlike current works that simply remove these layers, we propose a method that teaches the model to re-utilize the information in these lost layers at both the layer and encoder levels, guiding the re-learning of the visual branch under domain shifts. Our approach effectively addresses the issue of underutilized information in the text encoder. Extensive experiments across various settings, backbones (CLIP, SigLip, PE-Core), and tasks (4 CDFSL datasets and 10 Meta-dataset datasets) demonstrate the effectiveness of our method. Code is available at https://github.com/zhenyuZ-HUST/CVPR26-VtT.
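
The layer-removal observation in the abstract can be pictured with a small toy sketch. This is not the authors' code and does not use CLIP: the helpers `make_residual_layer`, `run_encoder`, and `middle_layer_ablations` are hypothetical names introduced only for illustration, and each "layer" is a trivial residual update standing in for a transformer block. The sketch only shows how one might skip selected middle layers of a layer stack; in a real study, each ablated encoder would presumably be scored by few-shot accuracy on the target domain.

```python
from typing import AbstractSet, Callable, List, Sequence, Set

Vector = List[float]
Layer = Callable[[Vector], Vector]


def make_residual_layer(scale: float) -> Layer:
    """Toy stand-in for a transformer block: a residual update x + scale * x."""
    def layer(x: Vector) -> Vector:
        return [v + scale * v for v in x]
    return layer


def run_encoder(
    layers: Sequence[Layer],
    x: Vector,
    skip: AbstractSet[int] = frozenset(),
) -> Vector:
    """Apply layers in order; a skipped index passes its input through unchanged."""
    for index, layer in enumerate(layers):
        if index in skip:
            continue  # the removed layer behaves like the identity
        x = layer(x)
    return x


def middle_layer_ablations(num_layers: int, first: int, last: int) -> List[Set[int]]:
    """List single-layer removals restricted to the middle range [first, last]."""
    if not 0 <= first <= last < num_layers:
        raise ValueError("middle range must lie inside the layer stack")
    return [{i} for i in range(first, last + 1)]


# Hypothetical 6-layer toy encoder and a 2-dimensional "token" (assumed values).
layers = [make_residual_layer(0.1 * (i + 1)) for i in range(6)]
token = [1.0, 2.0]
full = run_encoder(layers, token)

for skip in middle_layer_ablations(len(layers), 2, 3):
    ablated = run_encoder(layers, token, skip)
    print(sorted(skip), ablated)
```

Treating a removed layer as the identity is an assumption that matches residual architectures, where dropping a block leaves the skip connection in place.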

Computer Vision · Multimodal Models · Natural Language Processing

Citation Metrics

Citations: 0

Influential citations: 0

References: 78

Year: 2026

Venue: N/A
