Mar 16, 2026 | arXiv:2603.15166

DAIT: Distillation from Vision-Language Models to Lightweight Classifiers with Adaptive Intermediate Teacher Transfer

AI Summary

The paper introduces Distillation with Adaptive Intermediate Teacher transfer (DAIT), a knowledge distillation method for transferring knowledge from large Vision-Language Models (VLMs) to lightweight classifiers for fine-grained visual categorization (FGVC). DAIT inserts a trainable intermediate teacher network between the frozen VLM and the student. Supervised by the target FGVC task, this teacher adapts the VLM's representations before they are distilled, which mitigates architectural misalignment and filters out task-irrelevant information. Experiments on FGVC benchmarks report performance gains of up to 12.63% on FGVC-Aircraft and 8.34% on CUB-200-2011 compared to direct distillation methods.

Key Contribution

Achieves performance gains of up to 12.63% on fine-grained visual categorization by distilling knowledge from VLMs into lightweight classifiers through a task-aligned, adaptive intermediate teacher.

Abstract

Large-scale Vision-Language Models (VLMs) encode rich multimodal semantics that are highly beneficial for fine-grained visual categorization (FGVC). However, their prohibitive computational cost hinders practical deployment in resource-constrained environments. Although knowledge distillation helps transfer VLM capacity to lightweight classifiers, conventional distillation mechanisms, which transfer directly from a generic VLM to a compact student, often yield suboptimal results because of severe architectural misalignment and the introduction of task-irrelevant information. To alleviate this limitation, we propose Distillation with Adaptive Intermediate Teacher transfer (DAIT), which facilitates adaptive knowledge transfer from VLMs to lightweight students. DAIT introduces a trainable intermediate teacher that learns to transfer frozen VLM representations under explicit supervision from the target fine-grained task. This intermediate teacher adaptively enhances discriminative visual cues, thereby producing compact and task-aligned knowledge that can be reliably distilled into lightweight models. Extensive evaluations on multiple FGVC benchmarks with diverse student architectures demonstrate that our method achieves performance gains of 12.63% and 8.34% on the FGVC-Aircraft and CUB-200-2011 datasets, respectively, establishing DAIT as a principled paradigm for transferring from general-purpose VLMs to deployable fine-grained recognition models.
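The abstract does not state the loss functions DAIT uses, so the following is only a minimal sketch of the kind of objective a student could be trained with once a task-aligned intermediate teacher is available. It assumes the standard temperature-scaled softened-logit distillation loss combined with cross-entropy on the ground-truth label. The function names, the `temperature` and `alpha` values, and the example logits are all hypothetical and are not taken from the paper.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, label):
    """Cross-entropy of the logits against a hard class label."""
    return -math.log(softmax(logits)[label])


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def distillation_loss(student_logits, teacher_logits, label,
                      temperature=4.0, alpha=0.5):
    """Assumed student objective: soft-target KL plus hard-label CE.

    `teacher_logits` stands in for the output of the intermediate
    teacher (not the raw VLM); this pairing is an assumption.
    """
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    # The T^2 factor keeps soft-target gradients on a comparable scale.
    soft = kl_divergence(p_teacher, p_student) * temperature ** 2
    hard = cross_entropy(student_logits, label)
    return alpha * soft + (1 - alpha) * hard


# Hypothetical three-class logits for a single image.
student = [2.0, 0.5, -1.0]
agreeing_teacher = [2.0, 0.5, -1.0]
disagreeing_teacher = [-1.0, 0.5, 2.0]

loss_agree = distillation_loss(student, agreeing_teacher, label=0)
loss_disagree = distillation_loss(student, disagreeing_teacher, label=0)
```

Under these assumptions, the intermediate teacher itself would be trained first with `cross_entropy` on the FGVC labels while the VLM stays frozen, and the student would then minimize `distillation_loss` against the teacher's logits. When the student already matches the teacher, the soft term vanishes and only the weighted hard-label term remains.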

Computer Vision, Inference & Quantization, Multimodal Models

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
