This paper introduces FedDTL, a federated learning framework for Vision-Language Models (VLMs) that decouples image and text encoders between clients and the server to address optimization inconsistencies. FedDTL employs server-client modality alignment for global semantic updates and uses a two-stage local fine-tuning process involving supervised learning for warm-start and reinforcement learning to enhance generalization. Experiments across various benchmarks demonstrate that FedDTL effectively balances global task adaptation and generalization under diverse federated learning data distributions.
Decoupling image and text encoders in federated learning achieves a better balance between global task adaptation and generalization for vision-language models.
Federated Learning (FL) with pre-trained Vision-Language Models (VLMs) has emerged as a promising paradigm for various downstream tasks. By leveraging the strong representations of these models, recent studies improve task adaptation under insufficient local data while preserving generalization. However, these methods emphasize fully local optimization with simple parameter aggregation, which can amplify inter-client optimization inconsistency and intra-client over-specialization under heterogeneous and full-data FL settings, making it difficult to balance global task adaptation and generalization. To address these challenges, we propose FedDTL, a novel federated VLM framework that decouples the image encoder and text encoder across clients and the server. Through decoupled encoder training with server-client modality alignment, FedDTL promotes coherent global semantic updates and reduces inter-client optimization inconsistency, improving global task adaptation. To further mitigate intra-client over-specialization, we introduce a two-stage local fine-tuning procedure, in which a supervised fine-tuning stage enables a rapid and reliable warm-start, followed by a reinforcement learning stage that enhances generalization. Extensive experiments on multiple benchmarks, including label skew and feature shift, demonstrate that FedDTL achieves an effective balance between global task adaptation and generalization under various FL data distributions in both few-shot and full-data regimes.
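The abstract does not state the objective used for server-client modality alignment. As a rough illustration only, the sketch below assumes a CLIP-style setup in which the server supplies one text embedding per class and a client scores its image embeddings against them with a temperature-scaled cosine similarity and a cross-entropy loss. The function name `alignment_loss`, the `temperature` value, and the toy embeddings are all hypothetical and are not taken from FedDTL.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    # Scale each row to unit length so dot products become cosine similarities.
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def alignment_loss(image_emb, text_emb, labels, temperature=0.07):
    # Assumed objective: cross-entropy over cosine-similarity logits between
    # client image embeddings (N x D) and server class text embeddings (C x D).
    logits = l2_normalize(image_emb) @ l2_normalize(text_emb).T / temperature
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-log_probs[np.arange(len(labels)), labels].mean())

# Toy data: three orthogonal class text embeddings held by the "server".
text_emb = np.eye(3, 8)
labels = np.array([0, 2, 1, 0])

# Image embeddings that coincide with the text embedding of their own class.
aligned_images = text_emb[labels]
# Image embeddings that coincide with the text embedding of a wrong class.
mismatched_images = text_emb[(labels + 1) % 3]

loss_aligned = alignment_loss(aligned_images, text_emb, labels)
loss_mismatched = alignment_loss(mismatched_images, text_emb, labels)
print(loss_aligned, loss_mismatched)
```

Under this assumed objective, the loss is near zero when a client's image embeddings agree with the server's text embeddings for the correct class and large when they point at a wrong class, which is the kind of signal a shared text encoder could use to keep client updates semantically consistent.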