Feb 23, 2026 · arXiv:2602.19756

Multimodal Dataset Distillation Made Simple by Prototype-Guided Data Synthesis

AI Summary

This paper introduces a learning-free multimodal dataset distillation framework that uses CLIP embeddings and an unCLIP decoder to synthesize image-text pairs from prototypes. By avoiding full-dataset training and joint optimization, the method generalizes across architectures better than existing optimization-based distillation techniques. Experiments show state-of-the-art cross-architecture generalization when distilling multimodal datasets for efficient training of vision-language models.

Key Contribution

Skip the costly training: this method distills multimodal datasets without any learning, using CLIP to extract aligned embeddings and obtain prototypes, and an unCLIP decoder to synthesize images from them.

Abstract

Recent advances in multimodal learning have achieved remarkable success across diverse vision-language tasks. However, such progress heavily relies on large-scale image-text datasets, making training costly and inefficient. Prior efforts in dataset filtering and pruning attempt to mitigate this issue, but still require relatively large subsets to maintain performance and fail under very small subsets. Dataset distillation offers a promising alternative, yet existing multimodal dataset distillation methods require full-dataset training and joint optimization of image pixels and text features, making them architecture-dependent and limiting cross-architecture generalization. To overcome this, we propose a learning-free dataset distillation framework that eliminates the need for large-scale training and optimization while enhancing generalization across architectures. Our method uses CLIP to extract aligned image-text embeddings, obtains prototypes, and employs an unCLIP decoder to synthesize images, enabling efficient and scalable multimodal dataset distillation. Extensive experiments demonstrate that our approach consistently outperforms optimization-based dataset distillation and subset selection methods, achieving state-of-the-art cross-architecture generalization.
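The pipeline described in the abstract (extract aligned embeddings, obtain prototypes, decode) can be illustrated with a minimal sketch. The abstract does not say how prototypes are obtained or how captions are paired with them, so the spherical k-means clustering and nearest-caption retrieval below are assumptions for illustration, not the authors' procedure. The function names `l2_normalize`, `kmeans_prototypes`, and `distill` are hypothetical, and the inputs are assumed to be precomputed CLIP image and text embeddings. The unCLIP decoding step needs a pretrained model, so it appears only as a comment.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length, as is usual for CLIP embeddings."""
    return x / np.maximum(np.linalg.norm(x, axis=-1, keepdims=True), eps)


def kmeans_prototypes(emb, k, iters=20, seed=0):
    """Spherical k-means (an assumed choice): assign by cosine similarity,
    then update each center to the normalized mean of its members."""
    rng = np.random.default_rng(seed)
    emb = l2_normalize(emb)
    centers = emb[rng.choice(len(emb), size=k, replace=False)]
    for _ in range(iters):
        assign = np.argmax(emb @ centers.T, axis=1)
        for j in range(k):
            members = emb[assign == j]
            if len(members) > 0:  # keep the old center if a cluster empties
                centers[j] = members.mean(axis=0)
        centers = l2_normalize(centers)
    assign = np.argmax(emb @ centers.T, axis=1)
    return centers, assign


def distill(image_emb, text_emb, k):
    """Return k image-embedding prototypes and, for each one, the index of
    the caption whose text embedding is closest to it (assumed pairing rule)."""
    image_emb = l2_normalize(image_emb)
    text_emb = l2_normalize(text_emb)
    prototypes, _ = kmeans_prototypes(image_emb, k)
    caption_idx = np.argmax(prototypes @ text_emb.T, axis=1)
    # Each prototype would next be passed to an unCLIP decoder to synthesize
    # an image; that step requires a pretrained model and is omitted here.
    return prototypes, caption_idx
```

No step in this sketch trains a model or backpropagates through one, which is the sense in which such a pipeline is learning-free: the only iterative computation is clustering in a fixed embedding space.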

Data Curation & Synthetic Data · Multimodal Models · Training Efficiency & Optimization

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
