ETH, BUPT, CAS, PolyU | Feb 12, 2026 | arXiv:2602.11636

ScalSelect: Scalable Training-Free Multimodal Data Selection for Efficient Visual Instruction Tuning

Jiahuai Mao, Yuzhuo Miao, Shijie Lian, Xiaopeng Lin, Cong Huang, Lei Zhang, Kai Chen

AI Summary

The paper introduces ScalSelect, a training-free multimodal data selection method for Visual Instruction Tuning (VIT) that addresses the computational expense and inefficiency of training VLMs on large, redundant datasets. ScalSelect constructs sample representations by extracting visual features most attended by instruction tokens in the target VLM and then identifies samples whose representations best approximate the dominant subspace of the full dataset. Experiments show ScalSelect achieves comparable or superior performance to full-data training using only a fraction of the data across multiple VLMs and datasets.

Key Contribution

ScalSelect, a scalable training-free data selection method, achieves over 97.5% of full-data VIT performance using only 16% of the data.

Abstract

Large-scale Visual Instruction Tuning (VIT) has become a key paradigm for advancing the performance of vision-language models (VLMs) across various multimodal tasks. However, training on large-scale datasets is computationally expensive and inefficient because the data is redundant, which motivates multimodal data selection to improve training efficiency. Existing data selection methods for VIT require either costly training or gradient computation. Training-free alternatives often depend on proxy models or datasets, instruction-agnostic representations, and pairwise similarity with quadratic complexity, limiting scalability and representation fidelity. In this work, we propose ScalSelect, a scalable training-free multimodal data selection method with linear-time complexity with respect to the number of samples, eliminating the need for external models or auxiliary datasets. ScalSelect first constructs sample representations by extracting the visual features most attended by instruction tokens in the target VLM, capturing instruction-relevant information. It then identifies samples whose representations best approximate the dominant subspace of the full dataset representations, enabling scalable importance scoring without pairwise comparisons. Extensive experiments across multiple VLMs, datasets, and selection budgets demonstrate that ScalSelect achieves over 97.5% of the performance of training on the full dataset using only 16% of the data, and even outperforms full-data training in some settings. The code is available at https://github.com/ChangtiWu/ScalSelect.
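The two stages described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the abstract does not give the exact pooling rule or scoring function, so the sketch assumes (a) an attention-weighted average over the `top_m` most-attended visual tokens and (b) a score equal to each sample's squared projection onto the top-`k` eigenvectors of the d x d Gram matrix. The function names and the parameters `top_m` and `k` are hypothetical, and random arrays stand in for real VLM hidden states and attention maps.

```python
import numpy as np


def instruction_guided_representation(visual_feats, attn, top_m=4):
    """Pool the visual tokens that receive the most attention from instruction tokens.

    visual_feats: (num_visual_tokens, d) visual-token features.
    attn: (num_instruction_tokens, num_visual_tokens) positive attention weights.
    top_m is a hypothetical hyperparameter, not taken from the paper.
    """
    received = attn.mean(axis=0)                  # attention each visual token receives
    top_idx = np.argsort(received)[-top_m:]       # indices of the most-attended tokens
    weights = received[top_idx] / received[top_idx].sum()
    return weights @ visual_feats[top_idx]        # (d,) attention-weighted average


def subspace_scores(reps, k=8):
    """Score each sample by its energy in the dominant k-dimensional subspace.

    Only a d x d Gram matrix is formed, so the cost grows linearly with the
    number of samples n; no n x n pairwise similarity matrix is built.
    """
    gram = reps.T @ reps                          # (d, d)
    _, eigvecs = np.linalg.eigh(gram)             # eigenvalues in ascending order
    basis = eigvecs[:, -k:]                       # top-k directions, shape (d, k)
    proj = reps @ basis                           # (n, k)
    return (proj ** 2).sum(axis=1)                # squared projection norm per sample


def select_subset(reps, budget=0.16, k=8):
    """Return indices of the highest-scoring samples under a fractional budget."""
    n_keep = max(1, int(round(budget * reps.shape[0])))
    scores = subspace_scores(reps, k=k)
    return np.argsort(scores)[::-1][:n_keep]


# Toy usage: 100 samples, 32 visual tokens, 5 instruction tokens, d = 16.
rng = np.random.default_rng(0)
reps = np.stack([
    instruction_guided_representation(
        rng.normal(size=(32, 16)), rng.random(size=(5, 32))
    )
    for _ in range(100)
])
selected = select_subset(reps, budget=0.16)
```

With a 16% budget on 100 toy samples, `selected` holds 16 distinct indices. In a real pipeline the features and attention weights would come from a forward pass of the target VLM, and the authors' repository should be consulted for the actual scoring rule.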

Data Curation & Synthetic Data | Multimodal Models | Training Efficiency & Optimization

Citation Metrics

Citations: 0

Influential citations: 0

References: 44

Year: 2026

Venue: N/A
