Guangzhou University, HUST, NJU, NTU, UCL, WHU | Mar 14, 2026 | arXiv:2605.27894

Towards Unified Vision-Language Models with Incomplete Multi-Modal Inputs

Xiang Fang, Wanlong Fang, Changshuo Wang, Keke Tang, Daizong Liu, Siyi Wang, Wei Ji

AI Summary

This paper introduces a unified video-language model (VLM) designed to handle incomplete multi-modal inputs, addressing the inconsistency between training and testing data caused by missing modalities like video or language. The proposed model acts as a plug-and-play module that enhances existing VLMs' performance across various multi-modal tasks when faced with incomplete data. Experiments demonstrate the effectiveness of the approach in improving the robustness and generalization of VLMs in scenarios with missing modalities.

Key Contribution

The paper proposes a plug-and-play module that lets existing VLMs process incomplete multi-modal inputs, such as those caused by sensor failures or data privacy constraints, with the aim of avoiding severe performance drops when a modality is missing.

Abstract

Video-Language Models (VLMs) have demonstrated impressive multi-modal reasoning capabilities across diverse computer vision applications. However, these VLMs are task-specific and assume that both video and language inputs are complete. In practice, real-world VLM applications may encounter deactivated sensors (e.g., cameras that are unavailable because of data privacy), which yields modality-incomplete data and an inconsistency between training and testing data. While naively feeding incomplete inputs can weaken training generalization ability and even lead to training failure, the potential risks of such inputs to the safety and trustworthiness of VLMs have been largely neglected. To this end, we make the first attempt to propose a unified incomplete video-language model that processes incomplete multi-modal inputs. Extensive experimental results show that our method can serve as a plug-and-play module for previous works, improving their performance on various multi-modal tasks.

Computer Vision, Multimodal Models

Citation Metrics

Citations: 11

Influential citations: 0

References: 84

Year: 2026

Venue: Proceedings of the AAAI Conference on Artificial Intelligence
