This paper introduces a unified video-language model (VLM) designed to handle incomplete multi-modal inputs, addressing the inconsistency between training and testing data caused by missing modalities like video or language. The proposed model acts as a plug-and-play module that enhances existing VLMs' performance across various multi-modal tasks when faced with incomplete data. Experiments demonstrate the effectiveness of the approach in improving the robustness and generalization of VLMs in scenarios with missing modalities.
VLMs can now handle real-world sensor failures and data privacy constraints without catastrophic performance drops, thanks to a new plug-and-play module for incomplete multi-modal inputs.
Video-Language Models (VLMs) have demonstrated impressive multi-modal reasoning capabilities across diverse computer vision applications. However, these VLMs are task-specific and assume that both video and language inputs are complete. In practice, real-world VLM applications may face deactivated sensors (e.g., cameras disabled for data privacy reasons), yielding modality-incomplete data and leading to inconsistency between training and testing data. While straightforward incomplete input can boost training generalization ability and lead to training failure, its potential risks to VLMs regarding safety and trustworthiness have been largely neglected. To this end, we make the first attempt to propose a unified incomplete video-language model to process incomplete multi-modal inputs. Extensive experimental results show that our method can serve as a plug-and-play module for previous works to improve their performance in various multi-modal tasks.