HUST, UCL, WHU | Mar 14, 2026 | arXiv:2605.27920

Rethinking Video-Language Model from the Language Input Perspective

Xiang Fang, Wanlong Fang, Changshuo Wang, Xiaoye Qu, Daizong Liu

AI Summary

This paper addresses a limitation of current Video-Language Models (VLMs): they rely on predefined text templates, which are impractical to prepare and restrictive to use. The authors introduce a plug-and-play framework that generates diverse positive and negative texts from the original inputs to target specific text components. The framework then applies an attribute-based text reasoning strategy to extract fine-grained textual semantics and uses videos as guidance for cross-modal bridging via a self-weighted loss. According to the reported experiments, this improves the performance of existing VLMs.

Key Contribution

VLMs can be significantly improved by reasoning over diverse, generated text inputs, rather than relying on restrictive, predefined templates.

Abstract

Driven by the wave of large language models, Video-Language Models (VLMs) have become a significant yet challenging technology for bridging the gap between videos and texts. Although previous VLM works have made significant progress, almost all of them implicitly assume that all texts are predefined by a specific template. In real-world applications, such a strict assumption is impossible to satisfy because 1) predefining all the texts is extremely time-consuming and labor-intensive, and 2) these predefined text inputs are too restrictive and user-unfriendly, which limits their applications. We observe that, given a video input, texts with similar semantics but different templates lead to varying performance. To this end, we propose a novel plug-and-play framework for various VLM-based methods to fully bridge videos and texts. Specifically, we first generate positive and negative texts from the original ones to target specific text components. Then, we propose an attribute-based text reasoning strategy to mine fine-grained textual semantics of the generated texts. Finally, we utilize videos as guidance to conduct cross-modal bridging by designing a self-weighted loss. Extensive experiments show that the proposed method can serve as a plug-and-play module to effectively improve the performance of state-of-the-art VLMs.
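The abstract does not give the form of the self-weighted loss, so the following is only an illustrative sketch of how a video embedding could guide the weighting of generated positive and negative texts. Everything in it is an assumption made for illustration: the function name `self_weighted_loss`, the cosine-similarity scoring, the softmax weighting, and the `margin` parameter are hypothetical and are not taken from the paper.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_weighted_loss(video, positives, negatives, margin=0.2):
    """Hypothetical margin loss whose weights come from the similarities themselves.

    Assumption (not from the paper): positives that are far from the video and
    negatives that are close to it are "hard", so they receive larger weights.
    """
    pos_sims = [cosine(video, p) for p in positives]
    neg_sims = [cosine(video, n) for n in negatives]

    # Weights are derived from the scores, with no extra hyperparameters.
    pos_weights = softmax([-s for s in pos_sims])  # low similarity -> high weight
    neg_weights = softmax(neg_sims)                # high similarity -> high weight

    pos_term = sum(w * s for w, s in zip(pos_weights, pos_sims))
    neg_term = sum(w * s for w, s in zip(neg_weights, neg_sims))

    # Hinge: penalize only when negatives are not separated by the margin.
    return max(0.0, margin + neg_term - pos_term)


# Toy 2-D embeddings standing in for real video and text features.
video = [1.0, 0.0]
well_separated = self_weighted_loss(video, [[1.0, 0.0], [0.9, 0.1]], [[0.0, 1.0], [-1.0, 0.0]])
confused = self_weighted_loss(video, [[0.0, 1.0]], [[1.0, 0.0]])
print(well_separated, confused)
```

In this sketch the loss vanishes when the generated positives already sit closer to the video than the negatives by at least the margin, and grows when the model confuses them. A real implementation would operate on batched tensors from the VLM's encoders rather than Python lists.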

Computer Vision, Multimodal Models, Natural Language Processing

Citation Metrics

Citations: 10

Influential citations: 0

References: 66

Year: 2026

Venue: Proceedings of the AAAI Conference on Artificial Intelligence
