Mar 9, 2026 | arXiv:2603.08224

SAVE: Speech-Aware Video Representation Learning for Video-Text Retrieval

Ruixiang Zhao, Zhihao Xu, Bangxiang Lan, Zijie Xin, Jingyu Liu, Xirong Li

AI Summary

The paper introduces SAVE, a video-text retrieval method that explicitly incorporates speech information to address the limitations of vision-only CLIP-based methods. SAVE adds a dedicated speech branch for more effective speech embedding and uses soft-ALBEF for early vision-audio alignment, which facilitates the fusion of these modalities. Experiments on five benchmarks show that SAVE compares favorably against the state of the art, outperforming AVIGATE on the SumR metric on all five.

Key Contribution

By explicitly modeling speech, SAVE leapfrogs existing audio-visual methods for video-text retrieval, achieving substantial gains over the state-of-the-art.

Abstract

For video-text retrieval, the use of CLIP has been a de facto choice. Since CLIP provides only image and text encoders, this consensus has led to a biased paradigm that entirely ignores the soundtrack of videos. While several attempts have been made to reintroduce audio -- typically by incorporating an audio encoder and fusing its output with visual features -- these methods face two challenges: ineffective representation of speech content and suboptimal vision-audio fusion. To address these issues jointly, we propose SAVE, a Speech Aware Video rEpresentation learning method. SAVE improves upon AVIGATE, a SOTA audiovisual method, with a dedicated speech branch for more effective speech embedding. Furthermore, we introduce soft-ALBEF for early vision-audio alignment that facilitates fusion. Extensive experiments on five benchmarks show that SAVE compares favorably against the SOTA, outperforming AVIGATE by +4.1% on MSRVTT-9k, +1.9% on MSRVTT-7k, +2.5% on VATEX, +9.8% on Charades, and +2.1% on LSMDC, in terms of the SumR metric.
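The abstract reports gains in SumR without defining it. In video-text retrieval work, SumR is commonly taken to be the sum of Recall@1, Recall@5, and Recall@10; the sketch below assumes that convention and a text-to-video setup in which query i matches video i. The function names and the toy similarity matrix are illustrative only and do not come from the paper.

```python
def recall_at_k(sim, k):
    """Text-to-video Recall@K in percent.

    sim[i][j] is the similarity between text query i and video j.
    Assumption: the ground-truth video for query i is video i.
    """
    hits = 0
    for i, row in enumerate(sim):
        # Zero-based rank of the correct video: how many videos score strictly higher.
        rank = sum(1 for s in row if s > row[i])
        if rank < k:
            hits += 1
    return 100.0 * hits / len(sim)


def sum_r(sim, ks=(1, 5, 10)):
    """SumR, taken here as R@1 + R@5 + R@10 (an assumed convention)."""
    return sum(recall_at_k(sim, k) for k in ks)


# Toy 3x3 similarity matrix (made-up numbers, not results from the paper).
toy_sim = [
    [0.9, 0.1, 0.2],  # query 0: correct video ranked first
    [0.8, 0.3, 0.1],  # query 1: correct video ranked second
    [0.2, 0.1, 0.7],  # query 2: correct video ranked first
]
toy_r1 = recall_at_k(toy_sim, 1)       # two of three queries hit at rank 1
toy_sum_r = sum_r(toy_sim, ks=(1, 2))  # smaller cutoffs, since there are only 3 videos
```

Under this convention SumR is a sum of percentages, so it can range from 0 to 300 when three cutoffs are used.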

Multimodal Models | Recommendation & Information Retrieval | Speech & Audio

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
