DUT, HUST, PKU, SCU, Shenzhen University
Mar 24, 2024
arXiv:2605.29793

Fewer Steps, Better Performance: Efficient Cross-Modal Clip Trimming for Video Moment Retrieval Using Language

Xiang Fang, Daizong Liu, Wanlong Fang, Pan Zhou, Zichuan Xu, Wenzheng Xu, Junyang Chen, Renfu Li

AI Summary

The paper introduces SpotVMR, a method for efficient video moment retrieval that addresses the limitations of fixed-length clip sampling in existing approaches. SpotVMR learns to identify promising video regions conditioned on the language query using a novel clip search model and low-cost semantic indexing features. By trimming the video into query-relevant clips, SpotVMR reduces boundary and reasoning biases, leading to improved retrieval performance and efficiency, as demonstrated on three challenging datasets.

Key Contribution

Stop wasting compute on irrelevant video clips: SpotVMR trims videos to only the query-relevant moments, boosting retrieval performance while slashing computational cost.

Abstract

Given an untrimmed video and a sentence query, video moment retrieval using language (VMR) aims to locate the target moment that is relevant to the query. Because untrimmed videos are long, almost all existing VMR methods first sparsely down-sample each video into multiple fixed-length clips and then reason over multi-modal interactions between the query feature and expensive clip features, which is infeasible for real-world videos that span hours. Down-sampling into fixed-length clips can also filter out query-related frames, which blurs the boundary of the target moment and causes adjacent irrelevant frames to be taken as new boundaries; this easily leads to cross-modal misalignment and introduces both boundary bias and reasoning bias. To this end, we propose an efficient approach, SpotVMR, that trims the video to the query-relevant clips. SpotVMR can also serve as a plug-and-play module that makes state-of-the-art VMR methods more efficient while maintaining good retrieval performance. Specifically, we first design a novel clip search model that learns to identify promising video regions to search, conditioned on the language query. Then, we introduce a set of low-cost semantic indexing features that capture the context of objects and interactions and suggest where to search for the query-relevant moment. In addition, a distillation loss is used to address the optimization issues that arise from end-to-end joint training of the clip selector and the VMR model. Extensive experiments on three challenging datasets demonstrate the effectiveness of SpotVMR.
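The idea of trimming a video to query-relevant clips before running an expensive VMR model can be illustrated with a minimal sketch. It is not the paper's method: SpotVMR learns its clip search model, whereas this sketch substitutes a fixed cosine similarity between hypothetical low-cost clip features and a query embedding. The function names (`select_clips`, `merge_into_spans`), the feature vectors, and the budget `k` are all assumptions made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def select_clips(clip_feats, query_feat, k):
    """Return the indices of the k clips most similar to the query, in temporal order.

    Stand-in for a learned, query-conditioned clip selector (assumption).
    """
    scores = [cosine(feat, query_feat) for feat in clip_feats]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


def merge_into_spans(indices):
    """Merge sorted clip indices into inclusive (start, end) spans of consecutive clips."""
    spans = []
    for i in indices:
        if spans and i == spans[-1][1] + 1:
            spans[-1] = (spans[-1][0], i)
        else:
            spans.append((i, i))
    return spans


if __name__ == "__main__":
    # Toy 2-D "semantic indexing" features for five clips (made-up values).
    clip_feats = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9], [1.0, 0.05]]
    query_feat = [0.0, 1.0]
    kept = select_clips(clip_feats, query_feat, k=2)
    print(kept, merge_into_spans(kept))
```

Only the clips inside the returned spans would then be passed to the downstream VMR model, which is where the compute saving would come from under these assumptions.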

Computer Vision, Multimodal Models, Natural Language Processing

Citation Metrics

Citations: 44

Influential citations: 0

References: 119

Year: 2024

Venue: AAAI Conference on Artificial Intelligence
