The paper introduces ClipTBP, a clip-pair-based framework for video moment retrieval that improves visual-linguistic similarity learning by modeling the relationships among the multiple answer segments relevant to a query. It targets a limitation of existing methods, which struggle to reject visually similar but irrelevant segments, by introducing a clip-level alignment loss. The framework also applies both a main and an auxiliary boundary loss for more accurate temporal boundary prediction, and it shows consistent performance improvements when applied to various existing models, especially in ambiguous query scenarios.
By explicitly modeling relationships between multiple relevant video segments, ClipTBP consistently improves video moment retrieval across existing models, especially when queries are ambiguous.
Video moment retrieval is the task of retrieving the segments of a video that correspond to a given text query. Recent studies have improved multimodal alignment through snippet-level visual-linguistic similarity learning and transformer-based temporal boundary regression. However, existing models compute similarity at the snippet level and ignore the relationships between the multiple answer segments that match a single query. As a result, they are easily influenced by visually similar segments in the surrounding context and struggle to exclude segments irrelevant to the query. To address these issues, we propose ClipTBP, a clip-pair temporal boundary prediction framework based on boundary-aware learning. ClipTBP introduces a clip-level alignment loss that explicitly learns the semantic relationship between answer segments, and it predicts accurate temporal boundaries by applying both a main boundary loss and an auxiliary boundary loss. ClipTBP consistently improves performance when applied to various existing models and yields more robust boundary prediction even in ambiguous query scenarios.
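The abstract names three loss terms but does not give their formulas, so the following is only a minimal sketch of how such terms could fit together, not the paper's formulation. Every concrete choice here is an assumption made for illustration: mean pooling of snippet features into clip embeddings, cosine similarity as the alignment measure, L1 distance for both boundary losses, the `aux_weight` parameter, and all function names.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0

def mean_pool(snippets, start, end):
    """Average snippet features over the inclusive index range [start, end]."""
    span = snippets[start:end + 1]
    dim = len(span[0])
    return [sum(vec[d] for vec in span) / len(span) for d in range(dim)]

def clip_alignment_loss(snippets, query, answer_spans):
    """Hypothetical clip-level loss (an assumption, not the paper's definition):
    pull each answer clip toward the query and toward every other answer clip
    that matches the same query."""
    clips = [mean_pool(snippets, s, e) for s, e in answer_spans]
    if not clips:
        return 0.0
    # Clip-to-query terms.
    terms = [1.0 - cosine(c, query) for c in clips]
    # Clip-pair terms between answer segments.
    for i in range(len(clips)):
        for j in range(i + 1, len(clips)):
            terms.append(1.0 - cosine(clips[i], clips[j]))
    return sum(terms) / len(terms)

def boundary_l1(pred, target):
    """L1 distance between predicted and ground-truth (start, end) times."""
    return abs(pred[0] - target[0]) + abs(pred[1] - target[1])

def total_loss(align, main_pred, aux_pred, target, aux_weight=0.5):
    """Assumed combination: alignment + main boundary + weighted auxiliary boundary."""
    return align + boundary_l1(main_pred, target) + aux_weight * boundary_l1(aux_pred, target)
```

In this sketch, two answer segments whose pooled features both point in the query's direction give an alignment loss of zero, while a span covering a visually different snippet raises the loss through both the clip-to-query and the clip-pair terms. That is the intuition behind considering answer segments jointly rather than snippet by snippet.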