Hanyang | Mar 5, 2026 | arXiv:2603.05437

SAIL: Similarity-Aware Guidance and Inter-Caption Augmentation-based Learning for Weakly-Supervised Dense Video Captioning

Ye-Chan Kim, SeungJu Cha, Si-Woo Kim, MinJu Jeon, Hyun-Gi Kim, Hyungee Kim, Dong-Jin Kim

AI Summary

The paper introduces SAIL, a novel approach to Weakly-Supervised Dense Video Captioning (WSDVC) that improves event localization and description by constructing semantically-aware masks using cross-modal alignment between video regions and event captions. SAIL employs a similarity-aware training objective to guide masks towards video regions highly similar to their captions, addressing the limitations of uniformly distributed masks in prior work. To combat data sparsity, SAIL leverages an LLM-based augmentation strategy to generate synthetic captions and incorporates them through an inter-mask mechanism, providing auxiliary guidance for precise temporal localization.

Key Contribution

LLM-augmented training with similarity-aware masking lets weakly-supervised video captioning models generate more accurate event descriptions and temporal boundaries, even with sparse training data.

Abstract

Weakly-Supervised Dense Video Captioning aims to localize and describe events in videos while training only on caption annotations, without temporal boundaries. Prior work introduced an implicit supervision paradigm based on Gaussian masking and complementary captioning. However, the existing method focuses merely on generating non-overlapping masks without considering their semantic relationship to the corresponding events, resulting in simplistic, uniformly distributed masks that fail to capture semantically meaningful regions. Moreover, relying solely on ground-truth captions leads to sub-optimal performance due to the inherent sparsity of existing datasets. In this work, we propose SAIL, which constructs semantically-aware masks through cross-modal alignment. Our similarity-aware training objective guides masks to emphasize video regions with high similarity to their corresponding event captions. Furthermore, to guide more accurate mask generation under sparse annotation settings, we introduce an LLM-based augmentation strategy that generates synthetic captions to provide additional alignment signals. These synthetic captions are incorporated through an inter-mask mechanism, providing auxiliary guidance for precise temporal localization without degrading the main objective. Experiments on ActivityNet Captions and YouCook2 demonstrate state-of-the-art performance on both captioning and localization metrics.
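
To make the idea of a similarity-aware objective over Gaussian masks concrete, the following is a minimal sketch of one possible reading of the abstract. It is not the paper's formulation: the function names, the cosine-similarity measure, the softmax temperature, and the cross-entropy form of the loss are all assumptions chosen for illustration. The sketch builds a soft Gaussian mask over frames and penalizes it when its mass falls on frames that are dissimilar to the caption embedding.

```python
import math

def gaussian_mask(num_frames, center, width):
    """Soft temporal mask; center and width are in normalized [0, 1] time (assumed parameterization)."""
    mask = []
    for t in range(num_frames):
        pos = t / max(num_frames - 1, 1)  # normalized frame position
        mask.append(math.exp(-((pos - center) ** 2) / (2 * width ** 2)))
    return mask

def cosine(u, v):
    """Cosine similarity between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-8)

def softmax(xs, temperature=0.1):
    """Temperature-scaled softmax; the temperature value is a hypothetical choice."""
    m = max(xs)
    exps = [math.exp((x - m) / temperature) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def similarity_guidance_loss(frame_feats, caption_feat, center, width):
    """Cross-entropy between the frame-caption similarity distribution and the normalized mask.

    Hypothetical loss: it is low when the mask concentrates on frames
    that are most similar to the caption embedding.
    """
    sims = [cosine(f, caption_feat) for f in frame_feats]
    target = softmax(sims)
    mask = gaussian_mask(len(frame_feats), center, width)
    total = sum(mask)
    mask_dist = [m / total for m in mask]
    return -sum(p * math.log(q + 1e-8) for p, q in zip(target, mask_dist))

# Toy example: the last two frames match the caption, the first three do not.
frames = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
caption = [0.0, 1.0]
loss_aligned = similarity_guidance_loss(frames, caption, center=0.875, width=0.2)
loss_misaligned = similarity_guidance_loss(frames, caption, center=0.125, width=0.2)
```

In this toy setup, a mask centered on the caption-matching frames yields a lower loss than one centered on the unrelated frames, which is the behavior the abstract attributes to similarity-aware guidance. In a real model the mask center and width would be predicted by a network and the loss minimized by gradient descent.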

Computer Vision, Multimodal Models, Natural Language Processing

Citation Metrics

Citations: 0

Influential citations: 0

References: 43

Year: 2026

Venue: N/A
