May 27, 2026 | arXiv:2605.28604

Mining Multi-Modality Spatio-Temporal Cues for Video Important Person Identification

Xiao Wang, Minglei Yang, Wenke Huang, Zheng Wang, Mang Ye

AI Summary

This paper introduces the Video Important Person (VIP) identification task and addresses Temporal Importance Shift (TIS), where an individual's apparent importance changes as more of the video is taken into account. The authors built Temporal-VIP, a large-scale, rationale-annotated dataset, and developed VIP-Net, a framework that uses a Social Cue Encoder (SCE) for multi-modal spatio-temporal cue extraction and a Temporal Importance Rectifier (TIR) for hierarchical cue fusion. VIP-Net achieves 67.3% accuracy in identifying important individuals, compared with 37.5%-53.9% for existing state-of-the-art models, and its generated rationales reach a mean similarity of 0.63 to ground truth.

Key Contribution

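Existing important-person methods rely on static images and immediate visual cues, so they are exposed to Temporal Importance Shift: a person who looks important in early frames may lose that status once the whole video is considered. The paper contributes a new task, a rationale-annotated dataset, and a model that uses multi-modal spatio-temporal cues and rationale generation to mitigate the problem.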

Abstract

Identifying key individuals in video scenes is essential for applications such as automated video editing and intelligent surveillance. Current methods primarily focus on static images and immediate visual cues, overlooking the rich spatio-temporal information in videos. This leads to the phenomenon of Temporal Importance Shift (TIS), wherein individuals deemed significant in early frames may be demoted as the entire temporal context is considered. To address this, we introduce the Video Important Person (VIP) identification task, aimed at automatically identifying the most influential individuals in videos while providing textual rationales. We present Temporal-VIP, a large-scale rationale-annotated dataset consisting of 9,249 video segments across 11 categories with aligned importance rationales. To mitigate TIS, we develop the VIP-Net framework, which includes a Social Cue Encoder (SCE) for extracting multi-modal spatio-temporal cues, a Temporal Importance Rectifier (TIR) for hierarchical cue fusion and cross-modal alignment, and VIP Inference for ranking individuals. Experimental results show that VIP-Net achieves 67.3% accuracy, significantly outperforming state-of-the-art models (37.5%-53.9%) and yielding a mean rationale similarity of 0.63 to ground truth through feature-guided LLM refinement. The dataset and code are available at https://huggingface.co/datasets/yml2002/Temporal-VIP.
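As a rough illustration of Temporal Importance Shift, the toy sketch below ranks people by per-frame importance scores, first using only the early frames and then using the whole clip. It is not the VIP-Net method: the scores, the person IDs, the `rank_people` helper, and the use of a plain mean as the temporal aggregator are all assumptions made for this example, standing in for the learned SCE and TIR components described in the abstract.

```python
from typing import Dict, List, Optional


def rank_people(scores: Dict[str, List[float]], upto: Optional[int] = None) -> List[str]:
    """Rank person IDs by mean per-frame importance.

    `upto` limits the ranking to the first `upto` frames; None uses all frames.
    A plain mean is an assumed stand-in for a learned temporal model.
    """
    def mean_score(values: List[float]) -> float:
        window = values if upto is None else values[:upto]
        if not window:
            raise ValueError("need at least one frame to rank")
        return sum(window) / len(window)

    return sorted(scores, key=lambda pid: mean_score(scores[pid]), reverse=True)


# Hypothetical per-frame importance scores for two people in a six-frame clip.
frame_scores = {
    "person_a": [0.9, 0.8, 0.3, 0.2, 0.1, 0.1],  # prominent early, fades later
    "person_b": [0.4, 0.5, 0.8, 0.9, 0.9, 0.9],  # becomes central over time
}

early = rank_people(frame_scores, upto=2)  # ranking from the first two frames only
full = rank_people(frame_scores)           # ranking from the full temporal context

# A shift occurs when the top-ranked person changes once all frames are seen.
shift = early[0] != full[0]
print(f"early top: {early[0]}, full top: {full[0]}, shift: {shift}")
```

In this toy case the early-frame ranking puts `person_a` first, while the full-clip ranking puts `person_b` first, which is the kind of demotion the abstract describes.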

Computer Vision, Multimodal Models

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
