HUST · La Trobe University · Apr 22, 2026 · arXiv:2604.20311

Seeing Further and Wider: Joint Spatio-Temporal Enlargement for Micro-Video Popularity Prediction

Dali Wang, Yunyao Zhang, Junqing Yu, Yi-Ping Phoebe Chen, Chenzhong Xu, Zikai Song

AI Summary

This paper introduces a unified framework for micro-video popularity prediction (MVPP) that addresses limitations in existing approaches by jointly enlarging the temporal and spatial dimensions of video analysis. The Temporal Enlargement module uses a frame scoring mechanism to extract highlight cues from both sparse and dense sampling, enabling robust understanding of long video sequences. The Spatial Enlargement module constructs a Topology-Aware Memory Bank that hierarchically clusters relevant historical content, allowing for unbounded historical association without unbounded storage growth. The proposed method outperforms 11 baselines on three MVPP benchmarks, demonstrating improvements in prediction accuracy and ranking consistency.

Key Contribution

Achieve unbounded historical video association for popularity prediction without unbounded storage growth by clustering videos in a topology-aware memory bank and updating cluster features instead of storing individual videos.
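The idea of updating cluster features instead of storing individual videos can be illustrated with a minimal sketch. This is not the paper's implementation: the class name `ClusterMemory`, the cosine-similarity assignment, and the running-mean update are all assumptions chosen for illustration, and the sketch uses a flat set of clusters rather than the hierarchical, topology-aware structure the paper describes.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


class ClusterMemory:
    """Hypothetical fixed-size memory: one feature vector and one count per cluster."""

    def __init__(self, centroids):
        # Assumption: initial cluster features are given; each counts as one sample.
        self.centroids = [list(c) for c in centroids]
        self.counts = [1] * len(self.centroids)

    def nearest(self, feat):
        """Index of the cluster whose feature is most similar to `feat`."""
        return max(range(len(self.centroids)),
                   key=lambda i: cosine(self.centroids[i], feat))

    def add(self, feat):
        """Fold a new video feature into its nearest cluster via a running mean.

        The video itself is not stored, so memory stays constant no matter
        how many videos are added.
        """
        i = self.nearest(feat)
        self.counts[i] += 1
        n = self.counts[i]
        self.centroids[i] = [c + (x - c) / n
                             for c, x in zip(self.centroids[i], feat)]
        return i


bank = ClusterMemory([[1.0, 0.0], [0.0, 1.0]])
bank.add([1.0, 0.2])  # absorbed by the first cluster; no new entry is created
```

However many videos pass through `add`, the bank in this sketch keeps exactly as many entries as it started with, which is the storage property the key contribution refers to.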

Abstract

Micro-video popularity prediction (MVPP) aims to forecast the future popularity of videos on online media, which is essential for applications such as content recommendation and traffic allocation. In real-world scenarios, it is critical for MVPP approaches to understand both the temporal dynamics of a given video (temporal) and its historical relevance to other videos (spatial). However, existing approaches suffer from limitations in both dimensions: temporally, they rely on sparse short-range sampling that restricts content perception; spatially, they depend on flat retrieval memory with limited capacity and low efficiency, hindering scalable knowledge utilization. To overcome these limitations, we propose a unified framework that achieves joint spatio-temporal enlargement, enabling precise perception of extremely long video sequences while supporting a scalable memory bank that can infinitely expand to incorporate all relevant historical videos. Technically, we employ a Temporal Enlargement module driven by a frame scoring module that extracts highlight cues from video frames through two complementary pathways: sparse sampling and dense perception. Their outputs are adaptively fused to enable robust long-sequence content understanding. For Spatial Enlargement, we construct a Topology-Aware Memory Bank that hierarchically clusters historically relevant content based on topological relationships. Instead of directly expanding memory capacity, we update the encoder features of the corresponding clusters when incorporating new videos, enabling unbounded historical association without unbounded storage growth. Extensive experiments on three widely used MVPP benchmarks demonstrate that our method consistently outperforms 11 strong baselines across mainstream metrics, achieving robust improvements in both prediction accuracy and ranking consistency.
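The two-pathway temporal design described in the abstract can be sketched roughly as follows. The abstract does not specify how frames are scored, how each pathway aggregates them, or how the fusion is computed, so everything here is an assumption: the function names, the top-k averaging for the sparse pathway, the softmax-weighted average for the dense pathway, and the scalar `gate` used for fusion are hypothetical stand-ins, not the authors' method.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of frame scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def sparse_path(frames, scores, k):
    """Assumed sparse pathway: average the k highest-scoring frames (k >= 1)."""
    top = sorted(range(len(frames)), key=lambda i: scores[i], reverse=True)[:k]
    dim = len(frames[0])
    return [sum(frames[i][d] for i in top) / len(top) for d in range(dim)]


def dense_path(frames, scores):
    """Assumed dense pathway: score-weighted average over every frame."""
    w = softmax(scores)
    dim = len(frames[0])
    return [sum(w[i] * frames[i][d] for i in range(len(frames)))
            for d in range(dim)]


def fuse(sparse, dense, gate):
    """Assumed fusion: convex combination with a gate in [0, 1].

    In a trained model the gate would presumably be predicted from the
    inputs; here it is passed in directly to keep the sketch small.
    """
    return [gate * s + (1.0 - gate) * d for s, d in zip(sparse, dense)]


# Toy input: three 2-dimensional frame features with hand-picked scores.
frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
scores = [2.0, 0.0, 1.0]
video_feature = fuse(sparse_path(frames, scores, k=2),
                     dense_path(frames, scores),
                     gate=0.5)
```

The sparse pathway in this sketch looks only at the few highest-scoring frames, while the dense pathway lets every frame contribute in proportion to its score; the gate decides how much each view influences the final video feature.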

Computer Vision · Recommendation & Information Retrieval

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
