Kuaishou | Apr 28, 2026 | arXiv:2604.25834

Action-Aware Generative Sequence Modeling for Short Video Recommendation

Wenhao Li, Zihan Lin, Zhengxiao Guo, Jie Zhou, Shukai Liu, Yongqi Liu, Chuan Luo, Chao Ma, Chaoyi Ma, Ruiming Tang, Han Li

AI Summary

This paper introduces the Action-Aware Generative Sequence Network (A2Gen), which enhances short video recommendations by modeling user actions as temporal sequences rather than treating videos as single entities. By employing a Context-aware Attention Module (CAM) and a Hierarchical Sequence Encoder (HSE), the model captures nuanced user preferences and action patterns, leading to improved prediction accuracy. Extensive offline and online experiments demonstrate that A2Gen significantly boosts user engagement metrics, including a 0.34% increase in watch time and an 8.1% rise in interaction rates on a platform serving over 400 million users daily.

Key Contribution

A2Gen models user actions on short videos as temporal sequences rather than as a single per-video label, yielding online gains of 0.34% in watch time, 8.1% in interaction rate, and 0.162% in 7-day retention (LifeTime-7).

Abstract

With the rapid development of the Internet, users have increasingly higher expectations for the recommendation accuracy of online content consumption platforms. However, short videos often contain diverse segments, and users may not hold the same attitude toward all of them. Traditional binary-classification recommendation models, which treat a video as a single holistic entity, face limitations in accurately capturing such nuanced preferences. Considering that user consumption is a temporal process, this paper demonstrates, through statistical analysis and examination of action patterns, that the timing of user actions can represent diverse intentions. Based on this insight, we propose a novel modeling paradigm: Action-Aware Generative Sequence Network (A2Gen), which refines user actions along the temporal dimension and chains them into sequences for unified processing and prediction. First, we introduce the Context-aware Attention Module (CAM) to model action sequences enriched with item-specific contextual features. Building upon this, we develop the Hierarchical Sequence Encoder (HSE) to learn temporal action patterns from users' historical actions. Finally, leveraging CAM, we design a module for action sequence generation: the Action-seq Autoregressive Generator (AAG). Extensive offline experiments on Kuaishou's dataset and the public Tmall dataset demonstrate the superiority of our proposed model. Furthermore, in large-scale online A/B testing deployed on Kuaishou's platform, our model achieves significant improvements over baseline methods in multi-task prediction by leveraging sequential information. Specifically, it yields increases of 0.34% in user watch time, 8.1% in interaction rate, and 0.162% in overall user retention (LifeTime-7), leading to successful deployment across all traffic, serving over 400 million users every day.
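To make the idea of "chaining actions into sequences" concrete, here is a minimal sketch of how timestamped actions on one video might be turned into a time-ordered token sequence, in contrast to a single binary label per video. It is not the authors' implementation: the action vocabulary, the `Action` type, the `bucket` and `to_sequence` helpers, the four-bucket discretization, and the `<eos>` token are all assumptions introduced for illustration.

```python
from dataclasses import dataclass

# Hypothetical action vocabulary; the abstract does not enumerate action types.
ACTIONS = ("like", "comment", "share", "skip")


@dataclass(frozen=True)
class Action:
    kind: str          # one of ACTIONS (assumed)
    at_seconds: float  # playback position at which the action occurred


def bucket(at_seconds: float, duration: float, n_buckets: int = 4) -> int:
    """Map a playback position to a coarse time bucket in [0, n_buckets - 1]."""
    if duration <= 0:
        raise ValueError("duration must be positive")
    # Clamp to [0, 1] so positions outside the video still get a valid bucket.
    frac = min(max(at_seconds / duration, 0.0), 1.0)
    return min(int(frac * n_buckets), n_buckets - 1)


def to_sequence(actions, duration: float, n_buckets: int = 4) -> list[str]:
    """Chain actions into a time-ordered token sequence ending with <eos>."""
    for a in actions:
        if a.kind not in ACTIONS:
            raise ValueError(f"unknown action: {a.kind}")
    ordered = sorted(actions, key=lambda a: a.at_seconds)
    tokens = [f"{a.kind}@{bucket(a.at_seconds, duration, n_buckets)}" for a in ordered]
    return tokens + ["<eos>"]


if __name__ == "__main__":
    # A user likes a 30-second video early on and shares it near the end.
    watched = [Action("share", 27.0), Action("like", 3.0)]
    print(to_sequence(watched, duration=30.0))
```

Under this encoding, an early like and a late share produce different tokens than the same two actions at other moments, which is the kind of timing signal a binary "interacted or not" label discards. A sequence model could then be trained to predict such tokens one at a time.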

Architecture Design (Transformers, SSMs, MoE) | Recommendation & Information Retrieval

Citation Metrics

Citations: 0

Influential citations: 0

References: 42

Year: 2026

Venue: N/A
