Kwai Summary Attention (KSA) is introduced as a novel attention mechanism that compresses long-context information into a smaller set of learnable summary tokens, reducing the KV cache by a compression ratio *k*. Unlike methods that minimize the KV cache outright or interleave attention with other architectures, KSA keeps the KV cache linear in sequence length at an O(n/k) scale, trading acceptable memory costs for complete long-range dependency retention. Experiments demonstrate the effectiveness of KSA in capturing long-range dependencies while reducing computational costs.
Long-context attention cost can be cut by a factor of *k* without sacrificing complete long-range dependency retention, thanks to learnable summary tokens that compress context.
Long-context ability has become one of the most important iteration directions for next-generation Large Language Models, particularly in semantic understanding/reasoning, agentic code intelligence, and recommendation systems. However, standard softmax attention exhibits quadratic time complexity with respect to sequence length; as sequences grow, this incurs substantial overhead in long-context settings, and the training and inference costs of extremely long sequences deteriorate rapidly. Existing solutions mitigate this issue through two technical routes: i) reducing the KV cache per layer, such as head-level compression (GQA) and embedding-dimension-level compression (MLA), although the KV cache remains linearly dependent on the sequence length at a 1:1 ratio; ii) interleaving with KV-cache-friendly architectures, such as local attention (SWA) and linear kernels (GDN), which often involve trade-offs between KV cache size and long-context modeling effectiveness. Beyond these two routes, we argue that there exists an intermediate path that is not well explored: maintaining a linear relationship between the KV cache and sequence length, but performing semantic-level compression at a specific ratio $k$. This $O(n/k)$ path does not pursue a "minimum KV cache", but rather trades acceptable memory costs for complete, referential, and interpretable retention of long-distance dependencies. Motivated by this, we propose Kwai Summary Attention (KSA), a novel attention mechanism that reduces sequence modeling cost by compressing historical contexts into learnable summary tokens.
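To make the summary-token idea concrete, here is a minimal PyTorch sketch of one way such a mechanism could look. It is an illustrative assumption, not the paper's actual KSA implementation: the module name `SummaryAttentionSketch`, the block-wise pooling via a single learnable query, and the absence of causal masking are all simplifications. The only property it is meant to demonstrate is that each block of $k$ historical tokens collapses into one summary token, so the cached context shrinks from $n$ entries to $\lceil n/k \rceil$.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SummaryAttentionSketch(nn.Module):
    """Illustrative sketch: compress each block of k historical tokens into one
    summary token via a learnable query, then let current tokens attend over
    (summaries + current tokens). The history's KV cache scales as O(n/k)."""

    def __init__(self, d_model: int, n_heads: int, k: int):
        super().__init__()
        self.k = k                                               # compression ratio
        self.summary_query = nn.Parameter(torch.randn(1, 1, d_model) * 0.02)
        self.pool_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.main_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def summarize(self, history: torch.Tensor) -> torch.Tensor:
        # history: (batch, n, d) -> summaries: (batch, ceil(n/k), d)
        b, n, d = history.shape
        pad = (-n) % self.k
        if pad:
            history = F.pad(history, (0, 0, 0, pad))             # pad n to a multiple of k
        blocks = history.view(b, -1, self.k, d)                  # (b, n_blocks, k, d)
        n_blocks = blocks.size(1)
        kv = blocks.reshape(b * n_blocks, self.k, d)
        q = self.summary_query.expand(b * n_blocks, 1, d)
        summary, _ = self.pool_attn(q, kv, kv)                   # one summary token per block
        return summary.view(b, n_blocks, d)

    def forward(self, current: torch.Tensor, history: torch.Tensor) -> torch.Tensor:
        # current: (b, m, d) recent tokens; history: (b, n, d) long past context
        context = torch.cat([self.summarize(history), current], dim=1)
        out, _ = self.main_attn(current, context, context)       # attend to compressed past
        return out

# Example: with n = 65_536 past tokens and k = 16, the cached history holds
# 4_096 summary entries instead of 65_536 full key/value entries.
layer = SummaryAttentionSketch(d_model=256, n_heads=8, k=16)
out = layer(torch.randn(2, 128, 256), torch.randn(2, 1024, 256))
print(out.shape)  # torch.Size([2, 128, 256])
```

In this sketch the cost of attending to the past drops by the factor $k$ while every block of the history still contributes a token that later queries can reference, which is the trade-off the abstract describes; how KSA actually parameterizes and trains its summary tokens is specified in the paper itself.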