The paper introduces HyTRec, a hybrid attention architecture for long sequence recommendation that combines linear and softmax attention mechanisms to balance efficiency and retrieval precision. HyTRec assigns long-term historical sequences to a linear attention branch and recent interactions to a softmax attention branch, mitigating the limitations of each approach when used alone. The authors further propose a Temporal-Aware Delta Network (TADN) to dynamically adjust the weights of historical behaviors, emphasizing recent signals and suppressing noise, which leads to significant improvements in Hit Rate, especially for users with ultra-long sequences.
By cleverly combining linear and softmax attention, HyTRec achieves state-of-the-art recommendation accuracy on long sequences while maintaining linear inference speed, resolving a key tradeoff in the field.
Modeling long sequences of user behaviors has emerged as a critical frontier in generative recommendation. However, existing solutions face a dilemma: linear attention mechanisms achieve efficiency at the cost of retrieval precision due to limited state capacity, while softmax attention suffers from prohibitive computational overhead. To address this challenge, we propose HyTRec, a model featuring a Hybrid Attention architecture that explicitly decouples long-term stable preferences from short-term intent spikes. By assigning massive historical sequences to a linear attention branch and reserving a specialized softmax attention branch for recent interactions, our approach restores precise retrieval capability in industrial-scale contexts involving tens of thousands of interactions. To mitigate the lag in capturing rapid interest drifts within the linear layers, we further design a Temporal-Aware Delta Network (TADN) that dynamically upweights fresh behavioral signals while suppressing historical noise. Empirical results on industrial-scale datasets confirm that HyTRec maintains linear inference speed while outperforming strong baselines, delivering over an 8% improvement in Hit Rate for users with ultra-long sequences.
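The abstract's core routing idea can be illustrated with a minimal sketch: the bulk of the history flows through a kernelized linear-attention branch (constant-size state, linear cost), while only the most recent items get exact softmax attention. The feature map, the recency window size, and the equal-weight fusion below are all illustrative assumptions, not details taken from the paper, and the TADN component is not modeled here.

```python
import numpy as np

def softmax_attention(q, K, V):
    """Exact scaled dot-product attention over a short recent window."""
    scores = K @ q / np.sqrt(q.shape[0])
    w = np.exp(scores - scores.max())      # numerically stable softmax
    w /= w.sum()
    return w @ V

def linear_attention(q, K, V):
    """Kernelized linear attention: the whole history is compressed into a
    fixed-size state, so cost is linear in length but capacity is bounded."""
    phi = lambda x: np.maximum(x, 0.0) + 1e-6  # positive feature map (assumption)
    S = phi(K).T @ V                           # (d, d) running state over history
    z = phi(K).sum(axis=0)                     # normalizer accumulator
    return (phi(q) @ S) / (phi(q) @ z + 1e-6)

def hybrid_attention(q, K, V, recent=64):
    """Route long-tail history to the linear branch and the last `recent`
    interactions to the softmax branch, then fuse (equal weights assumed)."""
    long_out = linear_attention(q, K[:-recent], V[:-recent])
    short_out = softmax_attention(q, K[-recent:], V[-recent:])
    return 0.5 * (long_out + short_out)
```

This shows the tradeoff the abstract describes: the linear branch never stores per-item keys, so memory stays constant no matter how long the history grows, while exact retrieval precision is preserved only inside the recent softmax window.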