Xiaohongshu · Feb 12, 2026 · arXiv:2602.11562

LASER: An Efficient Target-Aware Segmented Attention Framework for End-to-End Long Sequence Modeling

Tianhe Lin, Baoyuan Ou, Yingjie Qin, Lai Xu, Yao Hu, Zhiyong Wang, Yubin Xu

AI Summary

The paper introduces LASER, a full-stack optimization framework for efficient long-sequence modeling in recommendation systems that addresses both I/O and computational bottlenecks. On the systems side, SeqVault uses a hybrid DRAM-SSD indexing strategy to reduce retrieval latency. On the algorithmic side, Segmented Target Attention (STA) applies a sigmoid-based gating strategy to filter noisy items, and a Global Stacked Target Attention (GSTA) module then refines the compressed segments, reducing computational complexity. In online A/B testing, LASER achieved a 2.36% lift in ADVV and a 2.08% lift in revenue.

Key Contribution

Xiaohongshu's LASER framework halves retrieval latency and lifts revenue by 2.08% in a production recommendation system by combining a segmented target attention mechanism with a hybrid DRAM-SSD indexing strategy.

Abstract

Modeling ultra-long user behavior sequences is pivotal for capturing evolving and lifelong interests in modern recommendation systems. However, deploying such models in real-time industrial environments faces a strict "Latency Wall", constrained by two distinct bottlenecks: the high I/O latency of retrieving massive user histories and the quadratic computational complexity of standard attention mechanisms. To break these bottlenecks, we present LASER, a full-stack optimization framework developed and deployed at Xiaohongshu (RedNote). Our approach tackles the challenges through two complementary innovations: (1) System efficiency: We introduce SeqVault, a unified schema-aware serving infrastructure for long user histories. By implementing a hybrid DRAM-SSD indexing strategy, SeqVault reduces retrieval latency by 50% and CPU usage by 75%, ensuring millisecond-level access to full real-time and life-cycle user histories. (2) Algorithmic efficiency: We propose a Segmented Target Attention (STA) mechanism to address the computational overhead. Motivated by the inherent sparsity of user interests, STA employs a sigmoid-based gating strategy that acts as a silence mechanism to filter out noisy items. Subsequently, a lightweight Global Stacked Target Attention (GSTA) module refines these compressed segments to capture cross-segment dependencies without incurring high computational costs. This design performs effective sequence compression, reducing the complexity of long-sequence modeling while preserving critical signals. Extensive offline evaluations demonstrate that LASER consistently outperforms state-of-the-art baselines. In large-scale online A/B testing serving over 100 million daily active users, LASER achieved a 2.36% lift in ADVV and a 2.08% lift in revenue, demonstrating its scalability and significant commercial impact.
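The abstract does not give the STA or GSTA equations, so the following is only a minimal sketch of the general idea it describes: gate each item in a segment with a sigmoid of its similarity to the target, pool the gated items into one vector per segment, then attend over the much shorter list of segment vectors. The function names, the scaled dot-product score, the mean pooling, and the single softmax layer standing in for GSTA are all assumptions made for illustration, not the authors' formulation.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def compress_segment(target, segment):
    """Assumed gating: each item gets an independent sigmoid gate from its
    scaled dot product with the target, then gated items are mean-pooled."""
    scale = math.sqrt(len(target))
    gates = [sigmoid(dot(target, item) / scale) for item in segment]
    return [
        sum(g * item[d] for g, item in zip(gates, segment)) / len(segment)
        for d in range(len(target))
    ]

def segmented_target_attention(target, sequence, segment_len):
    """Split the sequence into fixed-size segments, compress each one, then
    run one softmax target-attention pass over the segment summaries
    (a hypothetical stand-in for the GSTA refinement step)."""
    if not sequence:
        raise ValueError("sequence must not be empty")
    segments = [
        sequence[i:i + segment_len] for i in range(0, len(sequence), segment_len)
    ]
    summaries = [compress_segment(target, seg) for seg in segments]
    scale = math.sqrt(len(target))
    weights = softmax([dot(target, s) / scale for s in summaries])
    output = [
        sum(w * s[d] for w, s in zip(weights, summaries))
        for d in range(len(target))
    ]
    return output, summaries
```

Because sigmoid gates are computed independently and need not sum to one, every item in a segment can be pushed toward zero at once, which is one way to read the "silence mechanism" in the abstract. In this sketch the second stage attends over `len(sequence) / segment_len` summaries rather than over every item.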

Architecture Design (Transformers, SSMs, MoE) · Inference & Quantization · Recommendation & Information Retrieval

Citation Metrics

Citations: 0

Influential citations: 0

References: 22

Year: 2026

Venue: N/A
