The paper introduces SLA2, a novel sparse-linear attention mechanism designed to improve the efficiency and accuracy of attention computation in video diffusion models. SLA2 addresses limitations of the original SLA by incorporating a learnable router that dynamically allocates computation between the sparse and linear attention branches, and by introducing a learnable ratio that combines the two branches more effectively. SLA2 also employs quantization-aware fine-tuning to enable low-bit attention, further reducing computational cost while maintaining generation quality.
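The routing-and-mixing idea can be pictured with a short PyTorch sketch. This is an illustrative reconstruction under stated assumptions, not the paper's implementation: the block size, the router design (a linear layer over pooled block features), the hard 0.5 threshold, and the sigmoid-parameterized mixing ratio `alpha` are all choices made here for clarity.

```python
# Illustrative sketch of SLA2-style routing and mixing (not the authors' code).
# Assumed names: SLA2AttentionSketch, block_size, alpha.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SLA2AttentionSketch(nn.Module):
    def __init__(self, dim, block_size=64):
        super().__init__()
        self.block_size = block_size
        # Router: scores pooled block features to decide sparse vs. linear.
        self.router = nn.Linear(dim, 1)
        # Learnable ratio that mixes the sparse and linear branch outputs.
        self.alpha = nn.Parameter(torch.tensor(0.5))

    def linear_attention(self, q, k, v):
        # Kernelized linear attention: phi(q) (phi(k)^T v), O(N * d^2).
        q, k = F.elu(q) + 1, F.elu(k) + 1
        kv = torch.einsum("bnd,bne->bde", k, v)
        z = 1.0 / (torch.einsum("bnd,bd->bn", q, k.sum(dim=1)) + 1e-6)
        return torch.einsum("bnd,bde,bn->bne", q, kv, z)

    def sparse_attention(self, q, k, v, keep_mask):
        # Dense attention with non-selected blocks masked out; a real kernel
        # would skip the masked blocks entirely to realize the speedup.
        scores = torch.einsum("bnd,bmd->bnm", q, k) / q.shape[-1] ** 0.5
        scores = scores.masked_fill(~keep_mask, float("-inf"))
        return torch.softmax(scores, dim=-1).nan_to_num(0.0) @ v

    def forward(self, q, k, v):
        # q, k, v: (batch, seq_len, dim); seq_len assumed divisible by block_size.
        B, N, D = q.shape
        bs = self.block_size
        # Router scores each (query-block, key-block) pair.
        q_blocks = q.view(B, N // bs, bs, D).mean(dim=2)
        k_blocks = k.view(B, N // bs, bs, D).mean(dim=2)
        logits = self.router(q_blocks).squeeze(-1)[:, :, None] + \
                 self.router(k_blocks).squeeze(-1)[:, None, :]
        # Hard routing for illustration only; training would need a
        # differentiable relaxation (e.g. a straight-through estimator).
        keep = torch.sigmoid(logits) > 0.5
        keep_mask = keep.repeat_interleave(bs, 1).repeat_interleave(bs, 2)
        out_sparse = self.sparse_attention(q, k, v, keep_mask)
        out_linear = self.linear_attention(q, k, v)
        a = torch.sigmoid(self.alpha)  # learnable mixing ratio in (0, 1)
        return a * out_sparse + (1 - a) * out_linear
```

The efficiency argument is that only the blocks kept by the router pay the quadratic sparse-attention cost (97% of blocks are dropped at the reported sparsity level), while everything else is covered by the O(N) linear branch.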
Achieve an 18.6x attention speedup at 97% attention sparsity in video diffusion models by learning how to route and combine sparse and linear attention, outperforming heuristic splits.
Sparse-Linear Attention (SLA) combines sparse and linear attention to accelerate diffusion models and has shown strong performance in video generation. However, (i) SLA relies on a heuristic split that assigns computations to the sparse or linear branch based on attention-weight magnitude, which can be suboptimal. Additionally, (ii) after formally analyzing the attention error in SLA, we identify a mismatch between SLA and a direct decomposition into sparse and linear attention. We propose SLA2, which introduces (I) a learnable router that dynamically selects whether each attention computation should use sparse or linear attention, (II) a more faithful and direct sparse-linear attention formulation that uses a learnable ratio to combine the sparse and linear attention branches, and (III) a sparse + low-bit attention design, where low-bit attention is introduced via quantization-aware fine-tuning to reduce quantization error. Experiments show that on video diffusion models, SLA2 can achieve 97% attention sparsity and deliver an 18.6x attention speedup while preserving generation quality.
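Component (III), low-bit attention introduced via quantization-aware fine-tuning, can be illustrated with a minimal fake-quantization sketch. This is an assumption-laden illustration rather than the paper's method: the per-tensor symmetric INT8 scheme, the straight-through estimator, and the function names are choices made here; the abstract only states that quantization-aware fine-tuning is used to reduce quantization error.

```python
# Minimal sketch of quantization-aware fine-tuning for low-bit attention
# (illustrative assumptions, not the paper's code).
import torch

def fake_quant_int8(x: torch.Tensor) -> torch.Tensor:
    # Per-tensor symmetric INT8 fake quantization.
    scale = x.abs().amax().clamp(min=1e-8) / 127.0
    q = torch.clamp(torch.round(x / scale), -127, 127) * scale
    # Straight-through estimator: forward uses the quantized value,
    # backward treats the quantizer as identity so gradients still flow.
    return x + (q - x).detach()

def low_bit_attention(q, k, v):
    # Attention computed on fake-quantized Q and K; during fine-tuning the
    # model adapts its weights to compensate for the quantization error.
    qq, kq = fake_quant_int8(q), fake_quant_int8(k)
    scores = qq @ kq.transpose(-2, -1) / q.shape[-1] ** 0.5
    return torch.softmax(scores, dim=-1) @ v
```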