This paper introduces VecAttention, a novel vector-wise sparse attention mechanism tailored for efficient long-context video processing in Transformers. By exploiting the observed vertical-vector sparsity pattern in video attention maps, VecAttention dynamically selects and processes only the most informative vectors. Experiments on video understanding and generation tasks demonstrate that VecAttention achieves significant speedups (up to 2.65x over full attention and 1.83x over SOTA sparse attention) while maintaining comparable accuracy.
Video Transformers can achieve near-full attention accuracy with significantly less compute by focusing only on informative vertical vectors.
Long-context video understanding and generation pose a significant computational challenge for Transformer-based video models due to the quadratic complexity of self-attention. While existing sparse attention methods employ coarse-grained patterns to improve efficiency, they typically incur redundant computation and suboptimal performance. To address this issue, we propose VecAttention, a novel vector-wise sparse attention framework that achieves superior accuracy-efficiency trade-offs for video models. We observe that video attention maps exhibit a strong vertical-vector sparse pattern, and further demonstrate that this pattern offers consistently better accuracy-sparsity trade-offs than existing coarse-grained sparse patterns. Based on this observation, VecAttention dynamically selects and processes only the informative vertical vectors, using a lightweight important-vector selection mechanism that minimizes memory-access overhead and an optimized vector sparse attention kernel. Comprehensive evaluations on video understanding (VideoMME, LongVideoBench, and VCRBench) and generation (VBench) tasks show that VecAttention delivers a 2.65× speedup over full attention and a 1.83× speedup over state-of-the-art sparse attention methods, with accuracy comparable to full attention. Our code is available at https://github.com/anminliu/VecAttention.
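The abstract's core idea can be illustrated with a toy sketch. A "vertical vector" corresponds to one column of the attention map, i.e. a single key position attended to by all queries. The NumPy code below is a minimal, hypothetical illustration of the general vector-wise sparsity idea, not the paper's actual selection rule or kernel: it estimates each key column's importance cheaply from mean-pooled queries, keeps only the top fraction of columns, and runs attention over that subset (the function name, pooling scheme, and `keep_ratio` parameter are all assumptions for illustration).

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def vector_sparse_attention(Q, K, V, keep_ratio=0.25, pool=8):
    """Toy vector-wise sparse attention (illustrative, not the paper's kernel).

    Estimates the importance of each key column from mean-pooled queries,
    keeps the top `keep_ratio` fraction of columns, and computes attention
    restricted to that subset of keys/values.
    """
    n, d = Q.shape
    # Cheap importance proxy: pooled queries scored against every key,
    # so the estimate costs O(n^2 / pool) instead of O(n^2).
    q_pooled = Q.reshape(n // pool, pool, d).mean(axis=1)   # (n/pool, d)
    scores = q_pooled @ K.T / np.sqrt(d)                    # (n/pool, n)
    importance = softmax(scores, axis=-1).sum(axis=0)       # (n,) per-column mass
    k = max(1, int(keep_ratio * n))
    idx = np.argsort(importance)[-k:]                       # top-k key columns
    # Attention computed only over the selected key/value vectors.
    attn = softmax(Q @ K[idx].T / np.sqrt(d), axis=-1)      # (n, k)
    return attn @ V[idx]                                    # (n, d)

rng = np.random.default_rng(0)
n, d = 64, 16
Q, K, V = rng.standard_normal((3, n, d))
out = vector_sparse_attention(Q, K, V, keep_ratio=0.25)
print(out.shape)  # (64, 16)
```

With `keep_ratio=0.25`, each query attends to only 16 of the 64 key columns, so the main attention matmul shrinks proportionally; a real implementation would fuse the selection and the sparse attention into a custom kernel to avoid the gather overhead.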