May 6, 2026 · arXiv:2605.04569

Lightning Unified Video Editing via In-Context Sparse Attention

Shitong Shao, Haopeng Li, Yingwei Song, Wenliang Zhong, Lichen Bai, Zeke Xie

AI Summary

This paper introduces In-context Sparse Attention (ISA), a sparse attention mechanism designed to mitigate the quadratic computational cost of in-context learning (ICL) for video editing. ISA exploits two observations: context tokens have lower saliency than source tokens, and query sharpness correlates with approximation error. It uses them to prune redundant context and to dynamically route each query to either full or sparse attention. The authors then build LIVEditor, a video editing model that combines ISA with a newly curated 1.7M dataset. LIVEditor achieves a roughly 60% reduction in attention latency while surpassing SOTA performance on multiple video editing benchmarks.
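
The summary does not specify how saliency is measured or how much context is kept. Below is a minimal sketch of saliency-based context pruning. It assumes saliency is the average softmax attention mass each context key receives from the queries, that context keys come first in the key sequence, and that the `keep` most salient context tokens are retained. `context_saliency`, `preselect_context`, and this top-k rule are hypothetical illustrations, not the authors' pre-selection strategy.

```python
import math
from typing import List

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def softmax(logits: List[float]) -> List[float]:
    # Subtract the max logit for numerical stability.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def context_saliency(queries: List[Vector], keys: List[Vector], num_context: int) -> List[float]:
    """Average attention mass each context key receives from the queries.

    Hypothetical layout: keys[:num_context] are context tokens and the
    remaining keys are source tokens.
    """
    scale = 1.0 / math.sqrt(len(keys[0]))
    saliency = [0.0] * num_context
    for q in queries:
        probs = softmax([dot(q, k) * scale for k in keys])
        for j in range(num_context):
            saliency[j] += probs[j] / len(queries)
    return saliency


def preselect_context(queries: List[Vector], keys: List[Vector], num_context: int, keep: int) -> List[int]:
    """Indices (in original order) of the `keep` most salient context tokens.

    Hypothetical top-k rule; the paper's actual selection criterion may differ.
    """
    sal = context_saliency(queries, keys, num_context)
    ranked = sorted(range(num_context), key=lambda j: sal[j], reverse=True)
    return sorted(ranked[:keep])
```

For example, `preselect_context([[2.0, 0.0]], [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.5, 0.5]], num_context=3, keep=2)` keeps context tokens 0 and 1 and drops token 2, the key least aligned with the query. A practical version would presumably estimate saliency on a subsample of queries so that the pre-selection itself stays cheap; that is also an assumption.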

Key Contribution

Achieves a near-lossless, roughly 60% reduction in attention latency for video editing by exploiting query sharpness to route attention dynamically.
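
As a quick check on what the headline number means, a ~60% cut in attention latency is about a 2.5x speedup of the attention module alone, since 1 / (1 - 0.6) = 2.5. How much of that carries over to end-to-end latency depends on the fraction of runtime spent in attention, which is not reported here. The sketch below applies Amdahl's law; `attention_share` is a hypothetical input, not a measured value from the paper.

```python
def attention_speedup(latency_reduction: float) -> float:
    """Speedup of the attention module alone, e.g. a 0.6 reduction -> 2.5x."""
    return 1.0 / (1.0 - latency_reduction)


def end_to_end_speedup(latency_reduction: float, attention_share: float) -> float:
    """Amdahl's law: only the `attention_share` fraction of runtime is accelerated.

    `attention_share` is a hypothetical input in [0, 1]; the paper reports
    attention-module latency only.
    """
    remaining = (1.0 - attention_share) + attention_share * (1.0 - latency_reduction)
    return 1.0 / remaining
```

If attention took half of the runtime (an illustrative figure only), `end_to_end_speedup(0.6, 0.5)` would give about 1.43x overall. The full 2.5x is reached only when attention dominates the runtime.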

Abstract

Video editing has evolved toward In-Context Learning (ICL) paradigms, yet the resulting quadratic attention costs create a critical computational bottleneck. In this work, we propose In-context Sparse Attention (ISA), the first near-lossless empirical sparse framework tailored for ICL video editing. Our design is grounded in two key insights: first, context tokens exhibit significantly lower saliency than source tokens; second, we theoretically prove and empirically validate that query sharpness correlates with approximation error. Motivated by these findings, ISA implements an efficient pre-selection strategy to prune redundant context, followed by a dynamic query grouping mechanism that routes high-error queries to full attention and low-error ones to a computationally efficient zeroth-order Taylor sparse attention. Furthermore, we build LIVEditor, a novel lightning video editing model based on ISA and on a proposed video-editing data pipeline that curates a high-quality 1.7M dataset. Extensive experiments demonstrate that LIVEditor achieves a ~60% reduction in attention-module latency while surpassing state-of-the-art methods across EditVerseBench, IVE-Bench, and VIE-Bench, delivering near-lossless acceleration without compromising visual fidelity.
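
The abstract does not define query sharpness or spell out the zeroth-order Taylor approximation. Below is a minimal sketch of query routing under two assumptions. First, the sparse path expands exp(q·k_j) around the mean key of each contiguous key block, exp(q·k_j) ≈ exp(q·k̄_b). Each block then collapses to one pooled key and one mean value, with an extra log(n_b) logit. Second, a query's sharpness is proxied by the maximum probability of its block-level attention distribution. This block-pooled attention is a stand-in for the paper's sparse attention, not a reproduction of it; `tau`, `block`, and all function names are hypothetical.

```python
import math
from typing import List, Optional, Tuple

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def softmax(logits: List[float]) -> List[float]:
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def mean(vectors: List[Vector]) -> Vector:
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def attend(q: Vector, keys: List[Vector], values: List[Vector],
           bias: Optional[List[float]] = None) -> Vector:
    """Scaled dot-product attention for one query, with an optional additive logit bias."""
    scale = 1.0 / math.sqrt(len(q))
    bias = bias if bias is not None else [0.0] * len(keys)
    probs = softmax([dot(q, k) * scale + b for k, b in zip(keys, bias)])
    return [sum(p * v[c] for p, v in zip(probs, values)) for c in range(len(values[0]))]


def pool_blocks(keys: List[Vector], values: List[Vector],
                block: int) -> Tuple[List[Vector], List[Vector], List[float]]:
    """Zeroth-order Taylor expansion of exp(q.k_j) around each block's mean key.

    exp(q.k_j) ~= exp(q.k_mean) for every j in a block, so the block collapses
    to one pooled key, the mean value, and a log(block size) logit bias.
    """
    pooled_k, pooled_v, log_n = [], [], []
    for start in range(0, len(keys), block):
        kb, vb = keys[start:start + block], values[start:start + block]
        pooled_k.append(mean(kb))
        pooled_v.append(mean(vb))
        log_n.append(math.log(len(kb)))
    return pooled_k, pooled_v, log_n


def routed_attention(queries: List[Vector], keys: List[Vector], values: List[Vector],
                     block: int = 2, tau: float = 0.5) -> Tuple[List[Vector], List[str]]:
    """Route sharp queries to full attention and flat ones to the pooled path.

    Hypothetical sharpness proxy: max probability of the block-level attention
    distribution, assumed to grow with the pooled path's approximation error.
    """
    pooled_k, pooled_v, log_n = pool_blocks(keys, values, block)
    scale = 1.0 / math.sqrt(len(keys[0]))
    outputs, routes = [], []
    for q in queries:
        block_probs = softmax([dot(q, k) * scale + b for k, b in zip(pooled_k, log_n)])
        if max(block_probs) > tau:
            outputs.append(attend(q, keys, values))  # sharp query: exact full attention
            routes.append("full")
        else:
            outputs.append(attend(q, pooled_k, pooled_v, bias=log_n))  # flat query: pooled
            routes.append("sparse")
    return outputs, routes
```

Setting `tau=0.0` sends every query to full attention, and `tau=1.0` sends every query to the pooled path. When all keys within a block are identical, the zeroth-order expansion is exact and both paths return the same output. In this sketch the pooled path computes one dot product per block rather than one per key, which is where any savings would come from.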

Architecture Design (Transformers, SSMs, MoE) · Computer Vision · Training Efficiency & Optimization

Citation Metrics

Citations: 0

Influential citations: 0

References: 46

Year: 2026

Venue: N/A
