This paper introduces In-context Sparse Attention (ISA), a sparse attention mechanism designed to reduce the quadratic attention cost of in-context learning for video editing. ISA builds on two observations: context tokens have lower saliency than source tokens, and query sharpness correlates with approximation error. It uses them to prune redundant context and to dynamically route each query to either full or sparse attention. The authors then build LIVEditor, a video editing model based on ISA and a new 1.7M dataset, which achieves a roughly 60% reduction in attention latency while surpassing state-of-the-art performance on multiple video editing benchmarks.
Achieve a near-lossless reduction of roughly 60% in attention latency for video editing by using query sharpness to route attention dynamically.
Video editing has evolved toward In-Context Learning (ICL) paradigms, yet the resulting quadratic attention costs create a critical computational bottleneck. In this work, we propose In-context Sparse Attention (ISA), the first near-lossless empirical sparse framework tailored for ICL video editing. Our design is grounded in two key insights: first, context tokens exhibit significantly lower saliency than source tokens; second, we theoretically prove and empirically validate that query sharpness correlates with approximation error. Motivated by these findings, ISA implements an efficient pre-selection strategy to prune redundant context, followed by a dynamic query grouping mechanism that routes high-error queries to full attention and low-error ones to a computationally efficient 0-th order Taylor sparse attention. Furthermore, we build LIVEditor, a novel lightning video editing model, via ISA and a proposed video-editing data pipeline that curated a 1.7M high-quality dataset. Extensive experiments demonstrate that LIVEditor achieves a ~60% reduction in attention-module latency while surpassing state-of-the-art methods across EditVerseBench, IVE-Bench, and VIE-Bench, delivering near-lossless acceleration without compromising visual fidelity.
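As a rough illustration of the pipeline the abstract describes (context pre-selection, then routing queries between full and sparse attention), the NumPy sketch below may help. It is not the authors' implementation. The saliency score in `prune_context`, the max-probability sharpness proxy, the assumption that sharper queries are the high-error ones, the block-mean reading of "0-th order Taylor sparse attention", and all function and parameter names (`keep`, `block`, `full_fraction`) are assumptions made for illustration only.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def full_attention(q, k, v):
    # Standard scaled dot-product attention (single head, no mask).
    return softmax(q @ k.T / np.sqrt(q.shape[-1])) @ v

def prune_context(q, k_ctx, v_ctx, keep):
    # ASSUMPTION: saliency of a context token = its logit against the
    # mean-pooled query. Keep the `keep` highest-scoring tokens, in order.
    scores = q.mean(axis=0) @ k_ctx.T / np.sqrt(q.shape[-1])
    idx = np.sort(np.argsort(scores)[::-1][:keep])
    return k_ctx[idx], v_ctx[idx]

def block_summaries(k, v, block):
    # Mean key, mean value, and token count for each contiguous block.
    starts = range(0, k.shape[0], block)
    k_bar = np.stack([k[s:s + block].mean(axis=0) for s in starts])
    v_bar = np.stack([v[s:s + block].mean(axis=0) for s in starts])
    sizes = np.array([len(k[s:s + block]) for s in starts], dtype=float)
    return k_bar, v_bar, sizes

def block_logits(q, k_bar, sizes):
    # Adding log(size) weights each block by the number of tokens it holds.
    return q @ k_bar.T / np.sqrt(q.shape[-1]) + np.log(sizes)

def block_mean_attention(q, k, v, block):
    # ASSUMPTION: "0-th order" means exp(q.k_j) is replaced by exp(q.k_bar)
    # for every key k_j in a block, so a block acts like one weighted token.
    k_bar, v_bar, sizes = block_summaries(k, v, block)
    return softmax(block_logits(q, k_bar, sizes)) @ v_bar

def query_sharpness(q, k, block):
    # ASSUMPTION: sharpness proxy = peak block-level attention probability.
    k_bar, _, sizes = block_summaries(k, k, block)
    return softmax(block_logits(q, k_bar, sizes)).max(axis=-1)

def routed_attention(q, k_src, v_src, k_ctx, v_ctx,
                     keep=8, block=4, full_fraction=0.25):
    # 1) Prune context tokens, then concatenate with the source tokens.
    k_c, v_c = prune_context(q, k_ctx, v_ctx, keep)
    k = np.concatenate([k_src, k_c])
    v = np.concatenate([v_src, v_c])
    # 2) Rank queries by sharpness; ASSUMPTION: sharper means higher error.
    order = np.argsort(-query_sharpness(q, k, block))
    n_full = int(round(full_fraction * q.shape[0]))
    full_idx, sparse_idx = order[:n_full], order[n_full:]
    # 3) Route: exact attention for one group, approximation for the other.
    out = np.empty((q.shape[0], v.shape[1]))
    out[full_idx] = full_attention(q[full_idx], k, v)
    out[sparse_idx] = block_mean_attention(q[sparse_idx], k, v, block)
    return out, full_idx

rng = np.random.default_rng(0)
q = rng.normal(size=(8, 16))
k_src, v_src = rng.normal(size=(12, 16)), rng.normal(size=(12, 16))
k_ctx, v_ctx = rng.normal(size=(20, 16)), rng.normal(size=(20, 16))
out, full_idx = routed_attention(q, k_src, v_src, k_ctx, v_ctx)
print(out.shape, len(full_idx))
```

In this sketch, setting `block=1` makes the block-mean path coincide with full attention, which gives a simple sanity check; larger blocks trade accuracy for fewer key and value summaries per query.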