The paper introduces Attention Sink Anchored Pruning (ASAP), a training-free token reduction method for Vision Transformers that addresses the attention sink problem. ASAP models ViT information flow as a Lazy Random Walk to identify and leverage the attention sink, using Radial Diffusion Clustering and Transition Weight Pooling to compress background redundancy. Experiments show ASAP achieves up to 48% throughput acceleration while maintaining or improving accuracy across image, video, and vision-language tasks, outperforming existing token reduction methods.
Attention sinks, typically a problem for ViTs, can actually be leveraged for efficient token pruning, leading to faster inference without sacrificing accuracy.
Vision Transformers (ViTs) face severe computational bottlenecks due to the quadratic complexity of self-attention at high resolutions. Existing token reduction methods rely on local metrics, such as single-layer attention scores, that are inherently vulnerable to the attention sink phenomenon, where uninformative tokens are paradoxically preserved over salient foreground objects. We propose ASAP (Attention Sink Anchored Pruning), a training-free framework that recasts this sink as a feature. Modeling ViT information flow as a Lazy Random Walk, ASAP identifies the sink as a dominant accumulator of probability mass. By computing the diffusion distance to the sink within the cumulative transition matrix, ASAP partitions tokens via Radial Diffusion Clustering and compresses background redundancy through Transition Weight Pooling in a single shot. Extensive experiments across image, video, and vision-language tasks demonstrate that ASAP outperforms state-of-the-art methods, accelerating throughput by up to 48% while maintaining, or even exceeding, baseline accuracy.
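The abstract gives no formulas, so the following is only a rough NumPy sketch of how such a pipeline could be wired together, not the authors' implementation. Everything specific in it is an assumption made for illustration: the lazy step `alpha * I + (1 - alpha) * A`, the use of column sums as "accumulated probability mass", the unweighted Euclidean distance between rows as a stand-in for diffusion distance, the rule that tokens closest to the sink count as background, and the hypothetical parameters `alpha`, `keep`, and `n_rings`.

```python
import numpy as np

def lazy_walk_rollout(attn_layers, alpha=0.5):
    """Cumulative transition matrix of a lazy random walk across layers.

    attn_layers: list of (N, N) row-stochastic, head-averaged attention maps.
    alpha: assumed laziness, i.e. the probability of staying on the same token.
    """
    n = attn_layers[0].shape[0]
    cumulative = np.eye(n)
    for attn in attn_layers:
        # A convex mix of identity and attention is still row-stochastic.
        step = alpha * np.eye(n) + (1.0 - alpha) * attn
        cumulative = step @ cumulative
    return cumulative

def sink_anchored_reduce(tokens, attn_layers, keep=4, n_rings=2, alpha=0.5):
    """Keep the `keep` tokens farthest from the sink; pool the rest in rings.

    tokens: (N, D) token features. Returns (reduced_tokens, sink_index).
    """
    n, d = tokens.shape
    T = lazy_walk_rollout(attn_layers, alpha)

    # Assumption: the sink is the token that accumulates the most mass.
    mass = T.sum(axis=0)
    sink = int(np.argmax(mass))

    # Assumption: plain L2 distance between rows approximates the
    # diffusion distance of each token to the sink.
    dist = np.linalg.norm(T - T[sink], axis=1)
    order = np.argsort(dist)            # nearest to the sink first

    kept = order[n - keep:]             # far from the sink: treated as salient
    background = order[:n - keep]       # near the sink (includes the sink)

    # Radial partition: split background into rings of increasing distance,
    # then merge each ring into one token weighted by accumulated mass.
    pooled = []
    for ring in np.array_split(background, n_rings):
        if len(ring) == 0:
            continue
        weights = mass[ring] / mass[ring].sum()
        pooled.append(weights @ tokens[ring])

    pooled = np.array(pooled).reshape(-1, d)
    return np.vstack([tokens[np.sort(kept)], pooled]), sink
```

With 8 input tokens, `keep=4`, and `n_rings=2`, this sketch returns 6 tokens: the 4 judged farthest from the sink, plus one merged token per ring of background. Because the reduction uses only attention maps that a forward pass already produces, no retraining is involved, which matches the training-free, single-shot framing above.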