NVIDIA | Tel Aviv University | Mar 5, 2026 | arXiv:2603.05503

Accelerating Text-to-Video Generation with Calibrated Sparse Attention

Shai Yehezkel, Shahar Yadin, Noam Elata, Yaron Ostrovsky-Berman, Bahjat Kawar

AI Summary

The paper introduces CalibAtt, a training-free method that accelerates text-to-video generation by exploiting sparsity and repetition patterns in spatiotemporal attention. CalibAtt runs an offline calibration pass to identify block-level sparsity and repetition patterns that are stable across inputs, then compiles them into optimized attention operations for each layer, head, and diffusion timestep. Experiments on Wan 2.1 14B, Mochi 1, and few-step distilled models show up to 1.58x end-to-end speedup, outperforming existing training-free methods while maintaining video quality and text-video alignment.

Key Contribution

CalibAtt, a training-free method that exploits attention sparsity patterns that are stable across inputs, achieves up to 1.58x end-to-end speedup in text-to-video generation.

Abstract

Recent diffusion models enable high-quality video generation, but suffer from slow runtimes. The large transformer-based backbones used in these models are bottlenecked by spatiotemporal attention. In this paper, we identify that a significant fraction of token-to-token connections consistently yield negligible scores across various inputs, and their patterns often repeat across queries. Thus, the attention computation in these cases can be skipped with little to no effect on the result. This observation continues to hold for connections among local token blocks. Motivated by this, we introduce CalibAtt, a training-free method that accelerates video generation via calibrated sparse attention. CalibAtt performs an offline calibration pass that identifies block-level sparsity and repetition patterns that are stable across inputs, and compiles these patterns into optimized attention operations for each layer, head, and diffusion timestep. At inference time, we compute the selected input-dependent connections densely, and skip the unselected ones in a hardware-efficient manner. Extensive experiments on Wan 2.1 14B, Mochi 1, and few-step distilled models at various resolutions show that CalibAtt achieves up to 1.58x end-to-end speedup, outperforming existing training-free methods while maintaining video generation quality and text-video alignment.
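The calibrate-then-skip idea described in the abstract can be illustrated with a minimal NumPy sketch for a single layer, head, and timestep. This is not the authors' implementation: the function names (`block_mass`, `calibrate`, `block_sparse_attention`), the `keep_mass` threshold, and the selection rule (keep the fewest key blocks that cover a target share of the average attention mass) are all assumptions made for illustration. The repetition patterns and the hardware-efficient kernels mentioned in the abstract are not modeled here.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def block_mass(q, k, block):
    # Attention probability mass that each query block places on each key
    # block, averaged over the queries inside the block. Rows sum to 1.
    # Assumes the sequence length is divisible by `block`.
    n, d = q.shape
    p = softmax(q @ k.T / np.sqrt(d))
    nb = n // block
    return p.reshape(nb, block, nb, block).sum(axis=3).mean(axis=1)

def calibrate(samples, block, keep_mass=0.95):
    # Hypothetical offline pass: average the block mass over calibration
    # inputs, then for each query block keep the smallest set of key blocks
    # whose cumulative mass reaches `keep_mass`.
    mass = np.mean([block_mass(q, k, block) for q, k in samples], axis=0)
    order = np.argsort(-mass, axis=1)
    sorted_mass = np.take_along_axis(mass, order, axis=1)
    mass_before = np.cumsum(sorted_mass, axis=1) - sorted_mass
    keep_sorted = mass_before < keep_mass
    mask = np.zeros(mass.shape, dtype=bool)
    np.put_along_axis(mask, order, keep_sorted, axis=1)
    return mask

def block_sparse_attention(q, k, v, mask, block):
    # Inference: each query block attends densely to its selected key
    # blocks only; unselected blocks are never computed.
    n, d = q.shape
    out = np.empty((n, v.shape[1]))
    for i in range(mask.shape[0]):
        cols = np.flatnonzero(mask[i])
        idx = (cols[:, None] * block + np.arange(block)).ravel()
        rows = slice(i * block, (i + 1) * block)
        p = softmax(q[rows] @ k[idx].T / np.sqrt(d))
        out[rows] = p @ v[idx]
    return out
```

In this sketch the mask is fixed after calibration, so the per-step cost at inference scales with the number of kept blocks rather than with the full sequence length squared. With an all-true mask, `block_sparse_attention` reduces to ordinary dense attention, which is a convenient sanity check.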

Architecture Design (Transformers, SSMs, MoE) | Computer Vision | Multimodal Models

Citation Metrics

Citations: 0

Influential citations: 0

References: 42

Year: 2026

Venue: N/A
