This paper introduces a dynamic thresholding scheme for sparse attention in Diffusion Transformers (DiTs) that aims to raise sparsity without sacrificing accuracy during video generation. The authors frame token filtering as a dual-goal optimization problem and argue that holding the recall rate fixed is sufficient to preserve accuracy, whereas Top-p considers only the accuracy constraint and Top-k fixes the computational budget at the cost of a looser accuracy constraint. On Wan 2.2, the approach raises sparsity from 61.42% (BLASST) to 82% with a VBench drop of less than 5%. This corresponds to roughly 15% less attention computation and a 1.61x increase in computational efficiency, which is 1.18x higher than that of BLASST.
This method reaches 82% sparsity on Wan 2.2 with a VBench drop of less than 5%, making sparse attention for video Diffusion Transformers more efficient.
Sparse attention accelerates Diffusion Transformers (DiTs) for video generation by computing only the important tokens and skipping the rest. The token selection strategy is key to balancing sparsity and accuracy. We formulate the token filtering process as a dual-goal optimization problem: maximizing sparsity and minimizing accuracy degradation. Existing algorithms cannot fulfill both objectives simultaneously. For example, Top-p considers only the accuracy constraint, while Top-k maintains a fixed computational budget but loosens the accuracy constraint. This paper demonstrates that maintaining a fixed recall rate is sufficient for ensuring accuracy, whereas a fixed threshold is suboptimal for reducing computational cost. Therefore, we propose a dynamic thresholding scheme that improves sparsity while maintaining the same level of accuracy. Furthermore, our algorithm is deeply integrated with Flash Attention (FA), eliminating any additional masking overhead. Experimental results on Wan 2.2 validate that, compared to the BLASST algorithm, which is also integrated with FA, our dynamic thresholding strategy raises sparsity from 61.42% to 82% with a VBench metric drop of less than 5%. This yields an approximately 15% reduction in attention computation and a 1.61x increase in computational efficiency, which is 1.18x higher than that of BLASST.
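To make the trade-off between the selection rules concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes that recall means the share of a row's softmax attention mass retained by the kept tokens, which the abstract does not define. The function names, the two synthetic attention rows, and the values k=64, p=0.9, and tau=1/256 are all illustrative. It does not implement the paper's dynamic thresholding scheme or its Flash Attention integration, neither of which the abstract specifies.

```python
import numpy as np


def softmax(logits):
    """Numerically stable softmax over a 1-D array of attention logits."""
    e = np.exp(logits - logits.max())
    return e / e.sum()


def top_k_mask(p, k):
    """Keep the k largest probabilities: a fixed compute budget per row."""
    mask = np.zeros(p.shape, dtype=bool)
    mask[np.argsort(p)[-k:]] = True
    return mask


def top_p_mask(p, target):
    """Keep the smallest set of largest entries whose mass reaches `target`."""
    order = np.argsort(p)[::-1]            # indices sorted by descending probability
    cum = np.cumsum(p[order])
    n_keep = min(int(np.searchsorted(cum, target)) + 1, p.size)
    mask = np.zeros(p.shape, dtype=bool)
    mask[order[:n_keep]] = True
    return mask


def threshold_mask(p, tau):
    """Keep every entry at or above one fixed threshold `tau`."""
    return p >= tau


def recall(p, mask):
    """Assumed definition: attention mass retained by the kept tokens."""
    return float(p[mask].sum())


def sparsity(mask):
    """Fraction of tokens that are skipped."""
    return 1.0 - float(mask.mean())


# Two synthetic attention rows over 256 keys (illustrative, not from the paper):
# one concentrated on a few keys, one close to uniform.
peaked = softmax(np.linspace(0.0, 12.0, 256))
flat = softmax(np.linspace(0.0, 1.0, 256))

for name, p in {"peaked": peaked, "flat": flat}.items():
    rules = [
        ("top-k", top_k_mask(p, 64)),
        ("top-p", top_p_mask(p, 0.9)),
        ("threshold", threshold_mask(p, 1.0 / 256)),
    ]
    for rule, mask in rules:
        print(f"{name:6s} {rule:9s} sparsity={sparsity(mask):.3f} "
              f"recall={recall(p, mask):.3f}")
```

Under these assumptions, Top-k gives the same sparsity on both rows while its recall varies, Top-p meets the recall target on both rows but requires a sort, and a single fixed threshold yields a different recall on each row.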