IAS, Michigan State, UESTC, University of Electronic Science and Technology

Apr 22, 2026

arXiv:2604.20470

DynamicRad: Content-Adaptive Sparse Attention for Long Video Diffusion

Yongji Long, Shijun Liang, Jintao Li, Yun Li

AI Summary

The paper introduces DynamicRad, a content-adaptive sparse attention mechanism for video diffusion models that grounds adaptive selection in a radial locality prior and offers two modes: static-ratio for speed-optimized execution and dynamic-threshold for quality-first filtering. To avoid online search overhead, the authors run Bayesian Optimization (BO) offline and pair it with a semantic motion router that maps prompt embeddings to sparsity regimes. Experiments on HunyuanVideo and Wan2.1-14B show 1.7x-2.5x inference speedups with over 80% effective sparsity; in some long-sequence settings, the dynamic mode matches or exceeds the dense baseline.
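The two selection modes can be illustrated with a small NumPy sketch. This is not the authors' implementation: the exponential `radial_prior`, the `decay`, `keep_ratio`, and `mass` parameters, and all function names are assumptions made for illustration. The sketch only shows the general difference between keeping a fixed fraction of keys per query and keeping as many keys as needed to reach a target share of the score mass.

```python
import numpy as np

def radial_prior(num_frames, tokens_per_frame, decay=1.0):
    """Hypothetical locality prior: score decays with temporal frame distance."""
    frame_idx = np.repeat(np.arange(num_frames), tokens_per_frame)
    dist = np.abs(frame_idx[:, None] - frame_idx[None, :])
    return np.exp(-decay * dist)

def static_ratio_mask(scores, keep_ratio):
    """Static-ratio style: keep a fixed fraction of keys for every query."""
    n = scores.shape[-1]
    k = max(1, int(round(keep_ratio * n)))
    # The first k columns of argpartition hold the k largest scores per row.
    idx = np.argpartition(-scores, k - 1, axis=-1)[:, :k]
    mask = np.zeros_like(scores, dtype=bool)
    np.put_along_axis(mask, idx, True, axis=-1)
    return mask

def dynamic_threshold_mask(scores, mass):
    """Dynamic-threshold style: keep the fewest keys covering `mass` of each row."""
    probs = scores / scores.sum(axis=-1, keepdims=True)
    order = np.argsort(-probs, axis=-1)
    sorted_p = np.take_along_axis(probs, order, axis=-1)
    cum = np.cumsum(sorted_p, axis=-1)
    # Keep a key if the mass accumulated before it is still below the target.
    keep_sorted = (cum - sorted_p) < mass
    mask = np.zeros_like(scores, dtype=bool)
    np.put_along_axis(mask, order, keep_sorted, axis=-1)
    return mask

scores = radial_prior(num_frames=4, tokens_per_frame=2)
static_mask = static_ratio_mask(scores, keep_ratio=0.25)
dynamic_mask = dynamic_threshold_mask(scores, mass=0.5)
```

In this toy setup the static mode always keeps the same number of keys per query, while the dynamic mode lets the kept count vary with how concentrated each row's scores are.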

Key Contribution

Achieves roughly 2x inference speedups (1.7x-2.5x reported) in long video diffusion with little or no quality loss, by restricting attention to the most relevant parts of the video and choosing the sparsity regime from the prompt's motion and semantic content.

Abstract

Leveraging the natural spatiotemporal energy decay in video diffusion offers a path to efficiency, yet relying solely on rigid static masks risks losing critical long-range information in complex dynamics. To address this issue, we propose DynamicRad, a unified sparse-attention paradigm that grounds adaptive selection within a radial locality prior. DynamicRad introduces a dual-mode strategy: static-ratio for speed-optimized execution and dynamic-threshold for quality-first filtering. To ensure robustness without online search overhead, we integrate an offline Bayesian Optimization (BO) pipeline coupled with a semantic motion router. This lightweight projection module maps prompt embeddings to optimal sparsity regimes with minimal runtime overhead. Unlike online profiling methods, our offline BO optimizes attention reconstruction error (MSE) on a physics-based proxy task, ensuring rapid convergence. Experiments on HunyuanVideo and Wan2.1-14B demonstrate that DynamicRad pushes the efficiency-quality Pareto frontier, achieving 1.7x-2.5x inference speedups with over 80% effective sparsity. In some long-sequence settings, the dynamic mode even matches or exceeds the dense baseline, while mask-aware LoRA further improves long-horizon coherence. Code is available at https://github.com/Adamlong3/DynamicRad.
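The attention reconstruction error mentioned in the abstract can be made concrete with a minimal sketch: compare the output of dense attention with the output of the same attention restricted by a boolean mask, and take the mean squared difference. The function names and the single-head, unbatched layout are assumptions for illustration. The paper's physics-based proxy task and its BO pipeline are not reproduced here; a sparsity search could use a quantity like this as its objective.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax; masked (-inf) entries get weight 0."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v, mask=None):
    """Single-head scaled dot-product attention with an optional boolean mask.

    mask[i, j] == True means query i may attend to key j.
    Every row of the mask must keep at least one key.
    """
    logits = q @ k.T / np.sqrt(q.shape[-1])
    if mask is not None:
        logits = np.where(mask, logits, -np.inf)
    return softmax(logits) @ v

def reconstruction_mse(q, k, v, mask):
    """MSE between the dense attention output and the masked (sparse) output."""
    dense = attention(q, k, v)
    sparse = attention(q, k, v, mask)
    return float(np.mean((dense - sparse) ** 2))

rng = np.random.default_rng(0)
n_tokens, dim = 16, 8
q = rng.standard_normal((n_tokens, dim))
k = rng.standard_normal((n_tokens, dim))
v = rng.standard_normal((n_tokens, dim))

full_mask = np.ones((n_tokens, n_tokens), dtype=bool)
diag_mask = np.eye(n_tokens, dtype=bool)
err_full = reconstruction_mse(q, k, v, full_mask)  # nothing pruned
err_diag = reconstruction_mse(q, k, v, diag_mask)  # each query sees only itself
```

A mask that prunes nothing reproduces the dense output exactly, so its error is zero; more aggressive masks trade a larger error for fewer attended keys.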

Architecture Design (Transformers, SSMs, MoE)

Computer Vision

Training Efficiency & Optimization

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
