Mar 10, 2026 · arXiv:2603.09721

FrameDiT: Diffusion Transformer with Frame-Level Matrix Attention for Efficient Video Generation

Minh Khoa Le, Kien Do, Duc Thanh Nguyen, Truyen Tran

AI Summary

The paper introduces Matrix Attention, a frame-level temporal attention mechanism for video diffusion models that processes entire frames as matrices to capture global spatio-temporal structure. FrameDiT-G and FrameDiT-H, DiT architectures incorporating Matrix Attention, are proposed to improve temporal coherence and video quality. Experiments demonstrate that FrameDiT-H achieves state-of-the-art results on video generation benchmarks while maintaining efficiency comparable to Local Factorized Attention.

Key Contribution

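FrameDiT achieves state-of-the-art video generation by replacing token-level temporal attention with Matrix Attention, a frame-level mechanism that attends across entire frames; the FrameDiT-H variant additionally retains Local Factorized Attention to handle small motion.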

Abstract

High-fidelity video generation remains challenging for diffusion models due to the difficulty of modeling complex spatio-temporal dynamics efficiently. Recent video diffusion methods typically represent a video as a sequence of spatio-temporal tokens, which can be modeled using Diffusion Transformers (DiTs). However, this approach faces a trade-off between the strong but expensive Full 3D Attention and the efficient but temporally limited Local Factorized Attention. To resolve this trade-off, we propose Matrix Attention, a frame-level temporal attention mechanism that processes an entire frame as a matrix and generates query, key, and value matrices via matrix-native operations. By attending across frames rather than tokens, Matrix Attention effectively preserves global spatio-temporal structure and adapts to significant motion. We build FrameDiT-G, a DiT architecture based on Matrix Attention, and further introduce FrameDiT-H, which integrates Matrix Attention with Local Factorized Attention to capture both large and small motion. Extensive experiments show that FrameDiT-H achieves state-of-the-art results across multiple video generation benchmarks, offering improved temporal coherence and video quality while maintaining efficiency comparable to Local Factorized Attention.
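
To make the frame-level idea concrete, the following is a minimal NumPy sketch, not the authors' implementation. The abstract does not specify the matrix-native operations, so the sketch assumes a bilinear projection (a left matrix mixing the N spatial positions and a right matrix mixing the D channels) and a scaled Frobenius inner product as the frame-to-frame score. The function name `frame_level_attention` and all parameter names are hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def frame_level_attention(video, left, right):
    """Attend across frames, treating each frame as an (N, D) matrix.

    video: array of shape (T, N, D) = (frames, tokens per frame, channels).
    left:  dict with keys "q", "k", "v", each an (N, N) matrix (assumed form).
    right: dict with keys "q", "k", "v", each a (D, D) matrix (assumed form).
    Returns the attended frames (T, N, D) and the frame-to-frame weights (T, T).
    """
    _, n, d = video.shape

    # Assumed matrix-native projection: P_t = L @ X_t @ R for every frame t.
    q = left["q"] @ video @ right["q"]
    k = left["k"] @ video @ right["k"]
    v = left["v"] @ video @ right["v"]

    # Assumed score: Frobenius inner product between whole query and key frames.
    scores = np.einsum("snd,tnd->st", q, k) / np.sqrt(n * d)
    weights = softmax(scores, axis=-1)

    # Each output frame is a weighted sum of whole value frames.
    out = np.einsum("st,tnd->snd", weights, v)
    return out, weights
```

Under these assumptions the score matrix has T × T entries, one per pair of frames, whereas attention over all spatio-temporal tokens would need (T·N) × (T·N) entries.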

Architecture Design (Transformers, SSMs, MoE) · Computer Vision · Training Efficiency & Optimization

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
