Duc Thanh Nguyen

D Attention and the efficient but temporally limited Local Factorized Attention. To resolve this trade-off, we propose Matrix Attention, a frame-level temporal attention mechanism that processes an entire frame as a matrix and generates query, key, and value matrices via matrix-native operations. By attending across frames rather than tokens, Matrix Attention effectively preserves global spatio-temporal structure and adapts to significant motion. We build FrameDiT-G, a DiT architecture based on Matrix Attention, and further introduce FrameDiT-H, which integrates Matrix Attention with Local Factorized Attention to capture both large and small motion. Extensive experiments show that FrameDiT-H achieves state-of-the-art results across multiple video generation benchmarks, offering improved temporal coherence and video quality while maintaining efficiency comparable to Local Factorized Attention.

1 Introduction

Table 1: Comparison of our proposed Global and Hybrid (Global–Local) factorized attention mechanisms (bold) with existing attention designs for DiT. The symbols ✓ and ✗ indicate whether each method possesses a given property.

Papers on Lattice

Total citations

Topics

Research focus

Architecture Design (Transformers, SSMs, MoE) (1) · Computer Vision (1) · Training Efficiency & Optimization (1)

Frequent co-authors

Minh Khoa Le (1) · Kien Do (1) · Truyen Tran (1)

Papers (1)

Mar 10, 2026 · also Cohere

FrameDiT: Diffusion Transformer with Frame-Level Matrix Attention for Efficient Video Generation

FrameDiT achieves state-of-the-art video generation by ditching token-level attention for a novel matrix-based attention that operates directly on entire frames.

Minh Khoa Le, Kien Do, Duc Thanh Nguyen +1

Architecture Design (Transformers, SSMs, MoE) · Computer Vision · Training Efficiency & Optimization
