Virginia Tech | May 28, 2026 | arXiv:2605.30351

VideoMLA: Low-Rank Latent KV Cache for Minute-Scale Autoregressive Video Diffusion

Hidir Yesiltepe, Jiazhen Hu, Tuna Han Salih Meral, Adil Kaan Akan, Kaan Oktay, Hoda Eldardiry, Pinar Yanardag

AI Summary

This paper introduces VideoMLA, an approach to causal video diffusion that uses a low-rank latent key-value (KV) cache, reducing per-token KV memory by 92.7% at every cached layer while retaining quality. The authors find that the low-rank spectral assumption often used to motivate MLA in language models does not hold for pretrained video attention, and show that the effective rank is determined by the MLA bottleneck rather than the pretrained spectrum. On VBench, VideoMLA matches short-horizon streaming baselines, achieves the best overall score at long horizons among the evaluated methods, and improves throughput by 1.23x on a single B200 GPU.

Key Contribution

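VideoMLA replaces per-head keys and values with a shared low-rank latent and a shared positional key, cutting per-token KV memory in causal video diffusion by 92.7% at every cached layer while matching short-horizon baselines on VBench.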

Abstract

Long-rollout causal video diffusion has converged on a fixed-size sliding-window KV cache, with recent progress innovating within this layout by changing which tokens occupy the window or how their positions are encoded. The per-head KV layout itself, a dominant contributor to streaming memory and latency, has been mostly left unchanged. In this paper, we present the first study of Multi-Head Latent Attention (MLA) in video diffusion. VideoMLA replaces per-head keys and values with a shared low-rank content latent and a shared decoupled 3D-RoPE positional key, reducing per-token KV memory by 92.7% at every cached layer. We further investigate why MLA succeeds in video diffusion even though the spectral assumption often used to motivate it in language models does not hold: pretrained video attention is not low-rank, with 99%-energy effective rank far above any practical latent dimension. VideoMLA retains quality at compression ratios where direct spectral approximation would predict large reconstruction error. We show that the MLA bottleneck, rather than the pretrained spectrum, determines the effective rank: both spectral and random initialization occupy nearly the full rank budget from initialization, and training preserves this budget while adapting within it. On VBench, VideoMLA matches short-horizon streaming video diffusion baselines, achieves the best overall score at long horizons among evaluated methods, and improves throughput by 1.23x on a single B200.
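As a rough illustration of the arithmetic behind two of the abstract's claims, the sketch below compares per-token cache sizes for a per-head KV layout and a shared-latent layout, and computes an energy-based effective rank from a list of singular values. The head count, head dimension, latent width, and positional-key width are hypothetical values chosen only so that the ratio reproduces the quoted 92.7%; this page does not state the paper's actual configuration. The helper also assumes that "energy" means the sum of squared singular values, which is a common convention but is not confirmed here.

```python
def kv_entries_per_token_mha(num_heads: int, head_dim: int) -> int:
    """Cache entries per token per layer when every head stores its own K and V."""
    return 2 * num_heads * head_dim


def kv_entries_per_token_mla(latent_dim: int, rope_dim: int) -> int:
    """Cache entries per token per layer for one shared content latent
    plus one shared positional key."""
    return latent_dim + rope_dim


def effective_rank(singular_values, energy: float = 0.99) -> int:
    """Smallest k such that the top-k squared singular values hold at least
    `energy` of the total squared mass (assumed definition of energy)."""
    squared = sorted((s * s for s in singular_values), reverse=True)
    total = sum(squared)
    if total == 0:
        return 0
    running = 0.0
    for k, value in enumerate(squared, start=1):
        running += value
        if running >= energy * total:
            return k
    return len(squared)


# Hypothetical dimensions, NOT taken from the paper.
NUM_HEADS, HEAD_DIM = 12, 128
LATENT_DIM, ROPE_DIM = 160, 64

baseline = kv_entries_per_token_mha(NUM_HEADS, HEAD_DIM)     # 3072 entries
compressed = kv_entries_per_token_mla(LATENT_DIM, ROPE_DIM)  # 224 entries
reduction = 1.0 - compressed / baseline
print(f"per-token KV reduction: {reduction:.1%}")  # prints 92.7%

# A flat spectrum needs every direction to reach 99% energy,
# while a fast-decaying spectrum needs only a few.
flat_spectrum = [1.0] * 64
decaying_spectrum = [2.0 ** -i for i in range(10)]
print(effective_rank(flat_spectrum))      # 64
print(effective_rank(decaying_spectrum))  # 4
```

Under this definition, a matrix whose spectrum looks like the flat case cannot be approximated well by truncating it to a small latent dimension, which is the sense in which the abstract describes pretrained video attention as not low-rank.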

Architecture Design (Transformers, SSMs, MoE) | Computer Vision | Inference & Quantization | Multimodal Models

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
