May 28, 2026 · arXiv:2605.30409

SANA-Streaming: Real-time Streaming Video Editing with Hybrid Diffusion Transformer

Yuyang Zhao, Yicheng Pan, Qiyuan He, Tian Ye, Haozhe Liu, Enze Xie, Song Han

AI Summary

SANA-Streaming is a system-algorithm co-design for real-time, high-resolution streaming video editing built on a hybrid diffusion transformer. The architecture adds softmax attention to a subset of blocks to improve local modeling while keeping the efficiency of linear layers, and a Cycle-Reverse Regularization training strategy improves temporal consistency without requiring paired long edited videos. Fused GDN kernels and mixed-precision quantization tuned for the RTX 5090 let the system reach 24 end-to-end FPS at 1280x704 resolution on a single GPU, and the paper reports better temporal coherence and throughput than existing state-of-the-art methods.

Key Contribution

A hybrid diffusion transformer combined with system-level optimizations (fused GDN kernels and mixed-precision quantization) enables real-time streaming video editing at 1280x704 and 24 end-to-end FPS on a single RTX 5090 consumer GPU.

Abstract

Real-time streaming video-to-video editing (V2V) is critical for interactive applications such as live broadcasting and gaming, yet it remains a formidable challenge due to the stringent requirements for temporal consistency and inference throughput. In this paper, we present SANA-Streaming, a system-algorithm co-designed framework for high-resolution, real-time streaming video editing on consumer GPUs, with the following three core designs: (1) Hybrid Diffusion Transformer architecture introduces softmax attention in part of the blocks to improve local modeling capabilities while preserving the efficiency of linear layers. (2) Cycle-Reverse Regularization is a novel training strategy that enforces semantic consistency by predicting source frames from generated content via flow matching, improving temporal consistency without requiring paired long edited videos. (3) Efficient System Co-design combines fused GDN kernels and Mixed-Precision Quantization (MPQ) optimized for the NVIDIA Blackwell (RTX 5090) architecture. By profiling real-world throughput, our MPQ maximizes Tensor Core utilization while maintaining generation quality. The resulting system achieves real-time 1280 x 704 resolution editing at 24 end-to-end FPS on a single RTX 5090 GPU, with the DiT core running at 58 FPS. Experimental results demonstrate that our co-design approach significantly outperforms existing SOTA methods in both temporal coherence and system throughput.
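To make design (1) concrete, the sketch below interleaves softmax-attention blocks among linear-attention blocks. It is only an illustration under stated assumptions: the abstract does not specify the block schedule, so `softmax_every` is a hypothetical parameter, and a generic ReLU-kernel linear attention stands in for the paper's GDN layers. Projections, normalization, and MLPs are omitted.

```python
import numpy as np

def softmax_attention(q, k, v):
    """Standard softmax attention; the (N, N) score matrix makes cost quadratic in N."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)   # subtract row max for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

def linear_attention(q, k, v, eps=1e-6):
    """Generic ReLU-kernel linear attention (a stand-in, NOT the paper's GDN layer).

    Keys and values are first summarized into a (d, d) matrix, so cost is linear in N.
    """
    phi_q, phi_k = np.maximum(q, 0.0), np.maximum(k, 0.0)
    kv = phi_k.T @ v                 # (d, d) summary of all keys and values
    z = phi_k.sum(axis=0)            # (d,) normalizer
    return (phi_q @ kv) / (phi_q @ z + eps)[:, None]

def hybrid_stack(x, num_blocks=8, softmax_every=4):
    """Hypothetical schedule: every `softmax_every`-th block uses softmax attention."""
    kinds = []
    for i in range(num_blocks):
        use_softmax = (i + 1) % softmax_every == 0
        attn = softmax_attention if use_softmax else linear_attention
        x = x + attn(x, x, x)        # residual connection; Q/K/V projections omitted
        kinds.append("softmax" if use_softmax else "linear")
    return x, kinds

rng = np.random.default_rng(0)
tokens = rng.standard_normal((32, 16))   # 32 tokens, 16 channels (toy sizes)
out, kinds = hybrid_stack(tokens)
print(kinds)
```

With these toy defaults, two of the eight blocks pay the quadratic softmax cost and the other six stay linear in the token count, which is the trade-off the abstract describes.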

Architecture Design (Transformers, SSMs, MoE); Computer Vision Inference & Quantization

Citation Metrics

Citations: 0

Influential citations: 0

References: 42

Year: 2026

Venue: N/A
