Jun 8, 2026 | arXiv:2606.09250

LiteVSR: Lightweight Adaptation of Frozen Diffusion Transformers for Video Super-Resolution

Yu Cao, Ziquan Liu, Zhensong Zhang, Jiankang Deng, Shaogang Gong, Jifei Song

AI Summary

This paper introduces LiteVSR, a lightweight framework that adapts a completely frozen Diffusion Transformer to Video Super-Resolution (VSR). It builds on flow matching, where the model predicts a constant velocity field across timesteps, so adaptation reduces to learning a fixed injection pattern rather than time-varying transformations. This avoids extensive fine-tuning and the backbone duplication that ControlNet-style adapters require under Diffusion Transformers. LiteVSR achieves competitive restoration quality with only 11.25% trainable parameters and 12 GPU-hours of training on a single A100.

Key Contribution

LiteVSR achieves competitive video super-resolution quality with only 11.25% trainable parameters by pairing a completely frozen Diffusion Transformer with a lightweight State-Aware Adapter.

Abstract

Adapting large-scale pre-trained video generators for Video Super-Resolution (VSR) in novel domains remains computationally prohibitive. Methods that reformulate generation as direct low-quality (LQ) to high-quality (HQ) mappings deviate from the original generative formulation, demanding extensive fine-tuning. ControlNet-style adapters lose their efficiency under modern Diffusion Transformers because the absence of an encoder-decoder hierarchy forces duplication of the entire backbone. We observe that flow matching offers a principled alternative for cross-domain VSR adaptation. Because the model predicts a constant velocity field across all timesteps, the adaptation task reduces to learning a fixed injection pattern rather than time-varying transformations. Building on this insight, we propose LiteVSR, a minimalist framework that performs VSR using a completely frozen Diffusion Transformer with a lightweight State-Aware Adapter. The adapter employs a dual-stream architecture that extracts static structural cues from the LQ input and dynamic cues from intermediate denoising states, aligning them through time-dependent cross-attention to enable an adaptive transition from structural alignment to texture refinement as denoising proceeds. LiteVSR achieves competitive restoration quality with only 11.25% trainable parameters and 12 GPU-hours of training on a single A100, while remaining compatible with fast sampling (down to a single step).
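The constant-velocity property mentioned in the abstract can be illustrated with a toy example. The sketch below is not the paper's implementation. It assumes the common straight-path flow matching convention, with noise at t = 0 and the clean sample at t = 1, and it uses a hypothetical `oracle` function in place of a trained network. All names here are illustrative.

```python
import random


def interpolate(x0, x1, t):
    """Point at time t on the straight path from noise x0 (t=0) to data x1 (t=1)."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]


def target_velocity(x0, x1):
    """Time derivative of the straight path: x1 - x0, which does not depend on t."""
    return [b - a for a, b in zip(x0, x1)]


def euler_sample(x0, velocity_fn, num_steps):
    """Integrate dx/dt = velocity_fn(x, t) from t=0 to t=1 with fixed Euler steps."""
    x = list(x0)
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = i * dt
        v = velocity_fn(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


random.seed(0)
x1 = [0.2, -0.5, 0.9, 0.1]                 # toy stand-in for a high-quality latent
x0 = [random.gauss(0.0, 1.0) for _ in x1]  # Gaussian noise sample

v = target_velocity(x0, x1)


def oracle(x, t):
    """Hypothetical perfect predictor: returns the same velocity at every timestep."""
    return v


one_step = euler_sample(x0, oracle, 1)     # single Euler step
many_steps = euler_sample(x0, oracle, 50)  # fine-grained integration
```

Under these assumptions the regression target is identical at every timestep, and both `one_step` and `many_steps` land on `x1` up to floating-point error. This is the intuition behind single-step sampling; a real network only approximates the velocity, so results with few steps depend on how well it was trained.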

Computer Vision

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
