Mar 16, 2026 · arXiv:2603.15478

ViFeEdit: A Video-Free Tuner of Your Video Diffusion Transformer

Ruonan Yu, Zhenxiong Tan, Zigeng Chen, Songhua Liu, Xinchao Wang

AI Summary

ViFeEdit is introduced as a video-free tuning framework for video diffusion transformers (DiTs) that circumvents the need for video training data. It achieves this by reparameterizing the DiT architecture to decouple spatial independence from 3D attention, enabling adaptation using only 2D images. The framework employs a dual-path pipeline with separate timestep embeddings, enhancing adaptability to various conditioning signals while maintaining temporal consistency.

Key Contribution

Achieves versatile, controllable video generation and editing with diffusion transformers without any video training data, by reparameterizing the architecture and training only on 2D images.

Abstract

Diffusion Transformers (DiTs) have demonstrated remarkable scalability and quality in image and video generation, prompting growing interest in extending them to controllable generation and editing tasks. However, compared with their image counterparts, progress in video control and editing remains limited, mainly due to the scarcity of paired video data and the high computational cost of training video diffusion models. To address this issue, we propose ViFeEdit, a video-free tuning framework for video diffusion transformers. Without requiring any form of video training data, ViFeEdit achieves versatile video generation and editing, adapted solely with 2D images. At the core of our approach is an architectural reparameterization that decouples spatial independence from the full 3D attention in modern video diffusion transformers, which enables visually faithful editing while maintaining temporal consistency with only minimal additional parameters. Moreover, this design operates in a dual-path pipeline with separate timestep embeddings for noise scheduling, exhibiting strong adaptability to diverse conditioning signals. Extensive experiments demonstrate that our method delivers promising results in controllable video generation and editing with only minimal training on 2D image data. Code is available at https://github.com/Lexie-YU/ViFeEdit.
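The abstract contrasts full 3D attention with a spatially independent component. The toy sketch below is not the authors' reparameterization. It only illustrates a generic property of attention that may help build intuition: attention restricted to each frame coincides with full 3D attention when the input has a single frame (a 2D image), and differs once there are multiple frames. The function names, the identity query/key/value projections, and the tiny token values are all assumptions made for illustration.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of floats.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def attention(tokens):
    # Toy single-head self-attention with identity Q/K/V projections
    # (an assumption for illustration, not a real DiT layer).
    d = len(tokens[0])
    out = []
    for q in tokens:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in tokens]
        w = softmax(scores)
        out.append([sum(wi * v[j] for wi, v in zip(w, tokens)) for j in range(d)])
    return out

def full_3d_attention(video):
    # video: list of T frames, each a list of N token vectors.
    # All T*N tokens attend to one another.
    n = len(video[0])
    flat = [tok for frame in video for tok in frame]
    out = attention(flat)
    return [out[i * n:(i + 1) * n] for i in range(len(video))]

def spatial_attention(video):
    # Each frame attends only to its own tokens.
    return [attention(frame) for frame in video]

# A single image is a "video" with T = 1.
image = [[[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]]
# Add a second, different frame to get T = 2.
video = image + [[[0.2, 0.9], [0.7, 0.1], [0.3, 0.3]]]

same_on_image = full_3d_attention(image) == spatial_attention(image)
same_on_video = full_3d_attention(video) == spatial_attention(video)
print(same_on_image, same_on_video)
```

With one frame the two functions process the same token list in the same way, so their outputs match. With two frames, the full 3D variant mixes information across frames and the outputs diverge. How ViFeEdit actually separates and recombines these components is described in the paper and the linked repository, not here.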

Architecture Design (Transformers, SSMs, MoE) · Computer Vision · Training Efficiency & Optimization

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
