USTCMar 9, 2026arXiv:2603.08536

SWIFT: Sliding Window Reconstruction for Few-Shot Training-Free Generated Video Attribution

Chao Wang, Zijin Yang, Yaofei Wang, Yuang Qi, Weiming Zhang, Nenghai Yu, Kejiang Chen

AI Summary

The paper introduces SWIFT, a novel approach for attributing generated videos back to their source model without requiring training or modification of the video. SWIFT leverages the temporal structure of videos, specifically the mapping from multiple pixel frames to a single latent frame within video chunks, to perform normal and corrupted reconstructions using a sliding window. By comparing the reconstruction losses, SWIFT generates an attribution signal, achieving high accuracy with few samples and even enabling zero-shot attribution for some models.

Key Contribution

Achieve over 90% accuracy in attributing generated videos to their source model with as few as 20 samples, all without training or modifying the videos themselves.

Abstract

Recent advancements in video generation technologies have been significant, resulting in their widespread application across multiple domains. However, concerns have been mounting over the potential misuse of generated content. Tracing the origin of generated videos has become crucial to mitigate potential misuse and identify responsible parties. Existing video attribution methods require additional operations or the training of source attribution models, which may degrade video quality or necessitate large amounts of training samples. To address these challenges, we define for the first time the "few-shot training-free generated video attribution" task and propose SWIFT, which is tightly integrated with the temporal characteristics of the video. By leveraging the "Pixel Frames(many) to Latent Frame(one)" temporal mapping within each video chunk, SWIFT applies a fixed-length sliding window to perform two distinct reconstructions: normal and corrupted. The variation in the losses between two reconstructions is then used as an attribution signal. We conducted an extensive evaluation of five state-of-the-art (SOTA) video generation models. Experimental results show that SWIFT achieves over 90% average attribution accuracy with merely 20 video samples across all models and even enables zero-shot attribution for HunyuanVideo, EasyAnimate, and Wan2.2. Our source code is available at https://github.com/wangchao0708/SWIFT.

Computer Vision Constitutional AI & AI Ethics Red-Teaming & Adversarial Robustness

Citation Metrics

Citations0

Influential citations0

References0

Year2026

VenueN/A

Related Papers

Finding related papers...

Search

SWIFT: Sliding Window Reconstruction for Few-Shot Training-Free Generated Video Attribution

Related Papers