Mar 10, 2026

arXiv:2603.09657

When to Lock Attention: Training-Free KV Control in Video Diffusion

Tianyi Zeng, Jincheng Gao, Tianyi Wang, Zijie Meng, Miao Zhang, Jun Yin, Haoyuan Sun, Junfeng Jiao, Christian Claudel, Junbo Tan, Xueqian Wang

AI Summary

The paper introduces KV-Lock, a training-free framework for DiT-based video diffusion models that dynamically controls key-value (KV) caching and classifier-free guidance (CFG) scale based on a diffusion hallucination metric. This metric quantifies generation diversity, allowing KV-Lock to strengthen background KV locking and amplify conditional guidance when hallucination risk is detected. Experiments demonstrate that KV-Lock improves foreground quality while maintaining high background fidelity in video editing tasks compared to existing methods.

Key Contribution

Improve video editing without retraining by dynamically locking background features according to a "hallucination metric" (the variance of the denoising prediction) that flags when the diffusion model is likely to produce artifacts.

Abstract

Maintaining background consistency while enhancing foreground quality remains a core challenge in video editing. Injecting full-image information often leads to background artifacts, whereas rigid background locking severely constrains the model's capacity for foreground generation. To address this issue, we propose KV-Lock, a training-free framework tailored for DiT-based video diffusion models. Our core insight is that the hallucination metric (variance of denoising prediction) directly quantifies generation diversity, which is inherently linked to the classifier-free guidance (CFG) scale. Building upon this, KV-Lock leverages diffusion hallucination detection to dynamically schedule two key components: the fusion ratio between cached background key-values (KVs) and newly generated KVs, and the CFG scale. When hallucination risk is detected, KV-Lock strengthens background KV locking and simultaneously amplifies conditional guidance for foreground generation, thereby mitigating artifacts and improving generation fidelity. As a training-free, plug-and-play module, KV-Lock can be easily integrated into any pre-trained DiT-based model. Extensive experiments validate that our method outperforms existing approaches, improving foreground quality while maintaining high background fidelity across various video editing tasks.
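The scheduling idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not give formulas, so the way the variance is estimated (across a handful of denoising predictions), the linear ramp, the threshold, and every function name and default value below are hypothetical choices made only for illustration. Only the overall behavior follows the text: higher prediction variance leads to stronger background KV locking and a larger CFG scale.

```python
from statistics import pvariance


def hallucination_score(predictions):
    """Mean per-element variance across several denoising predictions.

    Assumption: `predictions` is a list of equal-length lists of floats,
    e.g. flattened noise predictions for the same step.
    """
    n = len(predictions[0])
    return sum(pvariance([p[i] for p in predictions]) for i in range(n)) / n


def schedule(score, threshold=0.05, lock_lo=0.3, lock_hi=0.9,
             cfg_lo=5.0, cfg_hi=9.0):
    """Map a score to (lock_ratio, cfg_scale).

    Assumption: a linear ramp that saturates at twice the threshold.
    All default values are placeholders, not values from the paper.
    """
    t = min(max(score / (2.0 * threshold), 0.0), 1.0)
    lock_ratio = (1.0 - t) * lock_lo + t * lock_hi
    cfg_scale = (1.0 - t) * cfg_lo + t * cfg_hi
    return lock_ratio, cfg_scale


def fuse_kv(cached, fresh, background_mask, lock_ratio):
    """Blend cached and fresh KV entries for background tokens only.

    Foreground tokens (mask False) always use the freshly generated value.
    """
    return [
        lock_ratio * c + (1.0 - lock_ratio) * f if is_bg else f
        for c, f, is_bg in zip(cached, fresh, background_mask)
    ]


def cfg(uncond, cond, scale):
    """Standard classifier-free guidance combination."""
    return [u + scale * (c - u) for u, c in zip(uncond, cond)]


# Usage: a higher-variance step gets a stronger lock and a larger CFG scale.
calm = hallucination_score([[1.0, 2.0], [1.0, 2.0]])
risky = hallucination_score([[0.0, 0.0], [2.0, 2.0]])
lock_calm, cfg_calm = schedule(calm)
lock_risky, cfg_risky = schedule(risky)
```

In this sketch a single score drives both knobs, which mirrors the abstract's claim that KV locking and conditional guidance are strengthened together when hallucination risk is detected. A real DiT integration would apply the fusion to per-layer key and value tensors rather than to flat lists.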

Architecture Design (Transformers, SSMs, MoE)

Computer Vision

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
