Stanford HAI, CUHK, NTU, Shanghai Innovation, SJTU

Jun 15, 2026

arXiv:2606.16449

PermaVid: Consistent Video Generation Across Edits via Disentangled Context Memory

Shuai Yang, Bingjie Gao, Ziwei Liu, Jiaqi Wang, Dahua Lin, Tong Wu

AI Summary

This paper introduces PermaVid, a framework for video generation that stays consistent across edits by using a multi-modal context memory. PermaVid disentangles spatial context into semantic appearance and geometric structure, and applies an edit-aware memory update and retrieval strategy so that the memory stays aligned with subsequent observations. The reported experiments show that PermaVid significantly outperforms state-of-the-art methods at maintaining long-term semantic and structural consistency after edits.

Key Contribution

PermaVid maintains long-term semantic and structural consistency in generated video after edits by disentangling appearance and geometry into two complementary memory banks: an RGB context memory and a depth context memory.

Abstract

Consistent video generation under editing operations requires persistence: when edits modify scene appearance or layout, subsequent generations should remain coherent across time and viewpoints. However, existing memory designs struggle to maintain long-term consistency after such modifications, as stored contexts may become outdated or invalid. To address this, we propose PermaVid, a novel framework built upon a multi-modal context memory that disentangles spatial context into semantic appearance and geometric structure, together with an edit-aware memory update and retrieval strategy that keeps memory evolution aligned with subsequent observations. Specifically, we develop two complementary memory banks: an RGB context memory that captures appearance-aware observations while implicitly encoding geometry, and a depth context memory that preserves geometry-only structure disentangled from semantics. Building on this design, we introduce a memory-guided video generation model that performs multi-modal feature fusion under reference conditions drawn from mixed-modality memory contexts. Experiments demonstrate that our method maintains strong long-term semantic and structural consistency after edits, significantly outperforming state-of-the-art methods.
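The abstract does not give implementation details, so the following is only a minimal sketch of how two context memory banks with an edit-aware update and retrieval step might be organized. Every name here (`Entry`, `ContextMemory`, `apply_edit`, `retrieve`), the single-angle stand-in for camera pose, the invalidate-on-edit rule, and the nearest-view retrieval are assumptions made for illustration, not the authors' method.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Entry:
    frame_id: int      # generation step at which the observation was stored
    view_angle: float  # hypothetical 1-D camera yaw (degrees), standing in for a full pose
    data: str          # placeholder for an RGB frame or a depth map
    valid: bool = True # False once an edit has made this context outdated


@dataclass
class ContextMemory:
    """Illustrative pair of memory banks: RGB (appearance) and depth (geometry)."""
    rgb: List[Entry] = field(default_factory=list)
    depth: List[Entry] = field(default_factory=list)

    def write(self, frame_id: int, view_angle: float, rgb_data: str, depth_data: str) -> None:
        # Store one observation in both banks.
        self.rgb.append(Entry(frame_id, view_angle, rgb_data))
        self.depth.append(Entry(frame_id, view_angle, depth_data))

    def apply_edit(self, changes_geometry: bool) -> None:
        # Assumed rule: any edit outdates stored appearance; depth is only
        # outdated when the edit also changes the scene layout.
        for e in self.rgb:
            e.valid = False
        if changes_geometry:
            for e in self.depth:
                e.valid = False

    def retrieve(self, view_angle: float, k: int = 2) -> Tuple[List[Entry], List[Entry]]:
        # Return up to k still-valid entries per bank, nearest in view angle.
        def nearest(bank: List[Entry]) -> List[Entry]:
            live = [e for e in bank if e.valid]
            return sorted(live, key=lambda e: abs(e.view_angle - view_angle))[:k]
        return nearest(self.rgb), nearest(self.depth)


if __name__ == "__main__":
    mem = ContextMemory()
    for i, angle in enumerate([0.0, 30.0, 60.0]):
        mem.write(i, angle, f"rgb{i}", f"depth{i}")
    mem.apply_edit(changes_geometry=False)   # appearance-only edit
    mem.write(3, 90.0, "rgb3", "depth3")     # first post-edit observation
    rgb_refs, depth_refs = mem.retrieve(view_angle=10.0)
    print([e.frame_id for e in rgb_refs], [e.frame_id for e in depth_refs])
```

In this toy setup, an appearance-only edit leaves the depth bank usable as a geometric reference while the RGB bank falls back to post-edit observations, which is one way mixed-modality reference contexts could arise.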

Computer Vision, Multimodal Models

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
