Jun 11, 2026 | arXiv:2606.13035

TetherCache: Stabilizing Autoregressive Long-Form Video Generation with Gated Recall and Trusted Alignment

Yu Meng, Xiangyang Luo, Letian Li, Wen Jiang, Wenyuan Jiang, Chen Gao, Xinlei Chen, Yong Li, Xiao-Ping Zhang

AI Summary

This paper introduces TetherCache, a training-free cache management strategy for autoregressive long-form video generation, where a limited KV-cache budget and repeated conditioning on self-generated frames cause context loss and accumulating quality drift. TetherCache combines two mechanisms: GRAB, which selects long-range memory frames that are both relevant and temporally diverse, and TAME, which lightly edits recalled memory tokens so that their statistics match a trusted context distribution. Built on Self-Forcing, the approach improves VBench-Long results in the 30s, 60s, and 240s settings, and reduces quality drift from 7.84 to 1.33 for 240s generation.

Key Contribution

TetherCache reduces quality drift in 240s video generation on VBench-Long from 7.84 to 1.33, without additional training and under a fixed cache budget.

Abstract

Autoregressive video diffusion models provide a natural formulation for streaming and variable-length video generation by conditioning newly generated frames on previously generated content. However, extending these models to minute-level generation remains challenging: the limited KV-cache budget prevents the model from retaining the full history, while repeatedly conditioning on self-generated frames induces a context distribution shift that accumulates over time, leading to visual artifacts, quality degradation, and temporal drift. In this paper, we propose TetherCache, a training-free and plug-and-play cache management strategy for drift-resistant long video generation. TetherCache organizes the cache into sink, memory, and recent regions, and introduces two complementary mechanisms. First, GRAB (Gated Recall with Attention-Diversity Balancing) selects long-range memory frames using a gated score that combines attention-based relevance with temporal diversity, preserving informative yet diverse historical context under a fixed cache budget. Second, TAME (Trusted Alignment via Memory Editing) lightly edits newly recalled memory tokens by aligning their statistics to a trusted context distribution, reducing the pollution caused by drifted historical features. Built on Self-Forcing, TetherCache consistently improves long-video generation quality on VBench-Long across 30s, 60s, and 240s settings. In particular, for 240s generation, it substantially improves overall and semantic scores while reducing quality drift from 7.84 to 1.33, demonstrating its effectiveness for stable long-horizon autoregressive video diffusion.
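The two mechanisms described in the abstract can be illustrated with a minimal sketch. The abstract does not give the actual scoring or editing formulas, so everything below is an assumption: the function names `grab_select` and `tame_align`, the `gate` and `strength` parameters, the greedy selection loop, and the use of simple mean/standard-deviation matching are all hypothetical stand-ins, not the authors' implementation. Relevance scores are taken as given inputs rather than computed from real attention maps.

```python
from statistics import fmean, pstdev


def grab_select(relevance, frame_ids, budget, gate=0.5):
    """Greedily pick `budget` memory frames by a gated score (illustrative only).

    relevance[i]: assumed attention mass on candidate frame i.
    frame_ids[i]: temporal index of candidate frame i.
    gate: hypothetical weight in [0, 1]; 1.0 = relevance only, 0.0 = diversity only.
    """
    span = max(frame_ids) - min(frame_ids) or 1
    top = max(relevance) or 1.0
    chosen = []
    remaining = list(range(len(frame_ids)))

    def score(i):
        rel = relevance[i] / top
        # Diversity term: normalized temporal distance to the nearest chosen frame.
        nearest = min((abs(frame_ids[i] - frame_ids[j]) for j in chosen), default=span)
        return gate * rel + (1 - gate) * nearest / span

    while remaining and len(chosen) < budget:
        best = max(remaining, key=score)
        chosen.append(best)
        remaining.remove(best)
    return sorted(frame_ids[i] for i in chosen)


def tame_align(recalled, trusted, strength=0.3, eps=1e-6):
    """Nudge one feature channel of recalled tokens toward trusted statistics.

    Assumes "aligning statistics" means mean/std matching; `strength` is a
    hypothetical blend factor that keeps the edit light (0.0 = no edit).
    """
    mu_r, sd_r = fmean(recalled), pstdev(recalled)
    mu_t, sd_t = fmean(trusted), pstdev(trusted)
    aligned = [(x - mu_r) / (sd_r + eps) * sd_t + mu_t for x in recalled]
    return [(1 - strength) * x + strength * a for x, a in zip(recalled, aligned)]


# Toy data: three highly relevant but adjacent frames, plus two distant ones.
relevance = [0.9, 0.85, 0.8, 0.1, 0.2]
frame_ids = [10, 11, 12, 40, 80]
relevance_only = grab_select(relevance, frame_ids, budget=3, gate=1.0)
balanced = grab_select(relevance, frame_ids, budget=3, gate=0.5)
edited = tame_align([2.0, 4.0, 6.0], [0.0, 1.0, 2.0], strength=0.5)
```

In this toy setup, selecting purely by relevance keeps three near-duplicate neighbouring frames, while the balanced gate trades one of them for a temporally distant frame. This is the kind of behaviour the abstract attributes to combining attention-based relevance with temporal diversity. A small `strength` keeps the statistic alignment a light edit rather than a full replacement of the recalled features.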

Computer Vision; Inference & Quantization

Citation Metrics

Citations: 0

Influential citations: 0

References: 37

Year: 2026

Venue: N/A
