The paper introduces Grounded Forcing, a framework for autoregressive video synthesis that tackles semantic forgetting, visual drift, and controllability loss by bridging time-independent semantics and proximal dynamics. It employs a Dual Memory KV Cache to decouple local temporal dynamics from global semantic anchors, Dual-Reference RoPE Injection to confine positional embeddings, and Asymmetric Proximity Recache for smooth semantic inheritance during prompt transitions. Experiments show Grounded Forcing significantly improves long-range consistency and visual stability in long-form video generation.
Achieve stable, controllable, and semantically consistent long-form video generation by decoupling local dynamics from global semantic anchors.
Autoregressive video synthesis offers a promising pathway for infinite-horizon generation but is fundamentally hindered by three intertwined challenges: semantic forgetting from context limitations, visual drift due to positional extrapolation, and controllability loss during interactive instruction switching. Current methods often tackle these issues in isolation, limiting long-term coherence. We introduce Grounded Forcing, a novel framework that bridges time-independent semantics and proximal dynamics through three interlocking mechanisms. First, to address semantic forgetting, we propose a Dual Memory KV Cache that decouples local temporal dynamics from global semantic anchors, ensuring long-term semantic coherence and identity stability. Second, to suppress visual drift, we design Dual-Reference RoPE Injection, which confines positional embeddings within the training manifold while rendering global semantics time-invariant. Third, to resolve controllability issues, we develop Asymmetric Proximity Recache, which facilitates smooth semantic inheritance during prompt transitions via proximity-weighted cache updates. These components operate synergistically to tether the generative process to stable semantic cores while accommodating flexible local dynamics. Extensive experiments demonstrate that Grounded Forcing significantly enhances long-range consistency and visual stability, establishing a robust foundation for interactive long-form video synthesis.
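The abstract does not specify an implementation, but the core idea of the Dual Memory KV Cache and Asymmetric Proximity Recache can be sketched as follows. This is a minimal illustrative sketch under stated assumptions, not the paper's actual code: the class name, the representation of a KV pair as a numeric tuple, and the linear blending used for the proximity-weighted update are all assumptions introduced here.

```python
from collections import deque


class DualMemoryKVCache:
    """Illustrative sketch (not the paper's implementation): a KV cache
    split into a fixed set of global semantic anchors, which never rotate
    out, and a sliding window of local entries for proximal dynamics."""

    def __init__(self, max_local: int, anchor_slots: int):
        self.anchors = [None] * anchor_slots   # global semantic anchors (time-invariant)
        self.local = deque(maxlen=max_local)   # recent frames' KV pairs (local dynamics)

    def append_local(self, kv):
        """Push the newest frame's KV pair; the oldest is evicted automatically."""
        self.local.append(kv)

    def set_anchor(self, slot: int, kv):
        """Pin a KV pair as a global anchor; anchors are exempt from eviction."""
        self.anchors[slot] = kv

    def recache(self, slot: int, new_kv, alpha: float):
        """Hypothetical proximity-weighted update on a prompt switch: blend the
        old anchor toward the new prompt's KV with weight alpha, so the new
        semantics inherit smoothly rather than replacing the anchor outright."""
        old = self.anchors[slot]
        if old is None:
            self.anchors[slot] = new_kv
        else:
            self.anchors[slot] = tuple((1 - alpha) * o + alpha * n
                                       for o, n in zip(old, new_kv))

    def context(self):
        """Attention context: anchors first, then the local window in
        temporal order."""
        return [kv for kv in self.anchors if kv is not None] + list(self.local)
```

In this sketch the local window models the sliding context of an autoregressive generator, while anchors persist indefinitely; confining the anchors' positional embeddings to fixed reference positions (the role of Dual-Reference RoPE Injection in the paper) would happen wherever the context is consumed by attention, which is outside the scope of this illustration.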