This paper adapts VACE, a video diffusion model with unified control, for real-time autoregressive video generation by moving reference frames from the diffusion latent space to a parallel conditioning pathway. The original VACE relies on bidirectional attention over full sequences; the modification preserves the fixed chunk sizes and KV caching that autoregressive models require. The adapted VACE keeps structural control and inpainting with a 20-30% latency overhead and negligible VRAM cost relative to the base model, but reference-to-video fidelity is severely degraded compared to batch VACE because of causal attention.
Real-time autoregressive video diffusion with VACE is now possible, but at the cost of significantly reduced reference fidelity.
We describe an adaptation of VACE (Video All-in-one Creation and Editing) for real-time autoregressive video generation. VACE provides unified video control (reference guidance, structural conditioning, inpainting, and temporal extension) but assumes bidirectional attention over full sequences, making it incompatible with streaming pipelines that require fixed chunk sizes and causal attention. The key modification moves reference frames from the diffusion latent space into a parallel conditioning pathway, preserving the fixed chunk sizes and KV caching that autoregressive models require. This adaptation reuses existing pretrained VACE weights without additional training. Across 1.3B and 14B model scales, VACE adds 20-30% latency overhead for structural control and inpainting, with negligible VRAM cost relative to the base model. Reference-to-video fidelity is severely degraded compared to batch VACE due to causal attention constraints. A reference implementation is available at https://github.com/daydreamlive/scope.
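The effect of the key modification on sequence length can be illustrated with a toy, shape-only sketch. It is not taken from the reference implementation: the names `in_sequence_length`, `side_pathway_length`, and `ToyStream` are hypothetical, frames are represented as plain strings, and the "KV cache" is just a frame counter. The sketch assumes that, in the batch-style layout, reference frames occupy positions in the latent sequence, whereas in the adapted layout they are held in a separate conditioning store.

```python
from dataclasses import dataclass, field
from typing import List


def in_sequence_length(video_frames: int, reference_frames: int) -> int:
    # Batch-style layout (assumed): references share the latent sequence
    # with the video, so the length changes with the number of references.
    return reference_frames + video_frames


def side_pathway_length(video_frames: int, reference_frames: int) -> int:
    # Adapted layout (assumed): references sit outside the latent sequence,
    # so the length depends only on the chunk size.
    return video_frames


@dataclass
class ToyStream:
    chunk_frames: int                      # fixed number of frames per chunk
    cached_frames: int = 0                 # stand-in for the KV cache length
    references: List[str] = field(default_factory=list)

    def set_references(self, refs: List[str]) -> None:
        # Stored in the side pathway; the cache counter is not touched.
        self.references = list(refs)

    def step(self, chunk: List[str]) -> int:
        # Reject chunks of the wrong size so every step has the same shape.
        if len(chunk) != self.chunk_frames:
            raise ValueError("chunk size must stay fixed")
        self.cached_frames += len(chunk)
        return side_pathway_length(len(chunk), len(self.references))


stream = ToyStream(chunk_frames=3)
stream.set_references(["ref_a", "ref_b"])
lengths = [
    stream.step(["f0", "f1", "f2"]),
    stream.step(["f3", "f4", "f5"]),
]
```

In this toy model, adding or removing references changes `in_sequence_length` but leaves every `step` at the same length, which is the property a cache built from fixed-size chunks relies on. How the real conditioning pathway injects reference information into the model is not modeled here.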