This paper introduces UniTemp, a bidirectional distillation framework that trains a single autoregressive student model to generate video in any temporal direction. Existing autoregressive video diffusion models are restricted to forward generation, and the Causal 3D VAE they commonly rely on encodes latents conditioned only on past context, which causes inter-block discontinuities when generation proceeds backward. UniTemp introduces blockwise anchor latents, auxiliary latents that restore the missing past context at block boundaries, so the model can condition on past frames, future frames, or both. It remains competitive with forward-only methods on short and long video generation while improving controllability and supporting a wider range of workflows.
By supporting video generation in any temporal direction, UniTemp enables workflows such as bidirectional video extension and visual story generation while remaining competitive with forward-only methods.
Autoregressive video diffusion models have emerged as a promising approach for long video generation, achieving strong performance in streaming settings. However, existing methods are restricted to forward temporal generation, whereas practical video creation often requires flexible generation order, e.g., conditioning on future context to extend backward, or on both past and future context for inbetween generation. We bridge this gap by training an autoregressive model that supports generation in arbitrary temporal directions. A key technical challenge arises from the Causal 3D VAE widely used in video diffusion models, which encodes latents strictly conditioned on past context. While suited for forward generation, this causal structure causes inter-block discontinuities when generation proceeds backward. To address this, we introduce blockwise anchor latents, a set of auxiliary latents that restore the missing past context at block boundaries during backward generation. Built on this design, we propose UniTemp, a bidirectional distillation framework that trains a single autoregressive student model for any-direction video generation. At inference time, UniTemp conditions on arbitrary past and/or future frames, improving controllability for both bidirectional and inbetween generation. Experiments show that UniTemp maintains competitive performance on short and long video generation compared to forward-only methods, while enabling diverse workflows such as bidirectional video extension, inbetween generation, looping video generation, scene transition, and visual story generation. Project website: https://lzhangbj.github.io/projects/unitemp/
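The boundary problem described in the abstract can be illustrated with a toy example. The sketch below is not UniTemp's method and does not model a real Causal 3D VAE. It assumes a one-dimensional causal temporal filter over scalar "frames", and the names `causal_conv`, `video`, and `kernel` are hypothetical. It only shows why a causal operator produces a mismatch at a block boundary when the preceding block is unavailable, and how supplying the missing past context removes that mismatch.

```python
def causal_conv(frames, kernel, context=None):
    """Toy causal temporal filter: output[t] depends only on frames up to t.

    `context` holds frames that precede `frames` in time. When it is absent,
    the missing past is zero-padded, as happens at the start of a sequence.
    """
    k = len(kernel)
    ctx = list(context) if context is not None else []
    # Keep at most the last k-1 context frames, then zero-pad any shortfall.
    past = ctx[max(0, len(ctx) - (k - 1)):]
    past = [0.0] * (k - 1 - len(past)) + past
    padded = past + list(frames)
    return [
        sum(kernel[j] * padded[t + j] for j in range(k))
        for t in range(len(frames))
    ]


# Hypothetical six-frame "video" split into two blocks of three frames.
video = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
kernel = [0.25, 0.25, 0.5]
earlier_block, later_block = video[:3], video[3:]

# Reference: process the whole sequence in forward order.
full = causal_conv(video, kernel)

# Backward order: the later block is processed before the earlier block
# exists, so its past context is missing and the boundary outputs change.
alone = causal_conv(later_block, kernel)

# Supplying the missing past context restores agreement with the reference.
anchored = causal_conv(later_block, kernel, context=earlier_block)

print("reference:", full[3:])
print("no past  :", alone)
print("with past:", anchored)
```

In this toy setting only the first `len(kernel) - 1` outputs of the later block differ, which is why the discontinuity shows up at block boundaries. How UniTemp's blockwise anchor latents represent and supply that context is not specified in the abstract, so the `context` argument here is only an analogy.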