Stanford HAIFeb 27, 2026arXiv:2602.24289

Mode Seeking meets Mean Seeking for Fast Long Video Generation

Shengqu Cai, Weili Nie, Chao Liu, Lvmin Zhang, Nanye Ma, Hansheng Chen, Maneesh Agrawala, Leonidas Guibas, Gordon Wetzstein, Arash Vahdat

AI Summary

The paper introduces a training paradigm called "Mode Seeking meets Mean Seeking" (MMM) to address the challenge of generating long, coherent videos given limited long-video data. MMM decouples local fidelity from long-term coherence using a Decoupled Diffusion Transformer with two heads: a global Flow Matching head for long-range structure and a local Distribution Matching head for realism via reverse-KL divergence to a frozen short-video teacher. The method achieves minute-scale video generation with improved local sharpness, motion, and long-range consistency by learning long-range coherence from limited long videos and inheriting local realism from short videos.

Key Contribution

Generate minute-long videos with compelling narrative structure and local realism, even with limited long-form training data, by cleverly combining supervised flow matching for global coherence with mode-seeking alignment to a short-video teacher for local fidelity.

Abstract

Scaling video generation from seconds to minutes faces a critical bottleneck: while short-video data is abundant and high-fidelity, coherent long-form data is scarce and limited to narrow domains. To address this, we propose a training paradigm where Mode Seeking meets Mean Seeking, decoupling local fidelity from long-term coherence based on a unified representation via a Decoupled Diffusion Transformer. Our approach utilizes a global Flow Matching head trained via supervised learning on long videos to capture narrative structure, while simultaneously employing a local Distribution Matching head that aligns sliding windows to a frozen short-video teacher via a mode-seeking reverse-KL divergence. This strategy enables the synthesis of minute-scale videos that learns long-range coherence and motions from limited long videos via supervised flow matching, while inheriting local realism by aligning every sliding-window segment of the student to a frozen short-video teacher, resulting in a few-step fast long video generator. Evaluations show that our method effectively closes the fidelity-horizon gap by jointly improving local sharpness, motion and long-range consistency. Project website: https://primecai.github.io/mmm/.

Architecture Design (Transformers, SSMs, MoE)Computer Vision Multimodal Models

Citation Metrics

Citations0

Influential citations0

References0

Year2026

VenueN/A

Related Papers

Finding related papers...

Search

Mode Seeking meets Mean Seeking for Fast Long Video Generation

Related Papers