The paper introduces a training paradigm called "Mode Seeking meets Mean Seeking" (MMM) to address the challenge of generating long, coherent videos given limited long-video data. MMM decouples local fidelity from long-term coherence using a Decoupled Diffusion Transformer with two heads: a global Flow Matching head for long-range structure and a local Distribution Matching head for realism via reverse-KL divergence to a frozen short-video teacher. The method achieves minute-scale video generation with improved local sharpness, motion, and long-range consistency by learning long-range coherence from limited long videos and inheriting local realism from short videos.
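To make the "mode seeking" versus "mean seeking" contrast concrete, here is a minimal toy sketch that is not taken from the paper. It fits a single unit-variance Gaussian to a bimodal density by grid search, once under forward KL (mean-seeking, the behavior associated with likelihood-style or regression objectives) and once under reverse KL (mode-seeking, the divergence the summary attributes to the local head). The mixture, the grid, and the helper names `gaussian_pdf` and `kl` are illustrative assumptions.

```python
import numpy as np

def gaussian_pdf(x, mean, std=1.0):
    """Density of N(mean, std^2) evaluated at x."""
    return np.exp(-0.5 * ((x - mean) / std) ** 2) / (std * np.sqrt(2.0 * np.pi))

# Toy bimodal "data" density p: equal mixture of N(-3, 1) and N(3, 1).
# (Assumed for illustration only; unrelated to any video distribution.)
x = np.linspace(-12.0, 12.0, 4801)
dx = x[1] - x[0]
p = 0.5 * gaussian_pdf(x, -3.0) + 0.5 * gaussian_pdf(x, 3.0)

def kl(a, b):
    """Riemann-sum estimate of KL(a || b) for densities sampled on the grid x."""
    return float(np.sum(a * (np.log(a) - np.log(b))) * dx)

# Fit a unit-variance Gaussian q by grid search over its mean.
candidate_means = np.linspace(-5.0, 5.0, 101)
forward = [kl(p, gaussian_pdf(x, m)) for m in candidate_means]  # KL(p || q)
reverse = [kl(gaussian_pdf(x, m), p) for m in candidate_means]  # KL(q || p)

mean_seeking = candidate_means[int(np.argmin(forward))]
mode_seeking = candidate_means[int(np.argmin(reverse))]
print(f"forward KL picks mean {mean_seeking:+.1f}; reverse KL picks mean {mode_seeking:+.1f}")
```

In this toy setting the forward-KL fit sits between the two modes, where the data density is low, while the reverse-KL fit commits to one mode. That is the usual intuition for why a mode-seeking objective tends to favor sharp, realistic samples over averaged ones.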
Generate minute-long videos with coherent narrative structure and local realism, even with limited long-form training data, by combining supervised flow matching for global coherence with mode-seeking alignment to a short-video teacher for local fidelity.
Scaling video generation from seconds to minutes faces a critical bottleneck: while short-video data is abundant and high-fidelity, coherent long-form data is scarce and limited to narrow domains. To address this, we propose a training paradigm where Mode Seeking meets Mean Seeking, decoupling local fidelity from long-term coherence based on a unified representation via a Decoupled Diffusion Transformer. Our approach utilizes a global Flow Matching head trained via supervised learning on long videos to capture narrative structure, while simultaneously employing a local Distribution Matching head that aligns sliding windows to a frozen short-video teacher via a mode-seeking reverse-KL divergence. This strategy enables the synthesis of minute-scale videos: the student learns long-range coherence and motion from limited long videos via supervised flow matching, while inheriting local realism by aligning each of its sliding-window segments to a frozen short-video teacher. The result is a fast, few-step long-video generator. Evaluations show that our method effectively closes the fidelity-horizon gap by jointly improving local sharpness, motion, and long-range consistency. Project website: https://primecai.github.io/mmm/.
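The two data-facing ingredients named in the abstract can be sketched in a few lines of NumPy. This is a rough illustration under stated assumptions, not the authors' implementation: the function names `sliding_windows` and `flow_matching_loss` are hypothetical, the window length and stride are arbitrary, and the straight-line noise-to-data path is one common flow matching choice that the abstract does not confirm. The reverse-KL term is omitted because it requires the frozen teacher.

```python
import numpy as np

def sliding_windows(frames, window, stride):
    """Return every length-`window` segment of `frames` along the time axis (axis 0)."""
    starts = range(0, frames.shape[0] - window + 1, stride)
    return np.stack([frames[s:s + window] for s in starts])

def flow_matching_loss(velocity_fn, data, rng):
    """Supervised flow matching loss on a straight noise-to-data path (assumed path)."""
    noise = rng.standard_normal(data.shape)
    t = rng.uniform()                    # one scalar time in [0, 1)
    x_t = (1.0 - t) * noise + t * data   # point on the straight path at time t
    target = data - noise                # velocity of that path, constant in t
    return float(np.mean((velocity_fn(x_t, t) - target) ** 2))

# Toy "long video": 16 frames, each an 8-dimensional latent (shapes are illustrative).
rng = np.random.default_rng(0)
long_video = rng.standard_normal((16, 8))

# Short segments that a local head could compare against a short-video teacher.
segments = sliding_windows(long_video, window=4, stride=2)
print(segments.shape)  # (number of windows, window length, latent dimension)

# Placeholder velocity model that always predicts zero; a real model would be a network.
loss = flow_matching_loss(lambda x_t, t: np.zeros_like(x_t), long_video, rng)
print(loss)
```

In a setup like the one described, the global loss would be computed on whole long videos, while each window from `sliding_windows` would be scored against the frozen short-video teacher.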