This paper reviews video generation models through the lens of efficiency, arguing that efficiency is crucial for these models to serve as practical world simulators. It introduces a taxonomy of efficient video generation techniques, categorizing them by modeling paradigms, network architectures, and inference algorithms. The review highlights the importance of efficiency for enabling interactive applications like autonomous driving and embodied AI.
Efficiency is the key bottleneck preventing video generation models from becoming general-purpose world simulators, and this paper provides a taxonomy of techniques to overcome it.
The rapid evolution of video generation has enabled models to simulate complex physical dynamics and long-horizon causalities, positioning them as potential world simulators. However, a critical gap remains between this theoretical capacity for world simulation and the heavy computational cost of spatiotemporal modeling. To address this gap, we comprehensively and systematically review video generation frameworks and techniques that treat efficiency as a crucial requirement for practical world modeling. We introduce a novel taxonomy along three dimensions: efficient modeling paradigms, efficient network architectures, and efficient inference algorithms. We further show that bridging this efficiency gap directly empowers interactive applications such as autonomous driving, embodied AI, and game simulation. Finally, we identify emerging research frontiers in efficient video-based world modeling, arguing that efficiency is a fundamental prerequisite for evolving video generators into general-purpose, real-time, and robust world simulators.