This paper introduces MBench, a novel benchmark designed to assess the memory capabilities of video world models, addressing a significant gap in existing evaluations that prioritize visual quality over long-term consistency. By decomposing memory into three core dimensions (entity consistency, environment consistency, and causal consistency) along with 12 sub-dimensions, the authors provide a comprehensive framework for evaluating long-term memory retention in video generation. Evaluations of state-of-the-art models reveal substantial limitations in their ability to maintain stable internal states, highlighting the need for improved methodologies in the field.
Existing video world models struggle with long-term memory retention, and MBench exposes their critical limitations while providing a structured path for future improvements.
Recent advancements in video-based world models have demonstrated an unprecedented ability to synthesize high-fidelity visual sequences. However, a fundamental gap persists between visually plausible video generation and the functional requirements of a world model, particularly in maintaining a stable and reasonable internal state over extended temporal horizons. While existing benchmarks primarily emphasize visual quality, motion coherence, and text-video alignment, they largely overlook memory, the core capability of a world model to preserve consistency across long-term horizons and complex interactions. To address this gap, we present MBench, a comprehensive benchmark dedicated to quantifying and evaluating the memory capability of video world models. We systematically decompose the memory capability of video world models into three hierarchical and complementary core dimensions: entity consistency, environment consistency, and causal consistency, which are further refined into 12 quantifiable sub-dimensions for comprehensive characterization of long-term memory. Our benchmark is built upon rigorously curated real-captured long videos and is evaluated using rule-based quantitative metrics and a vision-language model (VLM) to enable objective and comprehensive consistency assessment. Extensive evaluations of mainstream state-of-the-art video world models reveal critical systemic limitations of existing methods in long-term state retention, providing a standardized benchmark and clear research direction to advance the field.