The paper introduces CRONOS, a benchmark for evaluating counterfactual physical consistency in video prediction models by assessing their response to controlled visual input changes. CRONOS uses a photorealistic Unreal Engine environment to generate videos with interventions on viewpoint, scene, object category, and object appearance, while keeping the underlying physical event constant. Experiments on recent video generators reveal significant failures in maintaining prediction quality across these interventions, highlighting a lack of true causal understanding.
Video models stumble when the camera angle changes, revealing they're often just memorizing visuals, not grasping physics.
Video prediction is increasingly viewed as a path toward generalizable world models, yet it remains unclear whether these systems learn underlying causal structure or merely exploit superficial visual correlations for future prediction. We introduce CRONOS, an intervention-based benchmark designed to evaluate counterfactual physical consistency: whether a model's predictions of physical events respond appropriately to controlled changes in the visual input, such as variations of scene context, viewpoint, object appearance, and object category. Built in a photorealistic Unreal Engine environment, CRONOS enables controlled, high-fidelity generation of videos across diverse scenes and dynamics. In contrast to previous benchmarks, CRONOS systematically intervenes on four key factors (viewpoint, scene, object category, and object appearance) while keeping the underlying physical event type, such as a collision, occlusion, or fall, fixed. Our evaluation of recent open-source video generators reveals substantial failures in counterfactual physical consistency: prediction quality for the same physical event type is affected by changes in appearance, environment, and, in particular, viewpoint. CRONOS provides a controlled and reproducible testbed for diagnosing how the quality of generated videos changes under different interventions, establishing a concrete target for developing models that perform consistently when multiple conditions change. The dataset and code are available at our project page.
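As a rough illustration of the evaluation idea described above, the sketch below groups per-video quality scores by physical event type and intervention, then reports how far each intervention's mean score falls below the unmodified baseline for the same event. This is not the CRONOS metric or code: the record layout, the `baseline` label, the `consistency_gaps` function, and all scores are hypothetical and exist only to make the "same event, different visual condition" comparison concrete.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical records: (event_type, intervention, quality_score in [0, 1]).
# "baseline" marks the unmodified video; all scores are made-up illustration values.
records = [
    ("collision", "baseline", 0.80),
    ("collision", "viewpoint", 0.50),
    ("collision", "scene", 0.70),
    ("fall", "baseline", 0.90),
    ("fall", "viewpoint", 0.60),
    ("fall", "object_appearance", 0.85),
]


def consistency_gaps(records):
    """Return the mean quality drop relative to baseline per (event, intervention)."""
    scores = defaultdict(list)
    for event, intervention, score in records:
        scores[(event, intervention)].append(score)

    gaps = {}
    for (event, intervention), values in scores.items():
        if intervention == "baseline":
            continue
        base = scores.get((event, "baseline"))
        if not base:
            continue  # no baseline for this event type, so no gap can be computed
        gaps[(event, intervention)] = mean(base) - mean(values)
    return gaps


gaps = consistency_gaps(records)
for (event, intervention), gap in sorted(gaps.items()):
    print(f"{event:10s} {intervention:18s} drop = {gap:.2f}")
```

Under this toy definition, a model with perfect counterfactual consistency would show a gap near zero for every intervention; a large gap for `viewpoint` alone would correspond to the viewpoint sensitivity the abstract reports.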