This paper introduces Spatial Causal Prediction (SCP), a new task paradigm that evaluates a model's ability to infer unseen past or future spatial states in videos, going beyond understanding of what is visible. To support this, the authors built SCP-Bench, a benchmark of 2,500 QA pairs across 1,181 videos. Experiments on 23 state-of-the-art models revealed substantial gaps between model and human performance, along with limited temporal extrapolation and weak causal grounding.
Current video models struggle to infer unseen spatial states and causal relationships, falling far short of human-level spatial reasoning.
Spatial reasoning, the ability to understand spatial relations, causality, and dynamic evolution, is central to human intelligence and essential for real-world applications such as autonomous driving and robotics. Existing studies, however, primarily assess models on visible spatio-temporal understanding, overlooking their ability to infer unseen past or future spatial states. In this work, we introduce Spatial Causal Prediction (SCP), a new task paradigm that challenges models to reason beyond observation and predict spatial causal outcomes. We further construct SCP-Bench, a benchmark comprising 2,500 QA pairs across 1,181 videos spanning diverse viewpoints, scenes, and causal directions, to support systematic evaluation. Through comprehensive experiments on 23 state-of-the-art models, we reveal substantial gaps between human and model performance, limited temporal extrapolation, and weak causal grounding. We further analyze key factors influencing performance and propose perception-enhancement and reasoning-guided strategies toward advancing spatial causal intelligence. The project page is https://guangstrip.github.io/SCP-Bench.