The paper introduces HOCA-Bench, a benchmark that evaluates Video-LLMs' predictive world modeling by testing whether they can identify ontological and causal anomalies in videos. The benchmark comprises 1,439 videos with 3,470 QA pairs, generated using generative video models as adversarial simulators. Evaluation of 17 Video-LLMs shows performance dropping by more than 20% on causal anomaly detection relative to ontological anomaly detection, pointing to a weakness in applying physical laws during reasoning.
Video-LLMs can spot a shape-shifting object but still fail at basic physics, a gap in predictive world modeling that HOCA-Bench exposes.
Video-LLMs have improved steadily on semantic perception, but they still fall short on predictive world modeling, which is central to physically grounded intelligence. We introduce HOCA-Bench, a benchmark that frames physical anomalies through a Hegelian lens. HOCA-Bench separates anomalies into two types: ontological anomalies, where an entity violates its own definition or persistence, and causal anomalies, where interactions violate physical relations. Using state-of-the-art generative video models as adversarial simulators, we build a testbed of 1,439 videos (3,470 QA pairs). Evaluations on 17 Video-LLMs show a clear cognitive lag: models often identify static ontological violations (e.g., shape mutations) but struggle with causal mechanisms (e.g., gravity or friction), with performance dropping by more than 20% on causal tasks. System-2 "Thinking" modes improve reasoning, but they do not close the gap, suggesting that current architectures recognize visual patterns more readily than they apply basic physical laws.
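The ontological-versus-causal comparison can be illustrated with a minimal sketch of per-category scoring. Everything here is an assumption for illustration: the record format, the category labels, the function names `accuracy_by_category` and `causal_gap`, and the toy data are hypothetical and are not taken from the benchmark's released code or results. The abstract also does not say whether the "more than 20%" drop is absolute or relative; this sketch reports the gap in percentage points.

```python
from collections import defaultdict


def accuracy_by_category(records):
    """Return {category: accuracy} for an iterable of (category, is_correct) pairs."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for category, is_correct in records:
        totals[category] += 1
        correct[category] += int(is_correct)
    return {category: correct[category] / totals[category] for category in totals}


def causal_gap(acc):
    """Ontological accuracy minus causal accuracy, in percentage points."""
    return 100.0 * (acc["ontological"] - acc["causal"])


# Made-up toy records, NOT benchmark results: (anomaly category, model answered correctly?)
toy_records = [
    ("ontological", True), ("ontological", True),
    ("ontological", True), ("ontological", False),
    ("causal", True), ("causal", False),
    ("causal", False), ("causal", False),
]

toy_acc = accuracy_by_category(toy_records)
toy_gap = causal_gap(toy_acc)
print(toy_acc, toy_gap)
```

Scoring each category separately, rather than pooling all QA pairs, is what makes a lag on causal questions visible; a single aggregate accuracy would average it away.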