This paper introduces a new grounded benchmark for physical video understanding that evaluates models on their ability to localize events in time and space across diverse video sources and physics domains. The benchmark extends the V-STaR evaluation structure and includes four video sources (SSV2, YouCook2, HoloAssist, Roundabout-TAU), six physics domains, three prompt families, and four input conditions. Results show that physics-based prompts are the strongest overall, spatial grounding is the weakest, and prompt-family robustness is selective, highlighting the need for physically grounded, perturbation-aware diagnostics in video Q&A evaluation.
Current video Q&A benchmarks can be gamed through textual regularities, letting models answer correctly without grounding their reasoning in the video's physical reality.
Physical video understanding requires more than naming an event correctly. A model can answer a question about pouring, sliding, or collision from textual regularities while still failing to localize the event in time or space. We introduce a grounded benchmark for physical video understanding that extends the what-when-where evaluation structure of V-STaR to four video sources, six physics domains, three prompt families (physics, vstar_like, and neutral_rstr), and four input conditions (original, shuffled, ablated, and frame-masked). The benchmark contains 1,560 base video clips from SSV2, YouCook2, HoloAssist, and Roundabout-TAU. Each clip is first converted into a shared grounded event record, and the three prompt families are derived from that record. Temporal and spatial targets are shared across prompt families, while the non-physics families use deterministic, family-appropriate semantic a_what targets derived from the same record. Across models and prompt families, physics remains the strongest regime overall, vstar_like is the clearest non-physics semantic comparison, and neutral_rstr behaves as a harder templated control. Prompt-family robustness is selective rather than universal, perturbation gains cluster in weak original cases, and spatial grounding is the weakest capability across settings. These results suggest that video Q&A reasoning benchmarks should report physically grounded, prompt-aware, and perturbation-aware diagnostics alongside aggregate accuracy.
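To make the derivation concrete, here is a minimal Python sketch of how a single grounded event record could yield the three prompt families with shared when/where targets. The record schema, field names, and the rewrite templates for the non-physics families are illustrative assumptions, not the benchmark's actual format.

```python
from dataclasses import dataclass

# Hypothetical schema: field names are assumptions for illustration only.
@dataclass
class GroundedEventRecord:
    clip_id: str
    source: str     # e.g. "SSV2", "YouCook2", "HoloAssist", "Roundabout-TAU"
    domain: str     # one of the six physics domains
    event: str      # short physical event description
    t_start: float  # temporal target in seconds, shared by all prompt families
    t_end: float
    bbox: tuple     # spatial target (x, y, w, h), shared by all prompt families

def derive_queries(rec: GroundedEventRecord) -> dict:
    """Derive the three prompt families from one shared grounded record.

    The temporal and spatial targets are identical across families; only the
    semantic a_what target is rewritten deterministically for the two
    non-physics families (the rewrites below are placeholder templates).
    """
    shared = {"a_when": (rec.t_start, rec.t_end), "a_where": rec.bbox}
    return {
        "physics":      {"a_what": rec.event, **shared},
        "vstar_like":   {"a_what": f"The main action shown is: {rec.event}.", **shared},
        "neutral_rstr": {"a_what": f"Event: {rec.event}", **shared},
    }

if __name__ == "__main__":
    rec = GroundedEventRecord(
        clip_id="ssv2_000123", source="SSV2", domain="pouring",
        event="liquid pours from the tilted cup into the bowl",
        t_start=1.2, t_end=3.8, bbox=(0.31, 0.22, 0.18, 0.40),
    )
    for family, query in derive_queries(rec).items():
        print(family, query)
```

The point of the sketch is the invariant it encodes: because all three families inherit the same a_when and a_where targets from one record, differences in temporal and spatial grounding scores across families can be attributed to the prompt wording rather than to different annotation targets.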