Mar 11, 2026 · arXiv:2603.10652

Are Video Reasoning Models Ready to Go Outside?

AI Summary

The paper introduces ROVA, a training framework that makes video reasoning models more robust to real-world disturbances such as weather and occlusion by modeling a robustness-aware consistency reward under spatio-temporal corruptions. ROVA uses a difficulty-aware online training strategy that prioritizes informative samples based on the model's evolving capability, continuously re-estimating sample difficulty through self-reflective evaluation. Evaluated on a new benchmark, PVRBench, as well as UrbanVideo and VisBench, ROVA improves relative accuracy by at least 24% and reasoning by over 9% compared with baseline models under realistic perturbations.

Key Contribution

Video reasoning models can suffer up to a 35% drop in accuracy and 28% in reasoning quality under real-world perturbations, but a new training framework, ROVA, mitigates this by adaptively prioritizing informative samples.

Abstract

In real-world deployment, vision-language models often encounter disturbances such as weather, occlusion, and camera motion. Under such conditions, their understanding and reasoning degrade substantially, revealing a gap between clean, controlled (i.e., unperturbed) evaluation settings and real-world robustness. To address this limitation, we propose ROVA, a novel training framework that improves robustness by modeling a robustness-aware consistency reward under spatio-temporal corruptions. ROVA introduces a difficulty-aware online training strategy that prioritizes informative samples based on the model's evolving capability. Specifically, it continuously re-estimates sample difficulty via self-reflective evaluation, enabling adaptive training with a robustness-aware consistency reward. We also introduce PVRBench, a new benchmark that injects real-world perturbations into embodied video datasets to assess both accuracy and reasoning quality under realistic disturbances. We evaluate ROVA and baselines on PVRBench, UrbanVideo, and VisBench, where open-source and proprietary models suffer drops of up to 35% in accuracy and 28% in reasoning under realistic perturbations. ROVA effectively mitigates this degradation, boosting relative accuracy by at least 24% and reasoning by over 9% compared with baseline models (QWen2.5/3-VL, InternVL2.5, Embodied-R). These gains transfer to clean standard benchmarks, yielding consistent improvements.
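The abstract does not spell out how difficulty is re-estimated or how samples are prioritized, so the following is only a toy sketch of the general idea, not ROVA's actual method. Every name here (`consistency_reward`, `update_difficulty`, `sampling_weight`, `pick_batch`), the exact-match reward, the moving-average update, and the "prefer intermediate difficulty" weighting are assumptions made for illustration.

```python
import random


def consistency_reward(clean_answer: str, perturbed_answer: str) -> float:
    """Toy reward (assumption): 1.0 if the answer survives the perturbation, else 0.0."""
    return 1.0 if clean_answer == perturbed_answer else 0.0


def update_difficulty(old: float, reward: float, momentum: float = 0.9) -> float:
    """Re-estimate difficulty as a moving average of the failure signal (1 - reward)."""
    return momentum * old + (1.0 - momentum) * (1.0 - reward)


def sampling_weight(difficulty: float) -> float:
    """Hypothetical weighting: favor samples that are neither trivial nor hopeless."""
    return difficulty * (1.0 - difficulty) + 1e-6


def pick_batch(difficulties: dict[str, float], k: int, rng: random.Random) -> list[str]:
    """Draw k sample ids with probability proportional to their sampling weight."""
    ids = list(difficulties)
    weights = [sampling_weight(difficulties[i]) for i in ids]
    return rng.choices(ids, weights=weights, k=k)


if __name__ == "__main__":
    rng = random.Random(0)
    # Hypothetical per-video difficulty estimates in [0, 1].
    difficulties = {"video_a": 0.05, "video_b": 0.50, "video_c": 0.95}
    batch = pick_batch(difficulties, k=4, rng=rng)
    for vid in batch:
        # Stand-in for running the model on clean and perturbed versions of the clip.
        reward = consistency_reward("left", "left")
        difficulties[vid] = update_difficulty(difficulties[vid], reward)
    print(batch, difficulties)
```

In this sketch a sample's difficulty drifts toward 0 as the model keeps answering consistently under perturbation, so its sampling weight shrinks and training attention shifts to samples the model still finds informative.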

Eval Frameworks & Benchmarks · Multimodal Models · Red-Teaming & Adversarial Robustness

Citation Metrics

Citations: 0

Influential citations: 0

References: 47

Year: 2026

Venue: N/A
