Mar 10, 2026 | arXiv:2603.09512

Probing the Reliability of Driving VLMs: From Inconsistent Responses to Grounded Temporal Reasoning

Chun-Peng Chang, Chen-Yu Wang, Holger Caesar, Alain Pagani

AI Summary

This paper investigates the reliability of Vision-Language Models (VLMs) as driving assistants, focusing on response consistency and temporal reasoning. The authors identify that VLMs often exhibit inconsistent responses to minor input perturbations and struggle with temporally grounded reasoning, even when possessing strong visual understanding. To mitigate these issues, they introduce FutureVQA, a human-annotated benchmark dataset, and propose a self-supervised tuning approach with chain-of-thought reasoning, demonstrating improvements in both consistency and temporal reasoning without temporal labels.

Key Contribution

VLMs that excel at visual understanding can still fail at driving tasks requiring temporal reasoning, revealing an over-reliance on pretrained patterns instead of modeling dynamics.

Abstract

A reliable driving assistant should provide consistent responses based on temporally grounded reasoning derived from observed information. In this work, we investigate whether Vision-Language Models (VLMs), when applied as driving assistants, can respond consistently and understand how present observations shape future outcomes, or whether their outputs merely reflect patterns memorized during training without temporally grounded reasoning. While recent efforts have integrated VLMs into autonomous driving, prior studies typically emphasize scene understanding and instruction generation, implicitly assuming that strong visual interpretation naturally enables consistent future reasoning and thus ensures reliable decision-making, a claim we critically examine. We focus on two major challenges limiting VLM reliability in this setting: response inconsistency, where minor input perturbations yield different answers or, in some cases, responses degenerate toward near-random guessing, and limited temporal reasoning, in which models fail to reason about and align sequential events from current observations, often resulting in incorrect or even contradictory responses. Moreover, we find that models with strong visual understanding do not necessarily perform best on tasks requiring temporal reasoning, indicating a tendency to over-rely on pretrained patterns rather than modeling temporal dynamics. To address these issues, we adopt existing evaluation methods and introduce FutureVQA, a human-annotated benchmark dataset specifically designed to assess future scene reasoning. In addition, we propose a simple yet effective self-supervised tuning approach with chain-of-thought reasoning that improves both consistency and temporal reasoning without requiring temporal labels.
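As a rough illustration of what "response inconsistency under minor input perturbations" could mean in practice, the sketch below scores how often a model gives the same answer across rephrasings of one question. This is not the paper's evaluation protocol: the `consistency_rate` function, the majority-agreement metric, the `toy_model` stand-in, and the example prompts are all assumptions introduced here for illustration.

```python
from collections import Counter
from typing import Callable, Sequence


def consistency_rate(
    answer_fn: Callable[[str], str],
    prompt_variants: Sequence[str],
) -> float:
    """Fraction of prompt variants whose answer matches the majority answer.

    Assumed metric for illustration only; the paper may measure
    consistency differently.
    """
    if not prompt_variants:
        raise ValueError("need at least one prompt variant")
    # Normalize so trivial casing/whitespace differences do not count as disagreement.
    answers = [answer_fn(p).strip().lower() for p in prompt_variants]
    _, majority_count = Counter(answers).most_common(1)[0]
    return majority_count / len(answers)


def toy_model(prompt: str) -> str:
    # Hypothetical stand-in for a VLM: its answer depends on wording, not content.
    return "Yes" if "ahead" in prompt else "No"


variants = [
    "Is there a pedestrian ahead?",
    "Is a pedestrian ahead of the car?",
    "Can you see a pedestrian in front of the car?",
]
score = consistency_rate(toy_model, variants)
print(f"consistency: {score:.2f}")
```

A score of 1.0 means every rephrasing produced the same answer; lower values indicate that wording alone changed the response. A real evaluation would replace `toy_model` with a call to the VLM under test and would also need to perturb the visual input, which this sketch does not cover.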

Eval Frameworks & Benchmarks, Multimodal Models, Robotics & Embodied AI

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
