TU Munich · University of California · Apr 22, 2026 · arXiv:2604.20460

CCTVBench: Contrastive Consistency Traffic VideoQA Benchmark for Multimodal LLMs

Xingcheng Zhou, Hao Guo, Rui Song, Walter Zimmer, Mingyu Liu, André Schamschurko, Hu Cao

AI Summary

The authors introduce CCTVBench, a new benchmark for evaluating the contrastive consistency of video LLMs in safety-critical traffic scenarios using real accident videos paired with world-model-generated counterfactuals. They find a significant gap between per-instance QA performance and contrastive consistency across various models, highlighting issues with "none-of-the-above" rejection. To mitigate this, they propose C-TCD, a contrastive decoding method that leverages semantically exclusive counterpart videos during inference, improving both QA accuracy and consistency.
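The idea of contrasting against a counterpart video can be illustrated with a generic contrastive-decoding step. The sketch below is not the authors' C-TCD implementation: the combination rule `(1 + alpha) * main - alpha * contrast` is a common formulation from the contrastive-decoding literature, and the option names, the logit values, and the `alpha` parameter are all hypothetical, chosen only to show how a bias shared by both videos can cancel out.

```python
import math

# Hypothetical answer options for one multiple-choice traffic question.
OPTIONS = ["hazard_A", "hazard_B", "none_of_the_above"]


def contrastive_logits(main, contrast, alpha=1.0):
    """Generic contrastive decoding (assumed form, not taken from the paper):
    amplify what the main video supports and subtract what the
    counterpart video supports."""
    return [(1 + alpha) * m - alpha * c for m, c in zip(main, contrast)]


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    mx = max(xs)
    exps = [math.exp(x - mx) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def argmax(xs):
    return max(range(len(xs)), key=lambda i: xs[i])


# Made-up logits: the model leans toward "hazard_A" on BOTH videos,
# i.e. the preference is a shared bias rather than visual evidence.
logits_main = [2.0, 1.0, 1.9]      # video being questioned
logits_contrast = [2.0, 0.5, 0.5]  # semantically exclusive counterpart

adjusted = contrastive_logits(logits_main, logits_contrast, alpha=1.0)
probs = softmax(adjusted)

print("plain choice:      ", OPTIONS[argmax(logits_main)])
print("contrastive choice:", OPTIONS[argmax(adjusted)])
```

In this toy case the plain logits pick `hazard_A`, while the contrast-adjusted logits pick `none_of_the_above`, because the support for `hazard_A` is equally strong under the counterpart video and is therefore discounted.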

Key Contribution

Video LLMs can ace individual traffic video questions but still fail spectacularly at subtle counterfactual reasoning, revealing a critical blind spot for safety-critical applications.

Abstract

Safety-critical traffic reasoning requires contrastive consistency: models must detect true hazards when an accident occurs, and reliably reject plausible-but-false hypotheses under near-identical counterfactual scenes. We present CCTVBench, a Contrastive Consistency Traffic VideoQA Benchmark built on paired real accident videos and world-model-generated counterfactual counterparts, together with minimally different, mutually exclusive hypothesis questions. CCTVBench enforces a single structured decision pattern over each video question quadruple and provides actionable diagnostics that decompose failures into positive omission, positive swap, negative hallucination, and mutual-exclusivity violation, while separating video versus question consistency. Experiments across open-source and proprietary video LLMs reveal a large and persistent gap between standard per-instance QA metrics and quadruple-level contrastive consistency, with unreliable none-of-the-above rejection as a key bottleneck. Finally, we introduce C-TCD, a contrastive decoding approach leveraging a semantically exclusive counterpart video as the contrast input at inference time, improving both instance-level QA and contrastive consistency.
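The gap between per-instance QA metrics and quadruple-level consistency can be made concrete with a small sketch. It assumes, beyond what the abstract states, that a quadruple is the cross product of two videos (real and counterfactual) with two hypothesis questions, and that a quadruple counts as consistent only when all four answers are correct. The `Item` class, the field names, and the labels are hypothetical, and the sketch does not attempt the benchmark's finer failure decomposition.

```python
from dataclasses import dataclass

NOTA = "none_of_the_above"  # hypothetical label for the rejection option


@dataclass(frozen=True)
class Item:
    video: str     # "real" or "counterfactual" (assumed naming)
    question: str  # "q1" or "q2" (assumed naming)
    gold: str      # expected answer
    pred: str      # model answer


def instance_accuracy(quads):
    """Fraction of individual video-question pairs answered correctly."""
    items = [item for quad in quads for item in quad]
    return sum(item.pred == item.gold for item in items) / len(items)


def quadruple_consistency(quads):
    """Fraction of quadruples in which all four answers are correct
    (an assumed all-or-nothing reading of the quadruple-level metric)."""
    return sum(all(i.pred == i.gold for i in quad) for quad in quads) / len(quads)


# Quadruple 1: the model gets all four pairs right.
quad_ok = [
    Item("real", "q1", "collision", "collision"),
    Item("real", "q2", NOTA, NOTA),
    Item("counterfactual", "q1", NOTA, NOTA),
    Item("counterfactual", "q2", "near_miss", "near_miss"),
]

# Quadruple 2: one failed rejection; the model reports a collision
# on the counterfactual video where the expected answer is NOTA.
quad_bad = [
    Item("real", "q1", "collision", "collision"),
    Item("real", "q2", NOTA, NOTA),
    Item("counterfactual", "q1", NOTA, "collision"),
    Item("counterfactual", "q2", "near_miss", "near_miss"),
]

quads = [quad_ok, quad_bad]
print(instance_accuracy(quads))      # 7 of 8 pairs correct
print(quadruple_consistency(quads))  # 1 of 2 quadruples fully correct
```

Under these assumptions a single failed rejection leaves instance accuracy at 7/8 but halves quadruple consistency, which is the kind of divergence the abstract describes.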

Computer Vision · Eval Frameworks & Benchmarks · Multimodal Models

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
