NUDT | Jun 2, 2026 | arXiv:2606.03614

OmniHalluc-L: Counterfactual Benchmarking and Modality-Perturbation Reliability Calibration for Long-Form Omni Hallucination

Zixuan Dong, Jiafu Tang, Zhide Lei, Zhe Cao, Zijie Zhang, Yanghai Wang, Xiaodong Wang, Baoyun Peng, Jiaheng Liu

AI Summary

This paper introduces a counterfactual event-binding protocol for "almost-true" errors in long-video Omni assistants, where a model perceives real evidence but binds it to the wrong speaker, moment, or modality. The authors build a benchmark, OmniHalluc-L, of 3,600 single-claim QA items drawn from 638 long-form videos. On its strict-pair accuracy metric, open-weight Omni models trail a closed-source reference by a wide margin: Qwen2.5-Omni-7B reaches 32.06% and Qwen3-Omni-Instruct 41.55%, versus 76.54% for the reference. To narrow the gap without updating the model backbone, they propose Modality-Perturbation Reliability Calibration, which raises these scores to 36.22% and 51.09% and, with Qwen3, improves target-adapted MCQ accuracy on OmniVideoBench (+2.20) and WorldSense (+1.51).

Key Contribution

Open-weight Omni models struggle with pair-level binding: Qwen3-Omni-Instruct reaches only 41.55% strict-pair accuracy on the new counterfactual benchmark, versus 76.54% for a closed-source reference, highlighting a critical gap in long-video comprehension.

Abstract

Long-video Omni assistants often fail not by inventing content, but by misbinding real evidence: they hear the right utterance and see the right event, yet attach it to the wrong speaker, moment, or modality. These "almost-true" errors evade standard video QA because local evidence remains valid, so item-level scoring can reward both a supported claim and its near-counterfactual. We introduce a counterfactual event-binding protocol that constructs paired supported/counterfactual claims from the same audio-visual event evidence and evaluates them by strict-pair accuracy. We instantiate it as OmniHalluc-L, a benchmark for long-video Omni hallucination, with 3,600 single-claim QA items from 638 long-form videos averaging 24.16 minutes and covering 256.87 hours. Under this protocol, open-weight Omni models remain weak at pair-level binding: Qwen2.5-Omni-7B reaches 32.06% and Qwen3-Omni-Instruct reaches 41.55%, versus 76.54% for a closed-source reference. To narrow this gap without updating the backbone, we propose Modality-Perturbation Reliability Calibration, a frozen-backbone framework that selects audio-negative probes within video-level folds and fuses their response shifts with native audio-visual confidence into per-claim support estimates. This calibration lifts Qwen2.5-Omni-7B to 36.22% and Qwen3 to 51.09% on OmniHalluc-L, and improves target-adapted MCQ accuracy on OmniVideoBench (+2.20) and WorldSense (+1.51) with Qwen3.
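The contrast between item-level scoring and strict-pair accuracy can be illustrated with a minimal sketch. It assumes a pair counts as correct only when the model accepts the supported claim and rejects its counterfactual. The abstract does not spell out this definition, and the function names and the `(pair_id, label, prediction)` tuple layout are hypothetical, chosen only for illustration.

```python
from collections import defaultdict


def item_accuracy(items):
    """Fraction of individual claims judged correctly.

    items: list of (pair_id, label, prediction) tuples, where label and
    prediction are booleans (True means "the claim is supported").
    """
    items = list(items)
    if not items:
        return 0.0
    return sum(label == pred for _, label, pred in items) / len(items)


def strict_pair_accuracy(items):
    """Fraction of pairs in which every claim is judged correctly.

    Assumed definition: a pair scores only if the supported claim is
    accepted AND its counterfactual is rejected.
    """
    pairs = defaultdict(list)
    for pair_id, label, pred in items:
        pairs[pair_id].append(label == pred)
    if not pairs:
        return 0.0
    return sum(all(results) for results in pairs.values()) / len(pairs)


# A model that answers "supported" to every claim, on two toy pairs.
always_yes = [
    ("p1", True, True), ("p1", False, True),
    ("p2", True, True), ("p2", False, True),
]

print(item_accuracy(always_yes))         # 0.5
print(strict_pair_accuracy(always_yes))  # 0.0
```

In this toy case, a model that accepts every claim is right on half the items but on none of the pairs, which is the failure mode that pair-level scoring is meant to expose.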

Eval Frameworks & Benchmarks | Multimodal Models

Citation Metrics

Citations: 0

Influential citations: 0

References: 33

Year: 2026

Venue: N/A
