Search papers, labs, and topics across Lattice.
This paper introduces a counterfactual event-binding protocol to address the issue of "almost-true" errors in long-video Omni assistants, where models misbind real evidence to incorrect speakers or moments. The authors create a benchmark, \bench, consisting of 3,600 QA items from long-form videos, revealing that open-weight Omni models significantly underperform compared to a closed-source reference in pair-level binding accuracy. To improve performance without modifying the model backbone, they propose Modality-Perturbation Reliability Calibration, which enhances accuracy in binding claims and improves performance on related benchmarks.
Open-weight Omni models struggle with binding accuracy, achieving only 41.55% on a new counterfactual benchmark, highlighting a critical gap in long-video comprehension.
Long-video Omni assistants often fail not by inventing content, but by misbinding real evidence: they hear the right utterance and see the right event, yet attach it to the wrong speaker, moment, or modality. These \emph{almost-true} errors evade standard video QA because local evidence remains valid, so item-level scoring can reward both a supported claim and its near-counterfactual. We introduce a counterfactual event-binding protocol that constructs paired supported/counterfactual claims from the same audio-visual event evidence and evaluates them by strict-pair accuracy. We instantiate it as \bench, a benchmark for long-video Omni hallucination, with 3{,}600 single-claim QA items from 638 long-form videos averaging 24.16 minutes and covering 256.87 hours. Under this protocol, open-weight Omni models remain weak at pair-level binding: Qwen2.5-Omni-7B reaches 32.06\% and Qwen3-Omni-Instruct reaches 41.55\%, versus 76.54\% for a closed-source reference. To narrow this gap without updating the backbone, we propose \method, Modality-Perturbation Reliability Calibration, a frozen-backbone framework that selects audio-negative probes within video-level folds and fuses their response shifts with native audio-visual confidence into per-claim support estimates. \method lifts Qwen2.5-Omni-7B to 36.22\% and Qwen3 to 51.09\% on \bench, and improves target-adapted MCQ accuracy on OmniVideoBench ($+$2.20) and WorldSense ($+$1.51) with Qwen3.