NVIDIA, Pitt | Mar 19, 2026 | arXiv:2603.19217

LVOmniBench: Pioneering Long Audio-Video Understanding Evaluation for Omnimodal LLMs

Keda Tao, Yuhua Zheng, Jia Xu, Wenjie Du, Kele Shao, Hesong Wang, Xueyi Chen, Xin Jin, Junhan Zhu, Bohan Yu, Weiqiang Wang, Jian Liu, Can Qin, Yulun Zhang, Ming-Hsuan Yang, Ming Yang, Huan Wang

AI Summary

LVOmniBench, a new benchmark dataset, was created to evaluate the cross-modal comprehension abilities of OmniLLMs on long-form audio and video inputs, ranging from 10 to 90 minutes. The dataset consists of 275 videos and 1,014 question-answer pairs, designed to test long-term memory, temporal localization, fine-grained understanding, and multimodal perception. Evaluation of current OmniLLMs on LVOmniBench reveals significant challenges, with open-source models achieving accuracies below 35% and Gemini 3 Pro peaking at approximately 65%, highlighting the need for improved models.

Key Contribution

Current OmniLLMs stumble when processing real-world, long-form audio-visual content: on a new benchmark designed to test long-term memory and fine-grained understanding, open-source models generally score below 35% accuracy, and Gemini 3 Pro peaks at approximately 65%.

Abstract

Recent advancements in omnimodal large language models (OmniLLMs) have significantly improved the comprehension of audio and video inputs. However, current evaluations primarily focus on short audio and video clips ranging from 10 seconds to 5 minutes, failing to reflect the demands of real-world applications, where videos typically run for tens of minutes. To address this critical gap, we introduce LVOmniBench, a new benchmark designed specifically for the cross-modal comprehension of long-form audio and video. This dataset comprises high-quality videos sourced from open platforms that feature rich audio-visual dynamics. Through rigorous manual selection and annotation, LVOmniBench comprises 275 videos, ranging in duration from 10 to 90 minutes, and 1,014 question-answer (QA) pairs. LVOmniBench aims to rigorously evaluate the capabilities of OmniLLMs across domains, including long-term memory, temporal localization, fine-grained understanding, and multimodal perception. Our extensive evaluation reveals that current OmniLLMs encounter significant challenges when processing extended audio-visual inputs. Open-source models generally achieve accuracies below 35%, whereas Gemini 3 Pro reaches a peak accuracy of approximately 65%. We anticipate that this dataset, along with our empirical findings, will stimulate further research and the development of advanced models capable of resolving complex cross-modal understanding problems within long-form audio-visual contexts.

Eval Frameworks & Benchmarks | Multimodal Models | Speech & Audio

Citation Metrics

Citations: 0

Influential citations: 0

References: 110

Year: 2026

Venue: N/A
