NTU | Feb 26, 2026 | arXiv:2602.22659

Scaling Audio-Visual Quality Assessment Dataset via Crowdsourcing

Renyu Yang, Jian Jin, Lili Meng, Meiqin Liu, Yilin Wang, Balu Adsumilli, Weisi Lin

AI Summary

The authors address the limitations of existing audio-visual quality assessment (AVQA) datasets by introducing a crowdsourced subjective experiment framework coupled with a systematic data preparation strategy to generate a large and diverse dataset. This approach overcomes the constraints of in-lab settings and ensures reliable annotation across varied environments while also covering a broad range of quality levels and semantic scenarios. The resulting YT-NTU-AVQ dataset, comprising 1,620 user-generated audio and video sequences, is the largest and most diverse AVQA dataset to date, enabling research on multimodal perception mechanisms.

Key Contribution

The YT-NTU-AVQ dataset, 10x larger than previous AVQA datasets, unlocks new possibilities for training and evaluating multimodal perception models by offering unprecedented scale and diversity.

Abstract

Audio-visual quality assessment (AVQA) research has been stalled by the limitations of existing datasets: they are typically small in scale, lack diversity in content and quality, and are annotated only with overall scores. As a result, they offer limited support for model development and multimodal perception research. We propose a practical approach for AVQA dataset construction. First, we design a crowdsourced subjective experiment framework for AVQA that breaks the constraints of in-lab settings and achieves reliable annotation across varied environments. Second, we employ a systematic data preparation strategy to ensure broad coverage of both quality levels and semantic scenarios. Third, we extend the dataset with additional annotations, enabling research on multimodal perception mechanisms and their relation to content. Finally, we validate this approach through YT-NTU-AVQ, the largest and most diverse AVQA dataset to date, consisting of 1,620 user-generated audio and video (A/V) sequences. The dataset and platform code are available at https://github.com/renyu12/YT-NTU-AVQ

Data Curation & Synthetic Data | Multimodal Models | Speech & Audio

Citation Metrics

Citations: 0

Influential citations: 0

References: 35

Year: 2026

Venue: N/A
