Apr 30, 2026 · arXiv:2604.27866

LRS-VoxMM: A benchmark for in-the-wild audio-visual speech recognition

Doyeop Kwak, Jeongsoo Choi, Suyeon Lee, Joon Son Chung

AI Summary

The paper introduces LRS-VoxMM, an audio-visual speech recognition (AVSR) benchmark derived from VoxMM, a dataset of diverse real-world spoken conversations. Compared with commonly used benchmarks, it covers a wider range of scenarios and acoustic conditions, and it includes distorted evaluation sets with additive noise, reverberation, and bandwidth limitation. Experiments show that LRS-VoxMM is considerably harder than LRS3 and that visual information contributes more as the audio signal degrades.

Key Contribution

LRS-VoxMM is a challenging AVSR benchmark derived from real-world conversations, on which visual cues contribute more to recognition as audio quality degrades.

Abstract

We introduce LRS-VoxMM, an in-the-wild benchmark for audio-visual speech recognition (AVSR). The benchmark is derived from VoxMM, a dataset of diverse real-world spoken conversations with human-annotated transcriptions. We select AVSR-suitable samples and preprocess them in an LRS-style format for direct use in existing AVSR pipelines. Compared with commonly used benchmarks, LRS-VoxMM covers a more diverse range of scenarios and acoustic conditions. We also release distorted evaluation sets with additive noise, reverberation, and bandwidth limitation to support evaluation under severe acoustic degradation. Experimental results show that LRS-VoxMM is considerably harder than LRS3 and that the contribution of visual information becomes more evident as the audio signal degrades. LRS-VoxMM supports more realistic AVSR benchmarking and encourages further research on the role of visual information in challenging real-world conditions.
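To make the additive-noise condition mentioned in the abstract concrete, here is a minimal sketch of mixing a noise signal into speech at a target signal-to-noise ratio (SNR). It is not the authors' pipeline: the function names `mix_at_snr` and `measure_snr_db`, the synthetic signals, and the 5 dB level are all assumptions chosen for illustration, and the listing above does not state which noise sources or SNR levels the released evaluation sets use.

```python
import math
import random


def _power(signal):
    """Mean squared amplitude of a sequence of samples."""
    return sum(x * x for x in signal) / len(signal)


def mix_at_snr(speech, noise, snr_db):
    """Add noise to speech, scaling the noise so the mix has the requested SNR.

    Hypothetical helper for illustration; samples are plain Python floats.
    """
    if len(noise) < len(speech):
        raise ValueError("noise must be at least as long as speech")
    noise = noise[:len(speech)]
    p_speech = _power(speech)
    p_noise = _power(noise)
    if p_speech == 0 or p_noise == 0:
        raise ValueError("speech and noise must both have non-zero power")
    # SNR(dB) = 10 * log10(p_speech / p_scaled_noise), solved for the noise gain.
    gain = math.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return [s + gain * n for s, n in zip(speech, noise)]


def measure_snr_db(speech, mixed):
    """SNR of a mix, treating everything that is not the clean speech as noise."""
    residual = [m - s for s, m in zip(speech, mixed)]
    return 10 * math.log10(_power(speech) / _power(residual))


if __name__ == "__main__":
    rng = random.Random(0)
    # Stand-in signals: a 220 Hz tone as "speech" and Gaussian noise, 1 s at 16 kHz.
    speech = [math.sin(2 * math.pi * 220 * t / 16000) for t in range(16000)]
    noise = [rng.gauss(0.0, 1.0) for _ in range(16000)]
    noisy = mix_at_snr(speech, noise, snr_db=5.0)
    print(f"measured SNR: {measure_snr_db(speech, noisy):.2f} dB")
```

Lower `snr_db` values give a more heavily corrupted signal, which is the regime in which the abstract reports that visual information contributes most.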

Eval Frameworks & Benchmarks · Multimodal Models · Speech & Audio

