This paper introduces M2S-AVSR, a framework for audio-visual speech recognition (AVSR) that combines a multi-view representation learning encoder with a modality-aware module to handle viewpoint variation, audio distortion, and visual occlusion. By explicitly modeling modality quality and cross-modal synchrony, the framework achieves up to 29.4% relative improvement on the LRS3 dataset under viewpoint perturbation and visual degradation. The authors also present AISHELL8-RealScene, a public multi-scenario, multi-view conversational audio-visual dataset recorded in real-world environments, and establish a speech recognition benchmark on it. Experiments cover both English and Mandarin benchmarks.
M2S-AVSR achieves up to 29.4% relative improvement on LRS3 under viewpoint perturbation and visual degradation.
Audio-Visual Speech Recognition (AVSR) improves the robustness of speech recognition by leveraging visual cues, but real-world scenarios remain challenging: viewpoint variation, audio distortion, and visual occlusion degrade modality quality and increase audio-visual asynchrony. In this paper, we propose a novel Modality-aware Multi-view Self-supervised representation framework for robust Audio-Visual Speech Recognition (M2S-AVSR). First, we introduce a multi-view representation learning encoder to learn view-invariant visual speech representations. Next, we employ a modality-aware module that explicitly models modality quality and cross-modal synchrony to perform fine-grained modality-aware fusion, so that visual information is injected selectively during decoding. In addition, we present AISHELL8-RealScene, a public multi-scenario, multi-view conversational audio-visual dataset recorded in real-world environments, and establish a speech recognition benchmark on it. Experiments on English and Mandarin benchmarks demonstrate the effectiveness of the proposed method under challenging conditions. On LRS3, M2S-AVSR achieves up to 29.4% relative improvement under viewpoint perturbation and visual degradation settings. Our method also achieves new state-of-the-art performance on the MISP2021-AVSR test set. On AISHELL8-RealScene, it achieves the best result in outdoor scenes. The proposed method and dataset provide useful support for future research on robust speech and multimodal tasks under realistic conditions.
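To make the "relative improvement" figure concrete, here is a minimal sketch of how such a number is typically computed. It assumes the improvement is measured as a relative reduction in an error rate (for example, word or character error rate); the abstract does not state the metric. The function name `relative_improvement` and the two error rates are hypothetical and chosen only to illustrate the arithmetic. They are not results from the paper.

```python
def relative_improvement(baseline_error: float, system_error: float) -> float:
    """Return the relative error-rate reduction as a percentage of the baseline."""
    if baseline_error <= 0:
        raise ValueError("baseline_error must be positive")
    return 100.0 * (baseline_error - system_error) / baseline_error


# Hypothetical error rates (in percent), made up for illustration only.
baseline = 10.0  # assumed baseline system error rate
proposed = 7.06  # assumed proposed-system error rate

print(f"{relative_improvement(baseline, proposed):.1f}% relative improvement")
```

Under this reading, a relative figure says nothing about the absolute gap on its own: the same percentage can correspond to a large or a small absolute reduction, depending on how high the baseline error rate is.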