B-itUMD | Jun 4, 2026 | arXiv:2606.05763

M2S-AVSR: Modality-aware Multi-view Self-supervised Representation for Robust Audio-Visual Speech Recognition

AI Summary

This paper introduces M2S-AVSR, an audio-visual speech recognition framework that combines a multi-view representation learning encoder with a modality-aware module to handle viewpoint variation, audio distortion, and visual occlusion. By explicitly modeling modality quality and cross-modal synchrony, the framework achieves up to 29.4% relative improvement on LRS3 under viewpoint perturbation and visual degradation. The authors also present AISHELL8-RealScene, a public multi-scenario, multi-view conversational audio-visual dataset recorded in real-world environments, and establish a speech recognition benchmark on it. Experiments cover English and Mandarin benchmarks.

Key Contribution

M2S-AVSR achieves up to 29.4% relative improvement on LRS3 under viewpoint perturbation and visual degradation, and reports a new state of the art on the MISP2021-AVSR test set.

Abstract

Audio-Visual Speech Recognition (AVSR) enhances speech recognition robustness by leveraging visual cues, but real-world scenarios remain challenging: viewpoint variation, audio distortion, and visual occlusion degrade modality quality and increase audio-visual asynchrony. In this paper, we propose a novel Modality-aware Multi-view Self-supervised representation framework for robust Audio-Visual Speech Recognition (M2S-AVSR). First, we introduce a multi-view representation learning encoder to learn view-invariant visual speech representations. Next, we employ a modality-aware module that explicitly models modality quality and cross-modal synchrony to perform fine-grained modality-aware fusion, which allows visual information to be injected selectively during decoding. In addition, we present AISHELL8-RealScene, a public multi-scenario, multi-view conversational audio-visual dataset recorded in real-world environments, and establish a speech recognition benchmark on it. Experiments on English and Mandarin benchmarks demonstrate the effectiveness of the proposed method under challenging conditions. On LRS3, M2S-AVSR achieves up to 29.4% relative improvement under viewpoint perturbation and visual degradation settings. Our method also achieves new state-of-the-art performance on the MISP2021-AVSR test set. On AISHELL8-RealScene, it achieves the best result in outdoor scenes. The proposed method and dataset provide useful support for future research on robust speech and multimodal tasks under realistic conditions.
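
To make the idea of modality-aware fusion concrete, the following is a minimal sketch and not the authors' architecture, which the abstract does not specify. It assumes a per-frame visual quality score in [0, 1] and uses cosine similarity between audio and visual features as a stand-in for cross-modal synchrony. The names `fuse_frame`, `visual_quality`, and the additive gating rule are all hypothetical choices made for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def fuse_frame(audio, visual, visual_quality):
    """Hypothetical per-frame fusion: gate the visual features before adding them.

    audio, visual   -- feature vectors of the same length (assumed already projected
                       into a shared space)
    visual_quality  -- assumed quality score in [0, 1], e.g. low for occluded frames
    Returns (fused_vector, gate).
    """
    # Stand-in synchrony score: clip negative similarity to 0 so the gate stays in [0, 1].
    sync = max(0.0, cosine(audio, visual))
    # The gate is small when the visual stream is degraded OR out of sync with audio.
    gate = visual_quality * sync
    fused = [a + gate * v for a, v in zip(audio, visual)]
    return fused, gate


# Usage: a clean, synchronous frame contributes fully; a degraded one is ignored.
clean_fused, clean_gate = fuse_frame([1.0, 0.0], [1.0, 0.0], visual_quality=1.0)
occluded_fused, occluded_gate = fuse_frame([1.0, 0.0], [1.0, 0.0], visual_quality=0.0)
```

In this sketch a frame with zero quality or zero synchrony falls back to the audio features alone, which is one simple way to keep a corrupted visual stream from harming recognition. A learned module would replace both the fixed quality score and the cosine proxy.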

Multimodal Models, Speech & Audio

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
