Mar 9, 2026 · arXiv:2603.08249

Bootstrapping Audiovisual Speech Recognition in Zero-AV-Resource Scenarios with Synthetic Visual Data

Pol Buitrago, Pol Galvez, Pol Gàlvez, Oriol Pareras, Javier Hernando

AI Summary

This paper introduces a zero-AV-resource AVSR framework that overcomes the lack of labeled video data for under-resourced languages by training on synthetic visual streams, generated by lip-syncing static facial images with real audio. The method fine-tunes a pre-trained AV-HuBERT model on over 700 hours of synthetic talking-head video. Experiments on Catalan, a language with no annotated audiovisual corpora, show near state-of-the-art performance with fewer parameters and less training data; the model outperforms an identically trained audio-only baseline and retains its multimodal advantage in noise.

Key Contribution

Enables audiovisual speech recognition for languages with zero labeled video data by training on synthetically generated talking-head videos, demonstrated on Catalan.

Abstract

Audiovisual speech recognition (AVSR) combines acoustic and visual cues to improve transcription robustness under challenging conditions, but it remains out of reach for most under-resourced languages because labeled video corpora for training are lacking. We propose a zero-AV-resource AVSR framework that relies on synthetic visual streams generated by lip-syncing static facial images with real audio. We first evaluate synthetic visual augmentation on Spanish benchmarks, then apply it to Catalan, a language with no annotated audiovisual corpora. We synthesize over 700 hours of talking-head video and fine-tune a pre-trained AV-HuBERT model. On a manually annotated Catalan benchmark, our model achieves near state-of-the-art performance with far fewer parameters and much less training data, outperforms an identically trained audio-only baseline, and preserves multimodal advantages in noise. Scalable synthetic video thus offers a viable substitute for real recordings in zero-AV-resource AVSR.
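
The data-synthesis step described in the abstract (pairing real audio with static facial images and lip-syncing them) can be pictured with a minimal Python sketch. This is an illustration under stated assumptions, not the authors' pipeline: the random face-to-utterance pairing policy, the names `SyntheticClipJob`, `plan_jobs`, and `run_jobs`, the output file naming, and the `lip_sync` callable are all hypothetical. The abstract does not say which lip-sync model is used or how faces are assigned to utterances.

```python
import random
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass(frozen=True)
class SyntheticClipJob:
    """One unit of work: animate a static face with a real audio clip."""
    audio_path: str  # real speech recording (its transcript lives elsewhere)
    face_path: str   # static facial image to animate
    video_path: str  # where the lip-synced video would be written


def plan_jobs(audio_paths: Sequence[str], face_paths: Sequence[str],
              out_dir: str, seed: int = 0) -> List[SyntheticClipJob]:
    """Pair every audio clip with a randomly drawn face image.

    Random pairing is an assumed policy chosen for illustration only.
    """
    if not face_paths:
        raise ValueError("at least one face image is required")
    rng = random.Random(seed)  # fixed seed keeps the pairing reproducible
    jobs = []
    for i, audio in enumerate(audio_paths):
        face = rng.choice(face_paths)
        jobs.append(SyntheticClipJob(audio, face, f"{out_dir}/clip_{i:06d}.mp4"))
    return jobs


def run_jobs(jobs: Sequence[SyntheticClipJob],
             lip_sync: Callable[[str, str, str], None]) -> int:
    """Call a user-supplied lip-sync function for each job.

    `lip_sync(face_path, audio_path, video_path)` is a hypothetical hook;
    any talking-head generator could be wrapped to fit it.
    Returns the number of jobs processed.
    """
    for job in jobs:
        lip_sync(job.face_path, job.audio_path, job.video_path)
    return len(jobs)
```

In use, `plan_jobs` would be called once over the transcribed audio corpus, and the resulting videos would be combined with the existing audio transcripts to form audiovisual training pairs. Keeping the planning step separate from the rendering step makes the pairing reproducible and lets the lip-sync backend be swapped without changing the manifest.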

Data Curation & Synthetic Data · Multimodal Models · Speech & Audio

Citation Metrics

Citations: 0

Influential citations: 0

References: 42

Year: 2026

Venue: N/A
