Idiap | Apr 7, 2026 | arXiv:2604.06487

Closing the Speech-Text Gap with Limited Audio for Effective Domain Adaptation in LLM-Based ASR

Thibault Bañeras-Roux, Sergio Gastón Burdisso, Esaú Villatoro-Tello, Dairazalia Sánchez-Cortés, Séverin Baroudi, Shashi Kumar, Hasindri Watawana, E. ManjunathK, Kadri Hacioglu, Petr Motlícek, A. Stolcke

AI Summary

This paper investigates the modality gap that arises when LLM-based ASR systems are adapted using text-only data. It compares text-only adaptation, paired speech-text adaptation, and mixed batching (MB), which combines both, to determine whether small amounts of speech can mitigate the modality mismatch. Results show that MB using only 10% of the target-domain speech achieves word error rates comparable to, or better than, conventional ASR fine-tuning with the full dataset, indicating that small amounts of speech are effective for modality alignment.

Key Contribution

Less than 4 hours of target-domain speech (10% of the data), used in mixed batching, yields word error rates comparable to or better than full-dataset ASR fine-tuning, indicating that a small amount of speech is enough for modality alignment.

Abstract

Conventional end-to-end automatic speech recognition (ASR) systems rely on paired speech-text data for domain adaptation. Recent LLM-based ASR architectures connect a speech encoder to a large language model via a projection module, enabling adaptation with text-only data. However, this introduces a modality gap, as the LLM is not exposed to the noisy representations produced by the speech projector. We investigate whether small amounts of speech can mitigate this mismatch. We compare three strategies: text-only adaptation, paired speech-text adaptation, and mixed batching (MB), which combines both. Experiments in in-domain and out-of-domain settings show that even limited speech consistently improves performance. Notably, MB using only 10% of the target-domain speech (less than 4 hours) achieves word error rates comparable to, or better than, conventional ASR fine-tuning with the full dataset, indicating that small amounts of speech provide a strong modality-alignment signal.
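
As a rough illustration of the mixed batching idea, the sketch below builds training batches that combine text-only samples with a small pool of paired speech-text samples. The abstract does not state the mixing ratio, the sampling scheme, or how the two sample types are fed to the model, so everything here is an assumption: the function name `mixed_batches`, the `speech_fraction` parameter, and sampling the small speech pool with replacement are all hypothetical choices, not the authors' method.

```python
import random
from typing import Iterator, List, Sequence, Tuple

# A sample is tagged with its modality so a training loop could route it:
# "text" samples would bypass the speech encoder, "speech" samples would not.
Sample = Tuple[str, str]  # (modality, sample_id)


def mixed_batches(
    text_ids: Sequence[str],
    speech_ids: Sequence[str],
    batch_size: int,
    speech_fraction: float,  # assumed knob; the paper's ratio is not given here
    seed: int = 0,
) -> Iterator[List[Sample]]:
    """Yield batches that mix text-only and paired speech-text samples."""
    n_speech = max(1, round(batch_size * speech_fraction))
    n_text = batch_size - n_speech
    if n_text < 1:
        raise ValueError("speech_fraction leaves no room for text-only samples")
    if not speech_ids:
        raise ValueError("at least one paired speech-text sample is required")

    rng = random.Random(seed)
    text_pool = list(text_ids)
    speech_pool = list(speech_ids)
    rng.shuffle(text_pool)

    # One pass over the text data; an incomplete final batch is dropped.
    for start in range(0, len(text_pool) - n_text + 1, n_text):
        text_part = [("text", t) for t in text_pool[start:start + n_text]]
        # The speech pool is small, so it is sampled with replacement
        # and reused across batches (an assumption of this sketch).
        speech_part = [("speech", rng.choice(speech_pool)) for _ in range(n_speech)]
        batch = text_part + speech_part
        rng.shuffle(batch)
        yield batch
```

With 12 text samples, 2 speech samples, `batch_size=4`, and `speech_fraction=0.25`, this produces four batches, each holding three text-only samples and one speech sample. Drawing the speech samples with replacement is one simple way to keep a limited speech set present in every batch; other schedules are equally plausible.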

Multimodal Models, Natural Language Processing, Speech & Audio

Citation Metrics

Citations: 0

Influential citations: 0

References: 28

Year: 2026

Venue: N/A
