Saarland University · Apr 13, 2026 · arXiv:2604.11803

Saar-Voice: A Multi-Speaker Saarbrücken Dialect Speech Corpus

Lena S. Oberkircher, Jesujoba O. Alabi, Dietrich Klakow, Jürgen Trouvain

AI Summary

The paper introduces Saar-Voice, a six-hour speech corpus of the Saarbrücken dialect of German. Its text was collected from digitized books and locally sourced materials, and a subset was recorded by nine speakers, yielding aligned text and audio. Analysis of the dataset highlights challenges arising from orthographic and speaker variation, and the paper explores grapheme-to-phoneme (G2P) conversion for dialectal speech.

Key Contribution

Dialect-specific speech datasets such as Saar-Voice can help narrow the performance gap between models built for standardized language varieties and real-world dialectal speech.

Abstract

Natural language processing (NLP) and speech technologies have made significant progress in recent years; however, they remain largely focused on standardized language varieties. Dialects, despite their cultural significance and widespread use, are underrepresented in linguistic resources and computational models, resulting in performance disparities. To address this gap, we introduce Saar-Voice, a six-hour speech corpus for the Saarbrücken dialect of German. The dataset was created by first collecting text from digitized books and locally sourced materials. A subset of this text was recorded by nine speakers, and we analyzed both the textual and speech components to assess the dataset's characteristics and quality. We discuss methodological challenges related to orthographic and speaker variation, and explore grapheme-to-phoneme (G2P) conversion. The resulting corpus provides aligned textual and audio representations and serves as a foundation for future research on dialect-aware text-to-speech (TTS), particularly in low-resource scenarios, including zero-shot and few-shot model adaptation.

Data Curation & Synthetic Data · Natural Language Processing · Speech & Audio

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
