This paper introduces ReNikud, a novel approach to grapheme-to-phoneme (G2P) conversion for Modern Hebrew that leverages weak audio supervision and a pseudo-vocalization architecture to address the challenges posed by the language's abjad writing system. By utilizing a phoneme-based automatic speech recognition pipeline on extensive unlabeled audio data, ReNikud generates phonemic transcriptions that accurately reflect natural spoken language, overcoming limitations of traditional methods that rely on scarce vocalization data. The results demonstrate that ReNikud outperforms existing state-of-the-art G2P systems on both established benchmarks and a new targeted benchmark for spoken Hebrew, indicating its effectiveness for applications like text-to-speech.
Weak audio supervision lets ReNikud outperform previous Hebrew grapheme-to-phoneme methods, which are limited by scarce vocalization data and by formal rather than everyday spoken pronunciation.
Grapheme-to-phoneme (G2P) conversion for Modern Hebrew is needed for applications such as text-to-speech (TTS), but is challenging because the language's abjad writing system leaves vowels largely unwritten, creating substantial ambiguity. Standard approaches first predict vowel diacritics (nikud) and then derive International Phonetic Alphabet (IPA) transcriptions from them, but this route is limited in three ways: vocalization data is scarce and laborious to produce, nikud does not specify features such as lexical stress, and it reflects formal grammatical rules rather than everyday spoken pronunciation. Direct sequence-to-sequence IPA prediction, meanwhile, struggles on limited data and fails to exploit the character-level alignment characteristic of abjads. Our method, ReNikud, overcomes these limitations with two key ideas: (1) weak audio supervision via a phoneme-based automatic speech recognition (ASR) pseudo-labeling pipeline run on thousands of hours of unlabeled Hebrew audio, yielding phonemic transcriptions that reflect natural spoken norms without manual annotation; and (2) a pseudo-vocalization architecture that predicts IPA phonemes at each character position, enforcing character-level alignment as an inductive bias. Results on existing Hebrew G2P benchmarks and on MILIM, a new targeted benchmark for spoken Hebrew, show that ReNikud surpasses previous state-of-the-art methods. We will release our code and trained models to support further work on Hebrew TTS and speech technologies.
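To make the idea of character-level alignment concrete, the following minimal sketch shows what a per-character phoneme output could look like. It is only an illustration of the output format described above, not the ReNikud model: the function name `join_aligned` is hypothetical, the per-character labels are written by hand rather than predicted, and stress marking is omitted for simplicity.

```python
from typing import List


def join_aligned(chars: str, labels: List[str]) -> str:
    """Concatenate one phoneme label per character into a word-level transcription.

    Each label holds the phonemes attributed to the character at the same
    position; a label may be empty if a character contributes no sound.
    """
    if len(chars) != len(labels):
        raise ValueError(
            f"expected {len(chars)} labels, one per character, got {len(labels)}"
        )
    return "".join(labels)


# Hypothetical, hand-written labels (an assumption for illustration, not model output).
shalom = join_aligned("שלום", ["ʃa", "l", "o", "m"])

# The same three letters admit several readings; only the labels differ.
book = join_aligned("ספר", ["se", "fe", "r"])
barber = join_aligned("ספר", ["sa", "pa", "r"])

print(shalom, book, barber)
```

Because every label is tied to exactly one written character, the output length is fixed by the input, which is the alignment constraint a free-form sequence-to-sequence decoder does not have. The second example shows the ambiguity that unwritten vowels create: one spelling, two different label sequences.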