Apr 28, 2026 · arXiv:2604.26136

One Voice, Many Tongues: Cross-Lingual Voice Cloning for Scientific Speech

AI Summary

This paper tackles cross-lingual voice cloning for scientific speech, a task made difficult by the specialized domain and the need to preserve speaker identity. The authors fine-tune the OmniVoice foundation model on synthetic data produced by multi-model ensemble distillation from the ACL 60/60 corpus. Fine-tuning on this synthetic data improves intelligibility, measured by word error rate (WER) and character error rate (CER), across Arabic, Chinese, and French while maintaining speaker similarity.

Key Contribution

Synthetic data generated by multi-model ensemble distillation consistently improves the intelligibility of cross-lingual voice cloning systems for scientific speech without sacrificing speaker similarity.

Abstract

Preserving a speaker's voice identity while generating speech in a different language remains a fundamental challenge in spoken language technology, particularly in specialized domains such as scientific communication. In this paper, we address this challenge through our system submission to the Cross-Lingual Voice Cloning shared task of the International Conference on Spoken Language Translation (IWSLT 2026). First, we evaluate several state-of-the-art voice cloning models for cross-lingual speech generation of scientific texts in Arabic, Chinese, and French. Then, we build voice cloning systems based on the OmniVoice foundation model. We employ data augmentation via multi-model ensemble distillation from the ACL 60/60 corpus. We investigate the effect of using this synthetic data for fine-tuning, demonstrating consistent improvements in intelligibility (WER and CER) across languages while preserving speaker similarity.
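The intelligibility metrics named in the abstract, WER and CER, are both edit-distance ratios: the number of substitutions, insertions, and deletions needed to turn a transcript into the reference, divided by the reference length in words or characters. The following is a minimal sketch of those standard definitions, not the authors' evaluation pipeline. The function names are illustrative, the transcripts are assumed to come from some speech recognizer run on the synthesized audio, and stripping whitespace before computing CER is an assumption (conventions differ). CER is the usual choice for Chinese, where text is not whitespace-delimited.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (words or characters)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference token
                curr[j - 1] + 1,     # insertion of a hypothesis token
                prev[j - 1] + cost,  # substitution, or match when cost is 0
            ))
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: word-level edit distance over reference word count."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate, with whitespace removed (an assumed convention)."""
    ref_chars = [c for c in reference if not c.isspace()]
    hyp_chars = [c for c in hypothesis if not c.isspace()]
    if not ref_chars:
        raise ValueError("reference must contain at least one character")
    return edit_distance(ref_chars, hyp_chars) / len(ref_chars)


if __name__ == "__main__":
    # Hypothetical reference text and recognizer transcript.
    reference = "the model clones the voice"
    transcript = "the model clone the voice"
    print("WER:", wer(reference, transcript))
    print("CER:", cer(reference, transcript))
```

Lower values indicate more intelligible output; a value of 0 means the transcript matches the reference exactly.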

Natural Language Processing · Speech & Audio

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
