Cornell, HIT, NTU, OUC, PolyU, SJTU, Tencent AI, UMich, USTC
Jun 8, 2026 | arXiv:2606.09295

NüshuVoice: Reviving the Voice of Endangered Nüshu with Pitch-Aware Text-to-Speech

Hongkun Yang, Xinhui Yi, Xiyan Zhao, Yibo Meng, Lionel Z. Wang, Lixu Wang, Yaqi Zhang, Ruiqi Chen, Xuanyue Zhao, Lanxin Zhang, Yu Zeng, Weijia Chu, Yiming Ma, Chenyu Liu, Jianghao Lin, Xin Xu

AI Summary

This study introduces NüshuVoice, the first text-to-speech (TTS) benchmark for the endangered Nüshu script. To address the scarcity of acoustic data, the authors construct a sentence-level dataset that aligns Unicode Nüshu text, phonetic transcriptions, standard Chinese translations, and archival recordings. They propose Nüshu-PitchVITS, an F0-conditioned VITS framework that uses Nüshu's five-level pitch notation as an explicit prosodic cue. Experiments show that Nüshu-PitchVITS outperforms strong TTS baselines in spectral fidelity, pitch reconstruction, and human-rated intelligibility.

Key Contribution

Nüshu-PitchVITS conditions a VITS model on F0 guided by Nüshu's five-level pitch notation and, in an extreme low-resource setting, outperforms strong TTS baselines in spectral fidelity, pitch reconstruction, and human-rated intelligibility.

Abstract

Nüshu is an endangered phonetic script historically used by women in Jiangyong County, southern Hunan, China. While existing computational studies of Nüshu mainly focus on textual digitization and visual recognition, the acoustic reconstruction of its authentic pronunciation remains largely unexplored. Building a Nüshu text-to-speech (TTS) system is particularly challenging because available recordings are extremely limited and mostly consist of isolated syllable-level pronunciations rather than natural sentence-level utterances. In this work, we introduce NüshuVoice, the first TTS benchmark for Nüshu. We construct a sentence-level Nüshu text-to-audio dataset that aligns standardized Unicode Nüshu text, phonetic transcriptions, standard Chinese translations, and archival recordings. To synthesize speech under this extreme low-resource setting, we propose Nüshu-PitchVITS, an F0-conditioned VITS framework that leverages Nüshu's five-level pitch notation as an explicit prosodic inductive bias. Experimental results show that Nüshu-PitchVITS outperforms strong TTS baselines in spectral fidelity, pitch reconstruction, and human-rated intelligibility. We publicly release the dataset and code at: https://anonymous.4open.science/r/Nvshu-TTS-2EB6.
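To make the idea of pitch conditioning concrete, the following is a minimal sketch of how a five-level pitch notation could be turned into a frame-level F0 contour that a TTS model might consume as a conditioning signal. It is not the authors' implementation: the function name `tone_to_f0`, the F0 range of 150 to 350 Hz, the log-frequency interpolation, and the example tone strings are all assumptions chosen for illustration.

```python
import math


def tone_to_f0(tone: str, n_frames: int,
               f0_min: float = 150.0, f0_max: float = 350.0) -> list[float]:
    """Map a five-level tone string (e.g. "35") to an F0 contour in Hz.

    Hypothetical helper: the F0 range and interpolation scheme are
    illustrative assumptions, not values taken from the paper.
    """
    if n_frames < 1:
        raise ValueError("n_frames must be at least 1")
    if not tone or any(ch not in "12345" for ch in tone):
        raise ValueError("tone must be a non-empty string of digits 1-5")

    # Level 1 maps to f0_min and level 5 to f0_max, spaced evenly in log-Hz.
    log_lo, log_hi = math.log(f0_min), math.log(f0_max)
    targets = [log_lo + (int(ch) - 1) / 4 * (log_hi - log_lo) for ch in tone]

    # A single target (or a single frame) yields a flat contour.
    if len(targets) == 1 or n_frames == 1:
        return [math.exp(targets[0])] * n_frames

    # Linearly interpolate between consecutive pitch targets across the frames.
    contour = []
    for i in range(n_frames):
        pos = i / (n_frames - 1) * (len(targets) - 1)
        left = min(int(pos), len(targets) - 2)
        frac = pos - left
        log_f0 = (1 - frac) * targets[left] + frac * targets[left + 1]
        contour.append(math.exp(log_f0))
    return contour


if __name__ == "__main__":
    # A rising contour from level 3 to level 5 over 5 frames.
    print([round(f, 1) for f in tone_to_f0("35", 5)])
```

In a sketch like this, the resulting contour would be aligned to each syllable's duration and passed to the model alongside the text; how Nüshu-PitchVITS actually derives and injects F0 is described in the paper itself.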

Natural Language Processing, Speech & Audio

Year: 2026

Venue: N/A
