Mar 9, 2026 · arXiv:2603.08046

WhispEar: A Bi-directional Framework for Scaling Whispered Speech Conversion via Pseudo-Parallel Whisper Generation

Zihao Fang, Ying Shen, Yingda Shen, Zifan Guan, Tongtong Song, Zhenyi Liu, Zhizheng Wu

AI Summary

The paper introduces WhispEar, a bidirectional framework for whisper-to-normal (W2N) and normal-to-whisper (N2W) speech conversion leveraging unified semantic representations. A key innovation is the N2W model's ability to generate pseudo-parallel whispered speech from normal speech data, enabling scalable data augmentation for W2N training. Experiments on a newly released bilingual whispered-normal parallel corpus show that WhispEar outperforms baselines and benefits from increased pseudo-parallel data.

Key Contribution

Whisper-to-normal conversion is trained on whispered speech synthesized from readily available normal speech, which scales the training set well beyond the limited real parallel data.

Abstract

Whispered speech lacks vocal fold vibration and fundamental frequency, resulting in degraded acoustic cues and making whisper-to-normal (W2N) conversion challenging, especially with limited parallel data. We propose WhispEar, a bidirectional framework based on unified semantic representations that capture speaking-mode-invariant information shared by whispered and normal speech. The framework contains both W2N and normal-to-whisper (N2W) models. Notably, the N2W model enables zero-shot pseudo-parallel whisper generation from abundant normal speech, allowing scalable data augmentation for W2N training. Increasing generated data consistently improves performance. We also release the largest bilingual (Chinese-English) whispered-normal parallel corpus to date. Experiments demonstrate that WhispEar outperforms strong baselines and benefits significantly from scalable pseudo-parallel data.
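The pseudo-parallel augmentation described in the abstract can be sketched roughly as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: `Pair`, `build_w2n_training_set`, and `toy_n2w` are hypothetical names, waveforms are represented as plain lists of floats, and `toy_n2w` is a trivial stand-in (it only scales amplitude) for a trained N2W model.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

Waveform = List[float]  # assumption: a waveform is a plain list of samples


@dataclass
class Pair:
    whisper: Waveform   # W2N model input
    normal: Waveform    # W2N model target
    is_pseudo: bool     # True if the whisper side was generated by N2W


def build_w2n_training_set(
    real_pairs: Sequence[Tuple[Waveform, Waveform]],
    normal_only: Sequence[Waveform],
    n2w: Callable[[Waveform], Waveform],
) -> List[Pair]:
    """Combine real (whisper, normal) pairs with pseudo-parallel pairs.

    Each normal-only utterance is passed through the N2W model to obtain
    a generated whisper, which is paired with its normal-speech source.
    """
    pairs = [Pair(w, n, is_pseudo=False) for w, n in real_pairs]
    for normal in normal_only:
        pairs.append(Pair(n2w(normal), normal, is_pseudo=True))
    return pairs


def toy_n2w(normal: Waveform) -> Waveform:
    # Hypothetical placeholder: a real N2W model would remove voicing;
    # this only halves the amplitude so the sketch stays runnable.
    return [0.5 * x for x in normal]
```

Under these assumptions, enlarging `normal_only` is the knob that scales the amount of pseudo-parallel data available for W2N training; keeping the `is_pseudo` flag makes it possible to control the mix of real and generated pairs.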

Natural Language Processing · Speech & Audio

Citation Metrics

Citations: 0

Influential citations: 0

References: 27

Year: 2026

Venue: N/A
