The paper introduces FlowW2N, a whispered-to-normal (W2N) speech conversion method that uses conditional flow matching trained on synthetic, time-aligned whisper-normal pairs. It conditions on domain-invariant ASR embeddings to generalize to real whispers without real paired data, addressing the temporal misalignment and data scarcity that make W2N conversion difficult. FlowW2N achieves state-of-the-art intelligibility on the CHAINS and wTIMIT datasets, reducing Word Error Rate by 26-46% relative to prior work.
Achieve SOTA whispered-to-normal speech conversion by training exclusively on synthetic data, bridging the gap to real-world whispers with domain-invariant ASR embeddings.
Whispered-to-normal (W2N) speech conversion aims to reconstruct missing phonation from whispered input while preserving content and speaker identity. The task is challenging because whispered and voiced recordings are temporally misaligned and paired data is scarce. We propose FlowW2N, a conditional flow matching approach that trains exclusively on synthetic, time-aligned whisper-normal pairs and conditions on domain-invariant features. We exploit high-level ASR embeddings that exhibit strong invariance between synthetic and real whispered speech, enabling generalization to real whispers despite never observing them during training. We verify this invariance across ASR layers and propose a layer selection criterion that optimizes content informativeness and cross-domain invariance. Our method achieves SOTA intelligibility on the CHAINS and wTIMIT datasets, reducing Word Error Rate by 26-46% relative to prior work while using only 10 steps at inference and requiring no real paired data.
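To make the "10 steps at inference" figure concrete, the sketch below shows how a flow matching model is typically sampled: a learned velocity field is integrated from t = 0 (noise) to t = 1 (output) with a fixed number of ODE steps. This is a minimal illustration, not the FlowW2N implementation. The choice of a plain Euler solver, the names `euler_sample` and `toy_velocity`, and the 3-dimensional feature vector are all assumptions made for the example; the abstract does not specify the solver, and in the actual method the velocity field would be a neural network conditioned on ASR embeddings.

```python
import numpy as np

def euler_sample(velocity_fn, x0, cond, num_steps=10):
    """Integrate dx/dt = velocity_fn(x, t, cond) from t=0 to t=1 with fixed-step Euler."""
    x = np.array(x0, dtype=float)
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt  # step start times are 0, dt, ..., 1 - dt
        x = x + dt * velocity_fn(x, t, cond)
    return x

def toy_velocity(x, t, cond):
    # Hypothetical stand-in for a trained network (assumption for this sketch):
    # the exact velocity of the straight-line path x_t = (1 - t) * x0 + t * cond.
    # It is only evaluated at t < 1, so the division is safe.
    return (cond - x) / (1.0 - t)

rng = np.random.default_rng(0)
cond = np.array([0.5, -1.0, 2.0])  # pretend target features (illustrative values)
x0 = rng.standard_normal(3)        # noise sample at t = 0
x1 = euler_sample(toy_velocity, x0, cond, num_steps=10)
```

With this toy field the straight-line path is followed exactly, so the sample lands on `cond` after 10 steps. A trained network only approximates such a field, and the step count then trades inference cost against integration error.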