This paper introduces a causal prosody mediation framework for expressive TTS, augmenting FastSpeech2 with emotion conditioning and counterfactual training objectives. The approach uses a structural causal model to disentangle emotional prosody from linguistic content, employing Indirect Path Constraint (IPC) and Counterfactual Prosody Constraint (CPC) loss terms. Experiments on multi-speaker emotional corpora demonstrate improved prosody manipulation, emotion rendering (higher MOS and emotion accuracy), intelligibility (lower WER), and speaker consistency compared to baseline FastSpeech2 variants.
Synthesize speech with fine-grained emotional control: a new causal training method lets you edit prosody "counterfactually" to express different emotions in the same utterance.
We propose a novel causal prosody mediation framework for expressive text-to-speech (TTS) synthesis. Our approach augments the FastSpeech2 architecture with explicit emotion conditioning and introduces counterfactual training objectives to disentangle emotional prosody from linguistic content. By formulating a structural causal model of how text (content), emotion, and speaker jointly influence prosody (duration, pitch, energy) and ultimately the speech waveform, we derive two complementary loss terms: an Indirect Path Constraint (IPC) to enforce that emotion affects speech only through prosody, and a Counterfactual Prosody Constraint (CPC) to encourage distinct prosody patterns for different emotions. The resulting model is trained on multi-speaker emotional corpora (LibriTTS, EmoV-DB, VCTK) with a combined objective that includes standard spectrogram reconstruction and variance prediction losses alongside our causal losses. In evaluations on expressive speech synthesis, our method achieves significantly improved prosody manipulation and emotion rendering, with higher mean opinion scores (MOS) and emotion accuracy than baseline FastSpeech2 variants. We also observe better intelligibility (lower WER) and speaker consistency when transferring emotions across speakers. Extensive ablations confirm that the causal objectives successfully disentangle prosody attribution, yielding an interpretable model that allows controlled counterfactual prosody editing (e.g., "same utterance, different emotion") without compromising naturalness. We discuss the implications for identifiability in prosody modeling and outline limitations, such as the assumption that emotion effects are fully captured by pitch, duration, and energy. Our work demonstrates how integrating causal learning principles into TTS can improve controllability and expressiveness in generated speech.
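The abstract does not specify the exact form of the IPC and CPC terms. As a rough sketch of how such causal losses might combine with FastSpeech2's standard objectives, the PyTorch-style snippet below pairs an L1 reconstruction and variance losses with an IPC term (comparing a full decode against an emotion-mediated decode) and a CPC hinge term (pushing apart prosody predicted for the same text under different emotion labels). The model interface (`decode_mediated`, `predict_prosody`), the L1/hinge loss forms, and the lambda weights are all illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch: FastSpeech2 objectives plus assumed IPC/CPC causal terms.
# All interfaces, loss forms, and weights here are illustrative, not the paper's code.
import torch
import torch.nn.functional as F

def ipc_loss(mel_full, mel_mediated):
    """Indirect Path Constraint (assumed form): penalize any effect of the
    emotion label on the mel output that does NOT pass through predicted
    prosody. `mel_mediated` is assumed to be a decode in which the emotion
    embedding reaches only the variance adaptor (detached elsewhere), so a
    residual difference indicates a direct emotion-to-speech path."""
    return F.l1_loss(mel_full, mel_mediated)

def cpc_loss(prosody_factual, prosody_counterfactual, margin=0.5):
    """Counterfactual Prosody Constraint (assumed form): for the same text,
    prosody under a swapped emotion label must differ by at least a margin."""
    dist = (prosody_factual - prosody_counterfactual).abs().mean()
    return F.relu(margin - dist)

def total_loss(batch, model, lambda_ipc=1.0, lambda_cpc=1.0):
    out = model(batch.text, batch.emotion, batch.speaker)
    # Standard FastSpeech2 terms: mel reconstruction + variance prediction.
    l_mel = F.l1_loss(out.mel, batch.mel)
    l_var = (F.mse_loss(out.pitch, batch.pitch)
             + F.mse_loss(out.energy, batch.energy)
             + F.mse_loss(out.log_duration, batch.log_duration))
    # Causal terms (hypothetical methods on `model`).
    mel_mediated = model.decode_mediated(batch.text, batch.emotion, batch.speaker)
    perm = torch.randperm(batch.emotion.size(0), device=batch.emotion.device)
    prosody_cf = model.predict_prosody(batch.text, batch.emotion[perm], batch.speaker)
    l_ipc = ipc_loss(out.mel, mel_mediated)
    l_cpc = cpc_loss(out.prosody, prosody_cf)  # prosody = stacked pitch/energy/duration
    return l_mel + l_var + lambda_ipc * l_ipc + lambda_cpc * l_cpc
```

Under these assumptions, the IPC term gives gradients that suppress direct emotion-to-spectrogram leakage, while the CPC hinge keeps counterfactual prosody trajectories separated, matching the abstract's "same utterance, different emotion" editing goal.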