Mar 15, 2026 · arXiv:2603.14432

Affectron: Emotional Speech Synthesis with Affective and Contextually Aligned Nonverbal Vocalizations

Deok-Hyeon Cho, Hyung-Seok Oh, Seung-Bin Kim, Seong-Whan Lee

AI Summary

Affectron is a framework for emotional speech synthesis that targets the generation of diverse, contextually aligned nonverbal vocalizations (NVs). It applies an NV-augmented training strategy to a small-scale corpus to expand the distribution of NV types and insertion locations, and it incorporates NV structural masking into a speech backbone pre-trained on purely verbal speech. In the reported experiments, Affectron generates more expressive and diverse NVs than the baselines while maintaining the naturalness of the verbal speech.

Key Contribution

A novel NV-augmented training strategy works around the scarcity of NV data, making nonverbal cues such as laughter and sighs more expressive and natural in synthesized speech.

Abstract

Nonverbal vocalizations (NVs), such as laughter and sighs, are central to the expression of affective cues in emotional speech synthesis. However, learning diverse and contextually aligned NVs remains challenging in open settings due to limited NV data and the lack of explicit supervision. Motivated by this challenge, we propose Affectron as a framework for affective and contextually aligned NV generation. Built on a small-scale open and decoupled corpus, Affectron introduces an NV-augmented training strategy that expands the distribution of NV types and insertion locations. We further incorporate NV structural masking into a speech backbone pre-trained on purely verbal speech to enable diverse and natural NV synthesis. Experimental results demonstrate that Affectron produces more expressive and diverse NVs than baseline systems while preserving the naturalness of the verbal speech stream.
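The abstract does not spell out how the augmentation or the masking is implemented, so the following is only a minimal illustrative sketch of the general idea of varying NV types and insertion locations. It is not the authors' method. The tag strings, the function names `augment_with_nv` and `nv_mask`, and the token-level boolean mask are all assumptions made for illustration; in particular, the simple mask here is a stand-in and should not be read as the paper's NV structural masking.

```python
import random

# Hypothetical NV inventory; the abstract names only laughter and sighs.
NV_TYPES = ["[laughter]", "[sigh]"]


def augment_with_nv(words, rng, nv_types=NV_TYPES):
    """Insert one NV tag of a random type at a random word boundary.

    Returns the augmented token list and the index of the inserted tag.
    """
    # Position 0 means before the first word, len(words) means after the last.
    pos = rng.randint(0, len(words))
    nv = rng.choice(nv_types)
    return words[:pos] + [nv] + words[pos:], pos


def nv_mask(tokens, nv_types=NV_TYPES):
    """Boolean mask: True at NV tags, False at verbal tokens (assumed design)."""
    return [tok in nv_types for tok in tokens]


rng = random.Random(0)  # fixed seed so the sketch is reproducible
words = "I cannot believe you did that".split()
tokens, pos = augment_with_nv(words, rng)
mask = nv_mask(tokens)
```

Calling `augment_with_nv` repeatedly with different seeds yields the same sentence with different NV types at different boundaries, which is the kind of variation in type and location the abstract describes; a mask like `mask` could then tell a model which positions are nonverbal.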

Natural Language Processing · Speech & Audio

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
