The paper introduces Emotion Transition-Aware Speech Captioning (EmoTransCap), a paradigm for generating speech captions that incorporate temporal emotion dynamics at the discourse level. To support it, the authors built a large-scale dataset with an automated pipeline designed to capture emotion transitions within spoken discourse. They also developed a Multi-Task Emotion Transition Recognition (MTETR) model for joint emotion transition detection and diarization, and used LLMs to generate two versions of the annotations: descriptive and instruction-oriented.
Forget static emotion labels – EmoTransCap lets you generate speech captions that actually track how emotions evolve in a conversation.
Emotion perception and adaptive expression are fundamental capabilities in human-agent interaction. While recent advances in speech emotion captioning (SEC) have improved fine-grained emotional modeling, existing systems remain limited to static, single-emotion characterization within isolated sentences, neglecting dynamic emotional transitions at the discourse level. To address this gap, we propose Emotion Transition-Aware Speech Captioning (EmoTransCap), a paradigm that integrates temporal emotion dynamics with discourse-level speech description. To construct a dataset rich in emotion transitions while enabling scalable expansion, we design an automated pipeline for dataset creation. This is the first large-scale dataset explicitly designed to capture discourse-level emotion transitions. To generate semantically rich descriptions, we incorporate acoustic attributes and temporal cues from discourse-level speech. Our Multi-Task Emotion Transition Recognition (MTETR) model performs joint emotion transition detection and diarization. Leveraging the semantic analysis capabilities of LLMs, we produce two annotation versions: descriptive and instruction-oriented. These data and annotations offer a valuable resource for advancing emotion perception and emotional expressiveness. The dataset enables speech captions that capture emotional transitions, facilitating temporal-dynamic and fine-grained emotion understanding. We also introduce a controllable, transition-aware emotional speech synthesis system at the discourse level, enhancing anthropomorphic emotional expressiveness and supporting emotionally intelligent conversational agents.
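To make the relationship between emotion diarization and transition detection concrete, here is a minimal Python sketch. It is not the MTETR model or the paper's annotation format: the `EmotionSegment` and `EmotionTransition` types, the `find_transitions` helper, and the example labels and timestamps are all hypothetical, chosen only for illustration. It assumes diarization yields time-stamped segments with one emotion label each, and treats a transition as a boundary where the label changes.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class EmotionSegment:
    """Hypothetical diarization output: one emotion label over a time span."""
    start: float  # seconds from the start of the discourse
    end: float
    emotion: str


@dataclass(frozen=True)
class EmotionTransition:
    """Hypothetical transition record: where the emotion label changes."""
    time: float
    source: str
    target: str


def find_transitions(segments: List[EmotionSegment]) -> List[EmotionTransition]:
    """Return a transition at every boundary where the emotion label changes."""
    ordered = sorted(segments, key=lambda s: s.start)
    transitions = []
    # Compare each segment with its successor; equal labels mean no transition.
    for prev, cur in zip(ordered, ordered[1:]):
        if prev.emotion != cur.emotion:
            transitions.append(EmotionTransition(cur.start, prev.emotion, cur.emotion))
    return transitions


# Illustrative values only, not taken from the dataset.
segments = [
    EmotionSegment(0.0, 3.2, "neutral"),
    EmotionSegment(3.2, 5.0, "neutral"),
    EmotionSegment(5.0, 9.4, "angry"),
    EmotionSegment(9.4, 12.0, "sad"),
]
transitions = find_transitions(segments)
for t in transitions:
    print(f"{t.time:.1f}s: {t.source} -> {t.target}")
```

In this toy view, a single-emotion caption would collapse the whole clip into one label, while a transition-aware caption could describe the neutral-to-angry-to-sad sequence along with its timing.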