This paper introduces a paralinguistic- and emotion-aware model for detecting optimal validation timing in Japanese empathetic spoken dialogue, without relying on textual context. The model leverages continued self-supervised training and fine-tuning of HuBERT backbones to build a paralinguistics-aware encoder and a speech emotion classification encoder, which are then fused. Experiments on the TUT Emotional Storytelling Corpus (TESC) demonstrate significant improvements over speech-based baselines, highlighting the importance of non-linguistic cues for empathetic human-robot interaction.
You can predict the best moment to offer emotional support just by listening to someone's voice, no text needed.
Emotional validation is a psychotherapy communication technique that involves recognizing, understanding, and explicitly acknowledging another person's feelings and actions, which strengthens the therapeutic alliance and reduces negative affect. To maximize the emotional support that validation provides, it is crucial to deliver it with appropriate timing and frequency. This study investigates validation timing detection from speech alone. Leveraging both paralinguistic and emotional information, we propose a paralinguistic- and emotion-aware model for validation timing detection that does not rely on textual context. Specifically, we first conduct continued self-supervised training and fine-tuning on different HuBERT backbones to obtain (i) a paralinguistics-aware Self-Supervised Learning (SSL) encoder and (ii) a multi-task speech emotion classification encoder. We then fuse these encoders and further fine-tune the combined model on the downstream validation timing detection task. Experimental evaluations on the TUT Emotional Storytelling Corpus (TESC) compare multiple models, fusion mechanisms, and training strategies, and demonstrate that the proposed approach achieves significant improvements over conventional speech baselines. Our results indicate that non-linguistic speech cues, when integrated with affect-related representations, carry sufficient signal to decide when validation should be expressed, offering a speech-first pathway toward more empathetic human-robot interaction.
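The two-encoder fusion described above can be sketched in PyTorch. This is a minimal illustration, not the paper's implementation: the tiny `StubEncoder` modules stand in for the fine-tuned HuBERT backbones, and the concatenation-based fusion and binary timing head are assumptions chosen to mirror the abstract's description.

```python
import torch
import torch.nn as nn


class StubEncoder(nn.Module):
    """Placeholder for a fine-tuned HuBERT backbone (hypothetical)."""

    def __init__(self, hidden_dim: int = 64):
        super().__init__()
        # Maps each raw waveform sample to a feature vector; a real
        # backbone would instead produce frame-level SSL representations.
        self.proj = nn.Linear(1, hidden_dim)

    def forward(self, wav: torch.Tensor) -> torch.Tensor:
        # wav: (batch, samples) -> features: (batch, samples, hidden_dim)
        return torch.tanh(self.proj(wav.unsqueeze(-1)))


class ValidationTimingDetector(nn.Module):
    """Fuses a paralinguistics-aware and an emotion encoder for
    binary validation-timing detection (sketch)."""

    def __init__(self, hidden_dim: int = 64):
        super().__init__()
        self.para_encoder = StubEncoder(hidden_dim)     # paralinguistics-aware SSL encoder
        self.emotion_encoder = StubEncoder(hidden_dim)  # emotion classification encoder
        self.head = nn.Linear(2 * hidden_dim, 1)        # validation-timing logit

    def forward(self, wav: torch.Tensor) -> torch.Tensor:
        para = self.para_encoder(wav).mean(dim=1)       # mean-pool over time
        emo = self.emotion_encoder(wav).mean(dim=1)
        fused = torch.cat([para, emo], dim=-1)          # simple concatenation fusion
        return self.head(fused).squeeze(-1)             # (batch,) logits


model = ValidationTimingDetector()
logits = model(torch.randn(2, 16000))  # two 1-second clips at 16 kHz
print(logits.shape)  # torch.Size([2])
```

In the paper, the fused model is further fine-tuned end to end on the downstream task; here both encoders are randomly initialized, so the sketch only shows the data flow, not the training strategy.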