This paper addresses the problem of evaluating conversational naturalness in multi-turn, two-speaker dialogues, which existing single-speaker naturalness predictors handle poorly. Using conversational recordings annotated with human ratings, the authors show that existing naturalness estimators correlate weakly, and sometimes negatively, with human judgments of conversational naturalness. They then introduce a dual-channel naturalness estimator that builds on pre-trained encoders and data augmentation, achieving substantially higher correlation with human ratings in both in-domain and out-of-domain settings.
Existing speech naturalness predictors struggle to judge multi-turn conversations, but a new dual-channel estimator tracks human ratings far more closely.
Evaluation of conversational naturalness is essential for developing human-like speech agents. However, existing speech naturalness predictors are often designed to assess utterances from a single speaker, failing to capture conversation-level naturalness qualities. In this paper, we present a framework for an automatic naturalness predictor for two-speaker, multi-turn conversations. We first show that existing naturalness estimators have low, or sometimes even negative, correlations with conversational naturalness, based on conversational recordings annotated with human ratings. We then propose a dual-channel naturalness estimator, in which we investigate multiple pre-trained encoders with data augmentation. Our proposed model achieves substantially higher correlation with human judgments compared to existing naturalness predictors for both in-domain and out-of-domain conditions.
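The evaluation described above compares predictor scores against human ratings by correlation. The following is a minimal sketch of that kind of comparison, not the authors' evaluation code. It assumes Spearman rank correlation, which the abstract does not specify, and the ratings and predictor scores are made-up numbers for illustration. The names `average_ranks`, `pearson`, and `spearman` are hypothetical helpers.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        # Extend j to cover the run of values tied with the one at position i.
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        tied_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = tied_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (non-constant inputs)."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Made-up illustrative data: one human rating per conversation, plus scores
# from two hypothetical predictors (assumptions, not results from the paper).
human_ratings = [1.0, 2.0, 3.0, 4.0, 5.0]
predictor_a = [2.1, 2.0, 3.5, 3.9, 4.8]  # mostly agrees with the ratings
predictor_b = [4.0, 3.0, 3.5, 2.0, 1.0]  # mostly reverses the ratings

rho_a = spearman(human_ratings, predictor_a)
rho_b = spearman(human_ratings, predictor_b)
print(f"predictor A: {rho_a:.2f}, predictor B: {rho_b:.2f}")
```

A predictor whose ordering of conversations mostly reverses the human ordering gets a negative coefficient, which is the failure mode the abstract reports for some existing estimators.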