Mar 19, 2026arXiv:2603.18758

Dual-Model Prediction of Affective Engagement and Vocal Attractiveness from Speaker Expressiveness in Video Learning

Hung-Yue Suen, Kuo-En Hung, Fan-Hsun Tseng

AI Summary

This paper introduces a speaker-centric Emotion AI approach for predicting audience affective engagement and vocal attractiveness in video learning, using only speaker-side affective expressions. Two regression models were trained on a large MOOC corpus: one using facial dynamics, oculomotor features, prosody, and semantics to predict affective engagement, and another using acoustic features to predict vocal attractiveness. The models achieved high predictive performance on speaker-independent test sets (R2 = 0.85 and 0.88, respectively), demonstrating the feasibility of inferring audience feedback from speaker behavior alone.

Key Contribution

You can predict how engaged and attracted viewers are to a video lecture just by analyzing the speaker's face and voice, no audience data needed.

Abstract

This paper outlines a machine learning-enabled speaker-centric Emotion AI approach capable of predicting audience-affective engagement and vocal attractiveness in asynchronous video-based learning, relying solely on speaker-side affective expressions. Inspired by the demand for scalable, privacy-preserving affective computing applications, this speaker-centric Emotion AI approach incorporates two distinct regression models that leverage a massive corpus developed within Massive Open Online Courses (MOOCs) to enable affectively engaging experiences. The regression model predicting affective engagement is developed by assimilating emotional expressions emanating from facial dynamics, oculomotor features, prosody, and cognitive semantics, while incorporating a second regression model to predict vocal attractiveness based exclusively on speaker-side acoustic features. Notably, on speaker-independent test sets, both regression models yielded impressive predictive performance (R2 = 0.85 for affective engagement and R2 = 0.88 for vocal attractiveness), confirming that speaker-side affect can functionally represent aggregated audience feedback. This paper provides a speaker-centric Emotion AI approach substantiated by an empirical study discovering that speaker-side multimodal features, including acoustics, can prospectively forecast audience feedback without necessarily employing audience-side input information.

Computer Vision Natural Language Processing Speech & Audio

Citation Metrics

Citations0

Influential citations0

References56

Year2026

VenueN/A

Related Papers

Finding related papers...

Search

Dual-Model Prediction of Affective Engagement and Vocal Attractiveness from Speaker Expressiveness in Video Learning

Related Papers