This study evaluates large language models (LLMs) on their ability to predict addressees, turn changes, and the next speaker in multimodal, multi-party conversations. Using the AMI corpus, the authors found that LLMs surpassed both supervised models and human participants in next speaker prediction, even though the LLMs were not trained on the target domain and had no access to audio-visual cues. The results highlight the importance of conversational context, particularly for next speaker prediction. They also show that a multimodal LLM, although better than text-based LLMs on addressee detection and turn-change prediction, remained below human performance, which suggests difficulty in exploiting raw audio-visual signals.
LLMs can outperform humans in predicting the next speaker in meetings, even without audio or visual data.
We investigate turn-taking in multimodal multi-party conversations using large language models (LLMs). We construct an evaluation framework for three tasks: addressee detection, turn-change prediction, and next speaker prediction. We compare supervised models trained for these tasks, text-based LLMs, multimodal LLMs (MM-LLMs), and human subjects. Experiments on the AMI corpus showed that LLMs outperformed supervised models and humans in next speaker prediction, despite not being trained on the target domain and without access to audio or visual information. An MM-LLM performed better than text-based LLMs on addressee detection and turn-change prediction but remained below human performance, indicating difficulty leveraging raw audio-visual signals. Ablation analyses revealed that conversational context was critical, particularly for next speaker prediction. We observed that human and LLM prediction patterns were similar, and intervals with frequent turn changes were difficult for both.
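As a rough illustration of how two of the three tasks can be framed, the sketch below derives turn-change and next-speaker labels from a speaker-attributed transcript. It is a minimal sketch under stated assumptions, not the authors' evaluation framework: the `Utterance` type, the `turn_taking_labels` function, and the toy meeting are hypothetical, and labels are taken at utterance boundaries only. Addressee detection is omitted because it relies on annotated addressee labels that cannot be inferred from speaker order alone.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Utterance:
    """One transcribed utterance (hypothetical structure for illustration)."""
    speaker: str
    text: str


def turn_taking_labels(utterances: List[Utterance]) -> List[Tuple[bool, str]]:
    """Return (turn_change, next_speaker) for every utterance except the last.

    turn_change is True when the following utterance comes from a different
    speaker; next_speaker is the speaker of the following utterance.
    """
    labels = []
    for current, following in zip(utterances, utterances[1:]):
        labels.append((current.speaker != following.speaker, following.speaker))
    return labels


# Toy four-utterance meeting excerpt (invented data, not from the AMI corpus).
meeting = [
    Utterance("A", "Shall we start with the remote design?"),
    Utterance("A", "I have a few sketches."),
    Utterance("B", "Sure, go ahead."),
    Utterance("C", "Can we look at the budget first?"),
]

labels = turn_taking_labels(meeting)
for utterance, (changed, next_speaker) in zip(meeting, labels):
    print(f"{utterance.speaker}: turn_change={changed}, next_speaker={next_speaker}")
```

In this framing, a model would receive the conversation up to the current utterance as context and be scored against these labels, which is one way to see why the amount of conversational context matters for next speaker prediction.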