CMU ML | NTT | Jun 16, 2026 | arXiv:2606.17542

Evaluating Large Language Models Abilities for Addressee, Turn-change, and Next Speaker Prediction in Meetings

Ryo Fukuda, Takatomo Kano, Siddhant Arora, Marc Delcroix, Naohiro Tawara, Atsunori Ogawa, Yuya Chiba, Atsushi Ando, William Chen, Shinji Watanabe

AI Summary

This study evaluates large language models (LLMs) on their ability to predict addressees, turn changes, and the next speaker in multimodal, multi-party conversations. Using the AMI corpus, the authors found that LLMs surpassed both supervised models and human participants in next speaker prediction, despite having no training on the target domain and no access to audio-visual cues. A multimodal LLM outperformed text-based LLMs on addressee detection and turn-change prediction but remained below human performance, which suggests that it has difficulty leveraging raw audio-visual signals. Ablation analyses showed that conversational context is critical, particularly for next speaker prediction.

Key Contribution

LLMs can outperform humans in predicting the next speaker in meetings, even without audio or visual data.

Abstract

We investigate turn-taking in multimodal multi-party conversations using large language models (LLMs). We construct an evaluation framework for three tasks: addressee detection, turn-change prediction, and next speaker prediction. We compare supervised models trained for these tasks, text-based LLMs, multimodal LLMs (MM-LLMs), and human subjects. Experiments on the AMI corpus showed that LLMs outperformed supervised models and humans in next speaker prediction, despite not being trained on the target domain and without access to audio or visual information. An MM-LLM performed better than text-based LLMs on addressee detection and turn-change prediction but remained below human performance, indicating difficulty leveraging raw audio-visual signals. Ablation analyses revealed that conversational context was critical, particularly for next speaker prediction. We observed that human and LLM prediction patterns were similar, and intervals with frequent turn changes were difficult for both.
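The abstract names the three tasks but does not give their label definitions or metrics. As an illustration only, the following minimal sketch shows one way reference labels for turn-change and next speaker prediction could be derived from a speaker-annotated transcript and scored with plain accuracy. The `Utterance` type, the function names, the toy dialogue, and the "same speaker continues" baseline are all assumptions made for this example. They are not the authors' framework or the AMI corpus format.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence


@dataclass
class Utterance:
    """Hypothetical record for one utterance; not the AMI corpus schema."""
    speaker: str
    text: str
    addressee: Optional[str] = None  # e.g. another speaker ID or "group"


def turn_change_labels(utts: Sequence[Utterance]) -> List[bool]:
    """Label i is True when utterance i+1 comes from a different speaker than utterance i."""
    return [a.speaker != b.speaker for a, b in zip(utts, utts[1:])]


def next_speaker_labels(utts: Sequence[Utterance]) -> List[str]:
    """Label i is the speaker of utterance i+1."""
    return [b.speaker for b in utts[1:]]


def accuracy(predictions: Sequence, references: Sequence) -> float:
    """Fraction of positions where the prediction equals the reference."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    return sum(p == r for p, r in zip(predictions, references)) / len(references)


# Toy dialogue (invented for illustration).
meeting = [
    Utterance("A", "Shall we start with the remote design?", addressee="group"),
    Utterance("B", "Sure, I have the sketches.", addressee="A"),
    Utterance("A", "Great.", addressee="B"),
    Utterance("A", "What about the budget?", addressee="C"),
    Utterance("C", "We are within limits.", addressee="A"),
]

# Trivial baseline (an assumption, not from the paper): the current speaker continues.
baseline_next = [u.speaker for u in meeting[:-1]]
baseline_change = [False] * (len(meeting) - 1)

next_acc = accuracy(baseline_next, next_speaker_labels(meeting))
change_acc = accuracy(baseline_change, turn_change_labels(meeting))
```

On this toy dialogue the baseline is right only at the one position where speaker A keeps the floor. This is consistent with the abstract's observation that intervals with frequent turn changes are hard to predict. Addressee detection could be scored the same way by comparing predicted and annotated `addressee` values.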

Eval Frameworks & Benchmarks | Multimodal Models | Natural Language Processing

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
