The authors introduce CoMMET, a new multimodal benchmark dataset designed to evaluate Theory of Mind (ToM) capabilities in LLMs across a broader range of mental states and in multi-turn conversational settings. They evaluate various LLMs using CoMMET, revealing their strengths and limitations in social reasoning. The results provide insights into the current state of LLMs' ToM abilities and highlight areas for future research to enhance their social cognitive skills.
LLMs still struggle with social intelligence, as shown by a new multimodal benchmark revealing limitations in their ability to reason about mental states in multi-turn conversations.
Theory of Mind (ToM), the ability to reason about the mental states of oneself and others, is a cornerstone of human social intelligence. As Large Language Models (LLMs) become ubiquitous in real-world applications, validating their capacity for this level of social reasoning is essential for effective and natural interactions. However, existing benchmarks for assessing ToM in LLMs are limited; most rely solely on text inputs and focus narrowly on belief-related tasks. In this paper, we propose a new multimodal benchmark dataset, CoMMET, a Comprehensive Mental states and Moral Evaluation Task inspired by the Theory of Mind Booklet Task. CoMMET expands the scope of evaluation by covering a broader range of mental states and introducing multi-turn testing. To the best of our knowledge, this is the first multimodal dataset to evaluate ToM in a multi-turn conversational setting. Through a comprehensive assessment of LLMs across different families and sizes, we analyze the strengths and limitations of current models and identify directions for future improvement. Our work offers a deeper understanding of the social cognitive capabilities of modern LLMs.