The paper introduces CoCoEval, a framework that uses an LLM-as-a-Judge to detect 10 types of inconsistent and uncollaborative behaviors in conversations at the turn level. The authors compare the frequency of these behaviors in LLM-simulated (GPT-4.1, GPT-5.1, Claude Opus 4) versus human conversations across diverse settings. Results show that LLMs underproduce these behaviors under vanilla prompting, that prompt engineering is unreliable, and that supervised fine-tuning (SFT) can lead to overproduction of a narrow set of behaviors, suggesting LLMs are currently poor proxies for human social interaction.
LLMs, even when prompted or fine-tuned, struggle to replicate the messy reality of human conversation, raising serious questions about their utility as proxies for social interaction.
Simulating human conversations with large language models (LLMs) has emerged as a scalable methodology for modeling human social interaction. Such simulation is challenging, however, because human conversations inherently involve inconsistent and uncollaborative behaviors, such as misunderstandings and interruptions. Although reproducing these behaviors is integral to simulating human-like, complex social interaction, analysis comparing their occurrence in human- and LLM-generated conversations remains limited. In this work, we introduce CoCoEval, an evaluation framework that analyzes LLM-simulated conversations by detecting 10 types of inconsistent and uncollaborative behaviors at the turn level with an LLM-as-a-Judge. Using CoCoEval, we evaluate GPT-4.1, GPT-5.1, and Claude Opus 4 by comparing the frequencies of detected behaviors in conversations simulated by each model against those in human conversations across academic, business, and governmental meetings, as well as debates. Our analysis shows that (1) under vanilla prompting, LLM-simulated conversations exhibit far fewer inconsistent and uncollaborative behaviors than human conversations; (2) prompt engineering does not provide reliable control over these behaviors, with different prompts leading to either under- or overproduction; and (3) supervised fine-tuning on human conversations can lead LLMs to overproduce a narrow set of behaviors, such as repetition. Our findings highlight the difficulty of simulating human conversations and raise concerns about the use of LLMs as proxies for human social interaction.
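The abstract describes a turn-level LLM-as-a-Judge pipeline: each turn is judged against a taxonomy of behaviors, and per-behavior frequencies are then compared between human and simulated corpora. The sketch below shows one way such a detector could be wired up. It is a minimal sketch, not the paper's implementation: the behavior labels, the judge prompt, and the `judge_turn` / `behavior_frequencies` helpers are illustrative assumptions, and the OpenAI Python client is assumed for the judge calls.

```python
import json
from collections import Counter

from openai import OpenAI

client = OpenAI()

# Placeholder label set: the paper defines 10 behavior types; only a few
# plausible names are shown here, and they are assumptions, not the paper's taxonomy.
BEHAVIORS = ["misunderstanding", "interruption", "repetition", "topic_shift", "self_contradiction"]

# Hypothetical judge prompt; the paper's actual prompt wording is not reproduced here.
JUDGE_PROMPT = (
    "You are judging one turn of a conversation.\n"
    "Context so far:\n{context}\n\n"
    "Current turn:\n{turn}\n\n"
    "Return a JSON list of any of these behaviors the turn exhibits "
    "(or an empty list): " + ", ".join(BEHAVIORS)
)

def judge_turn(context: str, turn: str, model: str = "gpt-4.1") -> list[str]:
    """Ask an LLM judge which behaviors, if any, a single turn exhibits."""
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(context=context, turn=turn)}],
    )
    try:
        labels = json.loads(resp.choices[0].message.content)
    except (json.JSONDecodeError, TypeError):
        labels = []
    # Keep only labels from the known taxonomy, discarding judge hallucinations.
    return [label for label in labels if label in BEHAVIORS]

def behavior_frequencies(conversation: list[str]) -> Counter:
    """Judge every turn in context and aggregate detections per behavior."""
    counts: Counter = Counter()
    for i, turn in enumerate(conversation):
        context = "\n".join(conversation[:i])
        counts.update(judge_turn(context, turn))
    return counts
```

Running `behavior_frequencies` over a human corpus and over each model's simulated corpus, normalized by turn count, yields the kind of frequency comparison the paper uses to diagnose under- or overproduction of each behavior.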