Mar 10, 2026arXiv:2603.09701

An Empirical Study of Interaction Smells in Multi-Turn Human-LLM Collaborative Code Generation

Binquan Zhang, Li Zhang, Lin Shi, Song Wang, Yuwei Qian, Linhui Zhao, Fang Liu, An Fu, Yida Ye

AI Summary

This paper introduces the concept of "Interaction Smells" to characterize quality issues in multi-turn human-LLM code generation, focusing on problems beyond functional correctness. Through manual analysis of real-world interaction logs from WildChat and LMSYS-Chat-1M, the authors establish a taxonomy of Interaction Smells categorized into User Intent Quality, Historical Instruction Compliance, and Historical Response Violation. They then quantitatively evaluate six mainstream LLMs and propose Invariant-aware Constraint Evolution (InCE), a multi-agent framework that improves interaction quality by extracting global invariants and performing pre-generation quality audits, demonstrating improved task success rates and reduced Interaction Smells.

Key Contribution

LLMs in collaborative coding often stumble on interaction subtleties, leading to a new class of problems called "Interaction Smells" that can now be systematically identified and mitigated.

Abstract

Large Language Models (LLMs) have revolutionized code generation, evolving from static tools into dynamic conversational interfaces that facilitate complex, multi-turn collaborative programming. While LLMs exhibit remarkable proficiency in generating standalone code snippets, they often struggle to maintain contextual consistency during extended interactions, creating significant obstacles in the collaboration process. Existing benchmarks primarily emphasize the functional correctness of the final output, overlooking latent quality issues within the interaction process itself, which we term Interaction Smells. In this paper, we conduct an empirical study on sampled real-word user-LLM interactions from WildChat and LMSYS-Chat-1M datasets to systematically investigate Interaction Smells in human-LLM code generation tasks from the perspectives of phenomena, distribution, and mitigation. First, we establish the first taxonomy of Interaction Smells by manually performing open card sorting on real-world interaction logs. This taxonomy categorizes Interaction Smells into three primary categories, i.e., User Intent Quality, Historical Instruction Compliance, and Historical Response Violation, comprising nine specific subcategories. Next, we quantitatively evaluate six mainstream LLMs (i.e., GPT-4o, DeepSeek-Chat, Gemini 2.5, Qwen2.5-32B, Qwen2.5-72B, and Qwen3-235B-a22b) to analyze the distribution of Interaction Smells across different models. Finally, we propose Invariant-aware Constraint Evolution (InCE), a multi-agent framework designed to improve multi-turn interaction quality through explicit extraction of global invariants and pre-generation quality audits. Experimental results on the extended WildBench benchmark demonstrate that this lightweight mitigation approach significantly improves the Task Success Rate and effectively suppresses the occurrence of Interaction Smells.

Code Generation & Program Synthesis Eval Frameworks & Benchmarks Tool Use & Agents

Citation Metrics

Citations0

Influential citations0

References0

Year2026

VenueN/A

Related Papers

Finding related papers...

Search

An Empirical Study of Interaction Smells in Multi-Turn Human-LLM Collaborative Code Generation

Related Papers