This paper systematically evaluates the conversational reliability of LLMs in multi-turn interactions across three tasks: maintaining global constraints, selecting the correct tool/agent, and tracking structured entities. By comparing single-turn and multi-turn performance, the authors quantify the degradation in reliability under extended dialogue for both commercial and open-source models. The study reveals significant declines in reliability, especially for smaller models, and identifies failure modes like instruction drift, intent confusion, and contextual overwriting.
LLMs can lose substantial reliability in multi-turn conversations, exhibiting failure modes such as instruction drift and intent confusion that undermine dependable deployment.
Large Language Models (LLMs) are increasingly deployed in real-world applications where users engage in extended, mixed-topic conversations that depend on prior context. Yet, their reliability under realistic multi-turn interactions remains poorly understood. We conduct a systematic evaluation of conversational reliability through three representative tasks that reflect practical interaction challenges: (1) maintaining global constraints across topic shifts, (2) selecting the correct tool or agent amid interleaved intents, and (3) tracking structured entities under revisions and distractions. Each task pairs single-turn and multi-turn settings, allowing us to quantify reliability degradation under extended dialogue. Across both commercial and open-source models, we observe substantial declines in reliability, particularly for smaller models. Error analyses reveal recurring failure modes such as instruction drift, intent confusion, and contextual overwriting, which compromise dependable behavior in operational systems. Our findings highlight the need for stress-testing LLMs for conversational reliability and developing more robust evaluation methods for trustworthy deployment.
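The paired single-turn and multi-turn design can be illustrated with a minimal sketch. It assumes each test case yields one pass/fail outcome per setting and measures degradation as the difference in accuracy. The abstract does not specify the paper's actual metric, so `PairedResult`, `reliability_drop`, and the task label used below are hypothetical names chosen for illustration only.

```python
from dataclasses import dataclass


@dataclass
class PairedResult:
    """Outcome of one test case run in both settings (hypothetical schema)."""
    task: str
    single_turn_correct: bool
    multi_turn_correct: bool


def reliability_drop(results):
    """Return {task: (single_turn_acc, multi_turn_acc, drop)}.

    Assumption: "drop" is single-turn accuracy minus multi-turn accuracy.
    This is an illustrative metric, not necessarily the one used in the paper.
    """
    by_task = {}
    for r in results:
        by_task.setdefault(r.task, []).append(r)

    summary = {}
    for task, rows in by_task.items():
        single = sum(r.single_turn_correct for r in rows) / len(rows)
        multi = sum(r.multi_turn_correct for r in rows) / len(rows)
        summary[task] = (single, multi, single - multi)
    return summary
```

Given a list of `PairedResult` records for a model, `reliability_drop(records)` returns one tuple per task. A positive third element means the model was less reliable in the multi-turn setting than in the matched single-turn setting.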