Raad Khraishi

NatWest AI Research, University College London

Abstract

Deployed multi-turn LLM systems routinely switch models mid-interaction due to upgrades, cross-provider routing, and fallbacks. Such handoffs create a context mismatch: the model generating later turns must condition on a dialogue prefix authored by a different model, potentially inducing silent performance drift. We introduce a switch-matrix benchmark that measures this effect by running a prefix model for early turns and a suffix model for the final turn, and comparing against the no-switch baseline using paired episode-level bootstrap confidence intervals. Across CoQA conversational QA and Multi-IF benchmarks, even a single-turn handoff yields prevalent and statistically significant, directional effects and may swing outcomes by $-8$ to $+13$ percentage points in Multi-IF strict success rate and $\pm 4$ absolute F1 on CoQA, comparable to the no-switch gap between common model tiers (e.g., GPT-5-nano vs GPT-5-mini). We further find systematic compatibility patterns: some suffix models degrade under nearly any non-self dialogue history, while others improve under nearly any foreign prefix. To enable compressed handoff risk monitoring, we decompose switch-induced drift into per-model prefix influence and suffix susceptibility terms, accounting for $\sim 70\%$ of variance across benchmarks. These results position handoff robustness as an operational reliability dimension that single-model benchmarks miss, motivating explicit monitoring and handoff-aware mitigation in multi-turn systems.

1 Introduction

Large language models (LLMs) are increasingly deployed as interactive systems where users engage in multi-turn dialogues rather than single prompts, and performance depends on maintaining state, adhering to evolving constraints, and staying consistent as context grows (Han, 2025). Prior work has shown that accuracy and instruction-following can degrade across turns and that earlier context can strongly shape later behavior (Kwan et al., 2024; He et al., 2024; Hankache et al., 2025). Yet most evaluations still implicitly assume a fixed model throughout an interaction.

In production, the model behind a conversation can change mid-session due to upgrades, cross-provider routing, or fallbacks (Chen et al., 2023), and even within a product line, updates can induce behavioral drift (Chen et al., 2024). Yet we lack direct measurements of what happens when a model must continue from a dialogue history authored by a different model. This handoff is a structured distribution shift: the suffix model conditions on a prefix generated by another model rather than by itself. As with embedding-model upgrades (Yoon and Arik, 2025), mismatched conventions (verbosity/format) and implicit commitments can propagate across turns, including under prompt-injection pressure (Chang et al., 2025).

In this paper, we introduce a switch-matrix benchmark for multi-turn systems that measures handoff-induced drift when one model must continue another model's conversation. Drift is computed via paired comparisons to a no-switch baseline, making the effect attributable to the handoff rather than episode variance. We evaluate on CoQA (Reddy et al., 2019) and Multi-IF (He et al., 2024; Zhou et al., 2023), two automatically-scored multi-turn benchmarks that stress conversational grounding and cumulative constraint adherence, using a diverse set of LLMs from leading providers, including Anthropic, OpenAI, and Google. Across both tasks, a single-turn handoff yields prevalent, statistically significant, and directional effects. In Multi-IF, higher-performing prefix models (higher no-switch score) can boost weaker suffixes by anchoring a compliant output protocol. In CoQA, drift persists even though the original text passage remains in the model's context, suggesting a bias toward inherited assistant state rather than missing evidence. To enable compressed handoff risk monitoring, we also show drift is largely explained by two per-model factors: prefix influence and suffix susceptibility.

We present, to our knowledge, the first cross-provider switch-matrix measurement study that isolates handoff-induced drift in multi-turn LLM systems via paired comparisons to a no-switch baseline. Our contributions are: (1) we formalize model switching as an operational source of drift in multi-turn LLM systems and introduce a switch-matrix protocol to measure it relative to no-switch baselines; (2) we provide an efficient evaluation harness with prefix caching and paired episode-level bootstrap analysis; (3) we report cross-model, cross-provider switch matrices on CoQA and Multi-IF, showing that even final-turn switching can induce measurable drift not predicted by single-model benchmark scores; and (4) we decompose switch-induced drift into per-model prefix influence and suffix susceptibility terms, enabling compressed handoff-risk monitoring.

2 Methodology

Let $\mathcal{M}=\{m_{1},\dots,m_{K}\}$ be a set of LLMs and let an episode $e$ denote a multi-turn benchmark instance (a dataset row executed in a fixed environment). For each ordered model pair $(A,B)\in\mathcal{M}\times\mathcal{M}$ we run a context-switch cell $(A\rightarrow B)$: model $A$ generates the first $T$ assistant turns, then model $B$ generates the remaining turns until termination. We focus on a final-turn switch policy, where $T=L-1$ and $L$ is the (fixed) number of assistant turns in the episode. This isolates a continuation problem: in every off-diagonal cell, the suffix model must produce a decisive final output while conditioning on a dialogue prefix authored by a different model, with the suffix responsible for exactly one turn across all cells.

For any suffix model $B$, the natural no-switch baseline is the diagonal cell $(B\rightarrow B)$, where $B$ authors the entire dialogue. We quantify the switch effect via paired per-episode differences,

$\delta_{A\rightarrow B}(e) = s_{A\rightarrow B}(e) - s_{B\rightarrow B}(e),$

where $s(\cdot)$ is the episode score under the benchmark metric. We summarize switch effects with the episode mean $\Delta_{A\rightarrow B}=\mathbb{E}_{e}[\delta_{A\rightarrow B}(e)]$; negative $\Delta_{A\rightarrow B}$ indicates that a prefix harms $B$ relative to the counterfactual where $B$ wrote its own context.

We evaluate switching on two deterministic, automatically-scored multi-turn benchmarks targeting complementary failure modes: conversational grounding and cumulative constraint adherence. As our protocol requires running a full $K\times K$ switch matrix over hundreds of episodes, yielding a large number of model calls, we prioritize benchmarks with fast, lightweight environments and inexpensive, fully automatic scoring to make the evaluation computationally and financially tractable. CoQA (Reddy et al., 2019) is conversational question answering over a passage (https://huggingface.co/datasets/stanfordnlp/coqa).
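The paired switch effect from the methodology, $\delta_{A\rightarrow B}(e) = s_{A\rightarrow B}(e) - s_{B\rightarrow B}(e)$ with episode mean $\Delta_{A\rightarrow B}$, can be illustrated with a minimal sketch. This is not the paper's evaluation harness: the function names, the input layout (a dictionary of per-episode scores per cell), and the choice of a percentile bootstrap with 2000 resamples are all assumptions made for illustration. The paper states only that it uses paired episode-level bootstrap confidence intervals.

```python
import random
from statistics import mean


def paired_deltas(scores_switch, scores_baseline):
    """Per-episode differences delta(e) = s_{A->B}(e) - s_{B->B}(e).

    Both arguments map an episode id to a score (hypothetical layout).
    """
    if scores_switch.keys() != scores_baseline.keys():
        raise ValueError("both cells must be scored on the same episodes")
    return [scores_switch[e] - scores_baseline[e] for e in sorted(scores_switch)]


def switch_effects(cells):
    """Mean delta for every off-diagonal cell.

    cells[(prefix, suffix)] maps episode id -> score; the diagonal cell
    (suffix, suffix) is the no-switch baseline for that suffix model.
    """
    effects = {}
    for (a, b), scores in cells.items():
        if a == b:
            continue  # the diagonal is the baseline, not a switch
        effects[(a, b)] = mean(paired_deltas(scores, cells[(b, b)]))
    return effects


def bootstrap_ci(deltas, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean delta (assumed variant).

    Episodes are resampled with replacement; because each delta is already
    a within-episode difference, the pairing is preserved.
    """
    rng = random.Random(seed)
    n = len(deltas)
    means = sorted(mean(rng.choices(deltas, k=n)) for _ in range(n_resamples))
    lo_idx = max(0, int((alpha / 2) * n_resamples))
    hi_idx = min(n_resamples - 1, int((1 - alpha / 2) * n_resamples) - 1)
    return mean(deltas), means[lo_idx], means[hi_idx]


if __name__ == "__main__":
    # Toy scores for two hypothetical models "A" and "B" on three episodes.
    cells = {
        ("A", "B"): {"e1": 1.0, "e2": 0.0, "e3": 1.0},
        ("B", "B"): {"e1": 1.0, "e2": 1.0, "e3": 0.0},
    }
    deltas = paired_deltas(cells[("A", "B")], cells[("B", "B")])
    point, lo, hi = bootstrap_ci(deltas)
    print(switch_effects(cells), point, lo, hi)
```

Resampling the per-episode differences, rather than the two score lists independently, is what keeps the comparison paired: episode difficulty cancels within each difference, so the interval reflects the handoff rather than variation between episodes.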

Stanford HAI

Research focus

Eval Frameworks & Benchmarks (1), Natural Language Processing (1)

Frequent co-authors

Iman Zafar (1), Katie Myles (1), Greig A. Cowan (1)

Papers (1)

Mar 3, 2026

Stanford HAI · Mar 3, 2026 · also Anthropic, NatWest AI Research

Evaluating Performance Drift from Model Switching in Multi-Turn LLM Systems

Model handoffs in multi-turn LLM systems can swing performance by up to 13 percentage points, revealing a hidden reliability risk that single-model benchmarks miss.

Raad Khraishi, Iman Zafar, Katie Myles +1

Eval Frameworks & Benchmarks, Natural Language Processing
