Greig A. Cowan

NatWest AI Research

... $T=L-1$ and $L$ is the (fixed) number of assistant turns in the episode. This isolates a continuation problem: in every off-diagonal cell, the suffix model must produce a decisive final output while conditioning on a dialogue prefix authored by a different model, with the suffix responsible for exactly one turn across all cells. For any suffix model $B$, the natural no-switch baseline is the diagonal cell $(B\!\rightarrow\!B)$, where $B$ authors the entire dialogue. We quantify the switch effect via paired per-episode differences, $\delta_{A\rightarrow B}(e)\;=\;s_{A\rightarrow B}(e)-s_{B\rightarrow B}(e)$, where $s(\cdot)$ is the episode score under the benchmark metric. We summarize switch effects with the episode mean $\Delta_{A\rightarrow B}=\mathbb{E}_{e}[\delta_{A\rightarrow B}(e)]$; negative $\Delta_{A\rightarrow B}$ indicates that a prefix harms $B$ relative to the counterfactual where $B$ wrote its own context.

We evaluate switching on two deterministic, automatically scored multi-turn benchmarks targeting complementary failure modes: conversational grounding and cumulative constraint adherence. Because our protocol requires running a full $K\times K$ switch matrix over hundreds of episodes, yielding a large number of model calls, we prioritize benchmarks with fast, lightweight environments and inexpensive, fully automatic scoring to keep the evaluation computationally and financially tractable. CoQA (Reddy et al., 2019) is conversational question answering over a passage (https://huggingface.co/datasets/stanfordnlp/coqa).
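As an illustration of the paired switch effect defined above, the following minimal Python sketch computes the per-episode differences and their mean for one prefix/suffix pair. The function name `switch_effect`, the layout of the `scores` dictionary, and the toy values are all assumptions made for this example; they are not taken from the paper's code or results.

```python
from statistics import mean


def switch_effect(scores, prefix_model, suffix_model):
    """Paired switch effect for the cell (prefix_model -> suffix_model).

    `scores` is a hypothetical mapping {(prefix, suffix): {episode_id: score}}.
    Returns the per-episode deltas and their mean, using only episodes
    that were scored in both the switched cell and the diagonal baseline.
    """
    switched = scores[(prefix_model, suffix_model)]
    baseline = scores[(suffix_model, suffix_model)]  # no-switch diagonal cell
    common = sorted(switched.keys() & baseline.keys())
    deltas = {e: switched[e] - baseline[e] for e in common}
    return deltas, mean(deltas.values())


# Toy scores, invented purely for illustration.
scores = {
    ("A", "B"): {"ep1": 0.5, "ep2": 1.0, "ep3": 0.0},
    ("B", "B"): {"ep1": 1.0, "ep2": 1.0, "ep3": 0.5},
}

deltas, delta_mean = switch_effect(scores, "A", "B")
print(deltas)
print(delta_mean)
```

In this toy case the mean is negative, which under the definition above would indicate that the prefix from A hurts B relative to B writing its own context. Pairing on the shared episode ids is a design choice of this sketch so that each difference compares the same episode in both cells.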

Stanford HAI

Research focus

Eval Frameworks & Benchmarks (1), Natural Language Processing (1)

Frequent co-authors

Raad Khraishi (1), Iman Zafar (1), Katie Myles (1)

Papers (1)

Mar 3, 2026

Stanford HAI · Mar 3, 2026 · also Anthropic, NatWest AI Research

Evaluating Performance Drift from Model Switching in Multi-Turn LLM Systems

Model handoffs in multi-turn LLM systems can swing performance by up to 13 percentage points, revealing a hidden reliability risk that single-model benchmarks miss.

Raad Khraishi, Iman Zafar, Katie Myles +1

Eval Frameworks & Benchmarks, Natural Language Processing
