NatWest AI Research

$T = L - 1$ and $L$ is the (fixed) number of assistant turns in the episode. This isolates a continuation problem: in every off-diagonal cell, the suffix model must produce a decisive final output while conditioning on a dialogue prefix authored by a different model, with the suffix responsible for exactly one turn across all cells. For any suffix model $B$, the natural no-switch baseline is the diagonal cell $(B \rightarrow B)$, where $B$ authors the entire dialogue. We quantify the switch effect via paired per-episode differences,

$$\delta_{A \rightarrow B}(e) = s_{A \rightarrow B}(e) - s_{B \rightarrow B}(e),$$

where $s(\cdot)$ is the episode score under the benchmark metric. We summarize switch effects with the episode mean $\Delta_{A \rightarrow B} = \mathbb{E}_{e}[\delta_{A \rightarrow B}(e)]$; a negative $\Delta_{A \rightarrow B}$ indicates that a prefix authored by $A$ harms $B$ relative to the counterfactual where $B$ wrote its own context.

We evaluate switching on two deterministic, automatically scored multi-turn benchmarks targeting complementary failure modes: conversational grounding and cumulative constraint adherence. Because our protocol requires running a full $K \times K$ switch matrix over hundreds of episodes, and therefore a large number of model calls, we prioritize benchmarks with fast, lightweight environments and inexpensive, fully automatic scoring to keep the evaluation computationally and financially tractable.

CoQA (Reddy et al., 2019) is conversational question answering over a passage (dataset: https://huggingface.co/datasets/stanfordnlp/coqa).
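The paired switch effect defined above can be illustrated with a minimal Python sketch. It assumes per-episode scores are already available for every (prefix, suffix) cell and are aligned by episode index. The function name `switch_effects`, the dictionary layout, the model labels `"A"` and `"B"`, and the toy scores are all hypothetical and made up for illustration; they are not taken from the paper or its data.

```python
from statistics import mean


def switch_effects(scores):
    """Compute paired switch effects from a score table.

    scores maps (prefix_model, suffix_model) to a list of per-episode
    scores, with episodes in the same order in every cell (an assumption
    of this sketch). Returns, for each off-diagonal cell, the per-episode
    deltas and their mean.
    """
    effects = {}
    for (prefix, suffix), handoff in scores.items():
        if prefix == suffix:
            continue  # diagonal cells are the no-switch baselines
        baseline = scores[(suffix, suffix)]  # suffix model wrote everything
        if len(handoff) != len(baseline):
            raise ValueError("episodes must be paired across cells")
        # delta_{A->B}(e) = s_{A->B}(e) - s_{B->B}(e)
        deltas = [h - b for h, b in zip(handoff, baseline)]
        # Delta_{A->B} is the mean of the per-episode deltas
        effects[(prefix, suffix)] = (deltas, mean(deltas))
    return effects


# Toy, made-up scores for two hypothetical models over four episodes.
toy_scores = {
    ("A", "A"): [1.0, 0.0, 1.0, 1.0],
    ("B", "B"): [1.0, 1.0, 0.0, 1.0],
    ("A", "B"): [1.0, 0.0, 0.0, 1.0],
    ("B", "A"): [1.0, 1.0, 1.0, 1.0],
}

effects = switch_effects(toy_scores)
for (prefix, suffix), (deltas, avg) in sorted(effects.items()):
    print(f"{prefix}->{suffix}: deltas={deltas}, mean={avg:+.2f}")
```

In this toy table the A→B cell has a negative mean, meaning the prefix written by A lowers B's score relative to B's own diagonal baseline, while the B→A cell has a positive mean. Pairing by episode compares each handoff against the same episode's baseline rather than comparing two unpaired averages.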
Stanford HAI
Model handoffs in multi-turn LLM systems can swing performance by up to 13 percentage points, revealing a hidden reliability risk that single-model benchmarks miss.