Anthropic | Mar 3, 2026 | arXiv:2603.03111

Evaluating Performance Drift from Model Switching in Multi-Turn LLM Systems

Raad Khraishi, Iman Zafar, Katie Myles, Greig A. Cowan

AI Summary

The paper introduces a "switch-matrix benchmark" that quantifies performance drift in multi-turn LLM systems caused by model handoffs during a conversation. The authors measure the effect of switching models mid-dialogue on the CoQA and Multi-IF benchmarks, comparing each switched configuration against a no-switch baseline using paired episode-level bootstrap confidence intervals. Even a single handoff produces statistically significant shifts, ranging from -8 to +13 percentage points in Multi-IF strict success rate and about ±4 absolute F1 on CoQA, which makes handoff robustness an operational reliability concern.
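The comparison against the no-switch baseline can be illustrated with a paired bootstrap over episodes. This is a minimal sketch, not the authors' code: the function name `paired_bootstrap_ci`, the percentile interval, the default resample count, and the toy scores are all assumptions made for illustration.

```python
import random
from statistics import mean


def paired_bootstrap_ci(switch_scores, baseline_scores,
                        n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean paired difference (switch - baseline).

    Both lists hold one score per episode, in the same episode order.
    Hypothetical helper; the paper's exact procedure may differ.
    """
    if len(switch_scores) != len(baseline_scores):
        raise ValueError("scores must be paired episode by episode")
    if not switch_scores:
        raise ValueError("need at least one episode")

    # Pairing: work with per-episode differences, so each resample keeps
    # the switched and baseline result for the same episode together.
    diffs = [s - b for s, b in zip(switch_scores, baseline_scores)]
    n = len(diffs)
    rng = random.Random(seed)

    # Resample episodes with replacement and record the mean difference.
    resampled = sorted(
        mean(rng.choices(diffs, k=n)) for _ in range(n_resamples)
    )

    lo_idx = int((alpha / 2) * n_resamples)
    hi_idx = min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))
    return mean(diffs), resampled[lo_idx], resampled[hi_idx]


if __name__ == "__main__":
    # Made-up per-episode success flags (1 = success), not data from the paper.
    baseline = [1, 1, 0, 1, 0, 1, 1, 0, 1, 1]
    switched = [1, 0, 0, 1, 0, 1, 0, 0, 1, 1]
    point, lo, hi = paired_bootstrap_ci(switched, baseline)
    print(f"drift = {point:+.2f}, 95% CI = [{lo:+.2f}, {hi:+.2f}]")
```

Under this sketch, a switch effect would be flagged as significant when the interval excludes zero; that decision rule is a common convention and is assumed here rather than taken from the paper.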

Key Contribution

Model handoffs in multi-turn LLM systems can shift Multi-IF strict success rate by -8 to +13 percentage points, revealing a hidden reliability risk that single-model benchmarks miss.

Abstract

Deployed multi-turn LLM systems routinely switch models mid-interaction due to upgrades, cross-provider routing, and fallbacks. Such handoffs create a context mismatch: the model generating later turns must condition on a dialogue prefix authored by a different model, potentially inducing silent performance drift. We introduce a switch-matrix benchmark that measures this effect by running a prefix model for early turns and a suffix model for the final turn, and comparing against the no-switch baseline using paired episode-level bootstrap confidence intervals. Across CoQA conversational QA and Multi-IF benchmarks, even a single-turn handoff yields prevalent and statistically significant, directional effects and may swing outcomes by -8 to +13 percentage points in Multi-IF strict success rate and +/- 4 absolute F1 on CoQA, comparable to the no-switch gap between common model tiers (e.g., GPT-5-nano vs GPT-5-mini). We further find systematic compatibility patterns: some suffix models degrade under nearly any non-self dialogue history, while others improve under nearly any foreign prefix. To enable compressed handoff risk monitoring, we decompose switch-induced drift into per-model prefix influence and suffix susceptibility terms, accounting for ~70% of variance across benchmarks. These results position handoff robustness as an operational reliability dimension that single-model benchmarks miss, motivating explicit monitoring and handoff-aware mitigation in multi-turn systems.
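The decomposition into prefix influence and suffix susceptibility can be read as a two-way additive model over the switch matrix. The sketch below is one plausible reading, not the authors' method: it assumes a fully populated drift matrix (rows are prefix models, columns are suffix models, diagonal entries are zero because they are the no-switch baseline), fits row and column effects from means, and reports the fraction of variance the additive terms explain. The function name `additive_fit` and the toy matrix are hypothetical.

```python
from statistics import mean


def additive_fit(drift):
    """Fit drift[p][s] ~ grand + prefix_effect[p] + suffix_effect[s].

    drift is a rectangular list of lists: rows index the prefix model,
    columns index the suffix model. Returns the grand mean, the per-row and
    per-column effects, and the share of variance explained (R^2).
    Hypothetical helper; the paper's fitting procedure may differ.
    """
    n_rows, n_cols = len(drift), len(drift[0])
    cells = [(i, j) for i in range(n_rows) for j in range(n_cols)]

    grand = mean(drift[i][j] for i, j in cells)
    # For a complete matrix, least-squares effects are centered row/column means.
    prefix_effect = [mean(row) - grand for row in drift]
    suffix_effect = [
        mean(drift[i][j] for i in range(n_rows)) - grand
        for j in range(n_cols)
    ]

    ss_total = sum((drift[i][j] - grand) ** 2 for i, j in cells)
    ss_resid = sum(
        (drift[i][j] - (grand + prefix_effect[i] + suffix_effect[j])) ** 2
        for i, j in cells
    )
    r2 = 1.0 - ss_resid / ss_total if ss_total > 0 else 1.0
    return grand, prefix_effect, suffix_effect, r2


if __name__ == "__main__":
    # Made-up drift values in percentage points, not results from the paper.
    toy = [
        [0.0, -3.0, 4.0],
        [2.0, 0.0, 5.0],
        [-1.0, -6.0, 0.0],
    ]
    grand, prefix, suffix, r2 = additive_fit(toy)
    print("prefix influence:", [round(x, 2) for x in prefix])
    print("suffix susceptibility:", [round(x, 2) for x in suffix])
    print("variance explained:", round(r2, 2))
```

With N models, this compresses N×N pairwise measurements into 2N per-model terms, which is what makes the additive form attractive for monitoring; pairs with large residuals are the ones the per-model terms fail to predict.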

Eval Frameworks & Benchmarks | Natural Language Processing

Citation Metrics

Citations: 0

Influential citations: 0

References: 10

Year: 2026

Venue: N/A
