Mar 17, 2026 | arXiv:2603.16759

TurnWise: The Gap between Single- and Multi-turn Language Model Capabilities

Valentina Pyatkin, Nathan Lambert, Hannaneh Hajishirzi

AI Summary

The authors introduce TurnWiseEval, a new benchmark for evaluating the multi-turn conversational abilities of language models, designed for direct comparison with single-turn performance. They also present TurnWiseData, a synthetic data pipeline for generating multi-turn training data. Experiments with Olmo 3 show that including as few as 10k multi-turn conversations during post-training improves multi-turn chat performance, yielding a 12% improvement on TurnWiseEval.

Key Contribution

In experiments with Olmo 3, including as few as 10k multi-turn conversations during post-training yields a 12% improvement on TurnWiseEval, highlighting the gap between single-turn and multi-turn capabilities.

Abstract

Multi-turn conversations are a common and critical mode of language model interaction. However, current open training and evaluation data focus on single-turn settings, failing to capture the additional dimension of these longer interactions. To understand this multi-/single-turn gap, we first introduce a new benchmark, TurnWiseEval, for multi-turn capabilities that is directly comparable to single-turn chat evaluation. Our evaluation isolates multi-turn specific conversational ability through pairwise comparison to equivalent single-turn settings. We additionally introduce our synthetic multi-turn data pipeline TurnWiseData which allows the scalable generation of multi-turn training data. Our experiments with Olmo 3 show that training with multi-turn data is vital to achieving strong multi-turn chat performance, and that including as little as 10k multi-turn conversations during post-training can lead to a 12% improvement on TurnWiseEval.
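
One way to picture the pairwise comparison described in the abstract is a win rate over judge verdicts. The following is a minimal sketch, not the authors' scoring code: the label names (`multi`, `single`, `tie`), the function `pairwise_win_rate`, and the choice to count a tie as half a win are all assumptions made for illustration.

```python
from collections import Counter

# Hypothetical verdict labels: which response a judge preferred for one prompt.
VALID_LABELS = {"multi", "single", "tie"}


def pairwise_win_rate(judgments):
    """Return the share of comparisons won by the multi-turn response.

    Ties count as half a win (an assumption, not taken from the paper).
    """
    if not judgments:
        raise ValueError("need at least one judgment")
    counts = Counter(judgments)
    unknown = set(counts) - VALID_LABELS
    if unknown:
        raise ValueError(f"unexpected labels: {sorted(unknown)}")
    return (counts["multi"] + 0.5 * counts["tie"]) / len(judgments)


if __name__ == "__main__":
    # Made-up verdicts purely for illustration.
    example = ["multi", "single", "tie", "multi"]
    print(pairwise_win_rate(example))
```

A score near 0.5 under this sketch would mean the judge sees little difference between the multi-turn response and its single-turn equivalent; the actual metric and judging setup used by TurnWiseEval may differ.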

Eval Frameworks & Benchmarks, Natural Language Processing

Citation Metrics

Citations: 0

Influential citations: 0

References: 19

Year: 2026

Venue: N/A
