TurnWise: The Gap between Single- and Multi-turn Language Model Capabilities

Victoria Graf1,2, Nathan Lambert2, Hannaneh Hajishirzi1,2
Allen Institute for AI (AI2)1

Abstract: Multi-turn conversations are a common and critical mode of language model interaction. However, current open training and evaluation data focus on single-turn settings, failing to capture the additional dimension of these longer interactions. To understand this multi-/single-turn gap, we first introduce TurnWiseEval, a new benchmark for multi-turn capabilities that is directly comparable to single-turn chat evaluation. Our evaluation isolates multi-turn-specific conversational ability through pairwise comparison with equivalent single-turn settings. We additionally introduce TurnWiseData, a synthetic multi-turn data pipeline that enables scalable generation of multi-turn training data. Our experiments with Olmo 3 show that training with multi-turn data is vital to achieving strong multi-turn chat performance, and that including as few as 10k multi-turn conversations during post-training can lead to a 12% improvement on TurnWiseEval.
RewardBench 2 exposes a stark reality for reward models: they struggle significantly on new, human-generated prompts, yet this difficulty is surprisingly predictive of their actual usefulness in downstream tasks.