Mar 9, 2026 | arXiv:2603.08216

DualTurn: Learning Turn-Taking from Dual-Channel Generative Speech Pretraining

AI Summary

DualTurn is introduced as a generative pretraining approach on dual-channel conversational audio to improve turn-taking in voice-based AI agents. The model autoregressively generates audio for both speakers, learning conversational dynamics without explicit labels, and is then fine-tuned to predict agent actions related to turn-taking. Experiments show DualTurn (0.5B) outperforms existing methods in agent action prediction and word-level turn prediction, demonstrating improved turn-taking anticipation with fewer interruptions.

Key Contribution

Silence timeouts are out: DualTurn learns natural turn-taking from unlabeled dual-channel audio, outperforming larger models and anticipating turns more accurately.
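For context, the silence-timeout baseline that this work argues against can be sketched as follows. This is a minimal illustration, not code from the paper: the function name, the 20 ms frame size, and the 700 ms timeout are all assumptions, and the per-frame speech flags are assumed to come from some voice activity detector.

```python
from typing import List, Optional


def silence_timeout_endpoint(
    frame_is_speech: List[bool],
    frame_ms: int = 20,      # assumed frame duration
    timeout_ms: int = 700,   # assumed silence threshold
) -> Optional[int]:
    """Return the frame index at which the agent takes the turn, or None.

    The agent takes the turn once the user has spoken at least once and
    then stayed silent for `timeout_ms` of consecutive frames.
    """
    frames_needed = timeout_ms // frame_ms
    silent_run = 0
    heard_speech = False
    for i, is_speech in enumerate(frame_is_speech):
        if is_speech:
            heard_speech = True
            silent_run = 0  # any speech resets the silence counter
        elif heard_speech:
            silent_run += 1
            if silent_run >= frames_needed:
                return i
    return None


# The user finishes speaking; the agent responds only after the full timeout.
finished = [True] * 10 + [False] * 40
# The user pauses mid-utterance and then resumes; the agent still takes the turn.
paused = [True] * 5 + [False] * 40 + [True] * 5
```

The rule cannot tell a finished turn from a long pause: both sequences above trigger a response, and the response is always delayed by the full timeout. Anticipating turn boundaries from both channels is meant to address exactly this.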

Abstract

Speech-to-speech models handle turn-taking naturally but offer limited support for tool-calling or complex reasoning, while production ASR-LLM-TTS voice pipelines offer these capabilities but rely on silence timeouts, which lead to unnatural turn-taking. We present DualTurn, which narrows this gap through generative pretraining on dual-channel conversational audio. The model generates both speakers' future audio autoregressively, implicitly learning conversational dynamics without any labels, and is then fine-tuned to predict interpretable turn-taking signals that map directly to agent actions. DualTurn monitors both channels continuously, anticipating turn boundaries and producing five agent actions. On standard benchmarks, DualTurn (0.5B) outperforms both VAP on agent action prediction (wF1 0.633 vs. 0.389) and a 3.1B audio-text model on word-level turn prediction (AUC 0.930 vs. 0.880), while anticipating turn boundaries earlier with fewer interruptions.
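The abstract reports wF1 without defining it. Assuming it denotes the usual support-weighted F1 over action classes, the metric can be sketched as below. The function name and the action labels in the example are hypothetical; the abstract does not name the five agent actions.

```python
from collections import Counter
from typing import Hashable, Sequence


def weighted_f1(y_true: Sequence[Hashable], y_pred: Sequence[Hashable]) -> float:
    """Support-weighted F1: per-class F1 averaged by each class's share of y_true."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    support = Counter(y_true)
    total = len(y_true)
    score = 0.0
    for label, n in support.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = n - tp
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        score += (n / total) * f1
    return score


# Hypothetical two-class example; "wait" and "speak" are placeholder labels.
y_true = ["wait", "wait", "speak", "speak"]
y_pred = ["wait", "speak", "speak", "speak"]
score = weighted_f1(y_true, y_pred)
```

Weighting by support means frequent actions dominate the score, which matters when some turn-taking actions are much rarer than others.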

Architecture Design (Transformers, SSMs, MoE) | Natural Language Processing | Speech & Audio

Citation Metrics

Citations: 0

Influential citations: 0

References: 22

Year: 2026

Venue: N/A
