Mar 5, 2026arXiv:2603.05413

Building Enterprise Realtime Voice Agents from Scratch: A Technical Tutorial

Jielin Qiu, Zixiang Chen, Liangwei Yang, Ming Zhu, Zhiwei Liu, Juntao Tan, Wenting Zhao, Rithesh Murthy, Roshan Ram, Akshara Prabhakar, Shelby Heinecke, Caiming Xiong, Silvio Savarese, Huan Wang

AI Summary

This paper provides a technical tutorial for constructing enterprise-grade real-time voice agents, highlighting the limitations of end-to-end speech-to-speech models for real-time applications. It demonstrates that a cascaded streaming pipeline (STT -> LLM -> TTS) is crucial for achieving low latency. The tutorial implements a voice agent using Deepgram, vLLM, and ElevenLabs, achieving a P50 time-to-first-audio of 947ms, and releases the full codebase as a practical guide.

Key Contribution

Forget slow, end-to-end models: building real-time voice agents hinges on a cascaded streaming pipeline, as demonstrated by a new tutorial achieving sub-second latency.

Abstract

We present a technical tutorial for building enterprise-grade realtime voice agents from first principles. While over 25 open-source speech-to-speech models and numerous voice agent frameworks exist, no single resource explains the complete pipeline from individual components to a working streaming voice agent with function calling capabilities. Through systematic investigation, we find that (1) native speech-to-speech models like Qwen2.5-Omni, while capable of high-quality audio generation, are too slow for realtime interaction ($\sim$13s time-to-first-audio); (2) the industry-standard approach uses a cascaded streaming pipeline: STT $\rightarrow$ LLM $\rightarrow$ TTS, where each component streams its output to the next; and (3) the key to ``realtime''is not any single fast model but rather \textit{streaming and pipelining} across components. We build a complete voice agent using Deepgram (streaming STT), vLLM-served LLMs with function calling (streaming text generation), and ElevenLabs (streaming TTS), achieving a measured P50 time-to-first-audio of 947ms (best case 729ms) with cloud LLM APIs, and comparable latency with self-hosted vLLM on NVIDIA A10G GPU. We release the full codebase as a tutorial with working, tested code for every component.

Open-Source Models & Weights Speech & Audio Tool Use & Agents

Citation Metrics

Citations0

Influential citations0

References32

Year2026

VenueN/A

Related Papers

Finding related papers...

Search

Building Enterprise Realtime Voice Agents from Scratch: A Technical Tutorial

Related Papers