This paper provides a technical tutorial for constructing enterprise-grade real-time voice agents, highlighting the limitations of end-to-end speech-to-speech models for real-time applications. It demonstrates that a cascaded streaming pipeline (STT -> LLM -> TTS) is crucial for achieving low latency. The tutorial implements a voice agent using Deepgram, vLLM, and ElevenLabs, achieving a P50 time-to-first-audio of 947ms, and releases the full codebase as a practical guide.
Forget slow, end-to-end models: building real-time voice agents hinges on a cascaded streaming pipeline, as demonstrated by a new tutorial achieving sub-second latency.
We present a technical tutorial for building enterprise-grade realtime voice agents from first principles. While over 25 open-source speech-to-speech models and numerous voice agent frameworks exist, no single resource explains the complete pipeline from individual components to a working streaming voice agent with function calling capabilities. Through systematic investigation, we find that (1) native speech-to-speech models like Qwen2.5-Omni, while capable of high-quality audio generation, are too slow for realtime interaction ($\sim$13s time-to-first-audio); (2) the industry-standard approach uses a cascaded streaming pipeline: STT $\rightarrow$ LLM $\rightarrow$ TTS, where each component streams its output to the next; and (3) the key to ``realtime'' is not any single fast model but rather \textit{streaming and pipelining} across components. We build a complete voice agent using Deepgram (streaming STT), vLLM-served LLMs with function calling (streaming text generation), and ElevenLabs (streaming TTS), achieving a measured P50 time-to-first-audio of 947ms (best case 729ms) with cloud LLM APIs, and comparable latency with self-hosted vLLM on an NVIDIA A10G GPU. We release the full codebase as a tutorial with working, tested code for every component.
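The streaming-and-pipelining idea in finding (3) can be illustrated with a minimal Python sketch. This is not the released codebase: `fake_stt`, `fake_llm`, `phrase_chunks`, `fake_tts`, and `run_turn` are hypothetical stand-ins built on async generators, and no Deepgram, vLLM, or ElevenLabs API is called. The sketch also assumes, as a simplification, that the LLM starts only once the full transcript is available and that TTS is fed one phrase at a time, split at punctuation.

```python
import asyncio
from typing import AsyncIterator, List


async def fake_stt(audio_frames: List[str]) -> AsyncIterator[str]:
    """Hypothetical streaming STT: yields one recognized word per frame."""
    for frame in audio_frames:
        await asyncio.sleep(0)  # yield control, as a network read would
        yield frame


async def fake_llm(prompt: str, events: List[str]) -> AsyncIterator[str]:
    """Hypothetical streaming LLM: yields a canned reply token by token."""
    for token in ["Sure,", "your", "order", "shipped", "today."]:
        await asyncio.sleep(0)
        events.append("token:" + token)
        yield token


async def phrase_chunks(tokens: AsyncIterator[str]) -> AsyncIterator[str]:
    """Group tokens into phrases at punctuation so TTS can start early."""
    buf: List[str] = []
    async for tok in tokens:
        buf.append(tok)
        if tok[-1] in ",.!?":
            yield " ".join(buf)
            buf = []
    if buf:  # flush any trailing tokens without punctuation
        yield " ".join(buf)


async def fake_tts(phrases: AsyncIterator[str], events: List[str]) -> AsyncIterator[bytes]:
    """Hypothetical streaming TTS: emits one audio chunk per phrase."""
    async for phrase in phrases:
        await asyncio.sleep(0)
        events.append("audio:" + phrase)
        yield phrase.encode()


async def run_turn(audio_frames: List[str]) -> List[str]:
    """Run one user turn and return the ordered log of pipeline events."""
    events: List[str] = []
    transcript = " ".join([w async for w in fake_stt(audio_frames)])
    events.append("transcript:" + transcript)
    tokens = fake_llm(transcript, events)
    async for _chunk in fake_tts(phrase_chunks(tokens), events):
        pass  # a real agent would write each chunk to the audio output here
    return events


events = asyncio.run(run_turn(["where", "is", "my", "order"]))
for event in events:
    print(event)
```

In the printed event log, the first audio chunk appears right after the first phrase of the reply and before the remaining tokens are generated. That ordering is why time-to-first-audio depends on each stage's time to first output rather than on its total runtime. A production pipeline would more likely run the stages as concurrent tasks connected by queues, so the LLM keeps generating while audio plays; the pull-based generators here only show the ordering.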