Mosaic AI | Mar 2, 2026 | arXiv:2603.02206

VoiceAgentRAG: Solving the RAG Latency Bottleneck in Real-Time Voice Agents Using Dual-Agent Architectures

Jielin Qiu, Zixiang Chen, Liangwei Yang, Ming Zhu, Juntao Tan, Haolin Chen, Wenting Zhao, Rithesh Murthy, Roshan Ram, Akshara Prabhakar, Shelby Heinecke, Caiming Xiong, Silvio Savarese

AI Summary

The paper introduces VoiceAgentRAG, a dual-agent architecture designed to reduce latency in real-time voice agents that use Retrieval-Augmented Generation (RAG). By decoupling retrieval from response generation, the system uses a background "Slow Thinker" agent to proactively predict follow-up topics and pre-fetch relevant documents into a fast semantic cache. Experiments demonstrate that this approach significantly reduces latency: the foreground "Fast Talker" agent answers directly from the cache whenever it already holds a relevant document.

Key Contribution

Real-time voice agents can bypass slow vector DB lookups with a dual-agent architecture that pre-fetches relevant documents into a sub-millisecond semantic cache.

Abstract

We present VoiceAgentRAG, an open-source dual-agent memory router that decouples retrieval from response generation. A background Slow Thinker agent continuously monitors the conversation stream, predicts likely follow-up topics using an LLM, and pre-fetches relevant document chunks into a FAISS-backed semantic cache. A foreground Fast Talker agent reads only from this sub-millisecond cache, bypassing the vector database entirely on cache hits.
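
The division of labor described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the class names, the `predict_topics` callback (a stand-in for the LLM topic predictor), the similarity threshold, and the fallback to the vector database on a cache miss are all assumptions made for illustration. A toy bag-of-words cosine similarity stands in for both the embedding model and the FAISS index so that the example runs with the standard library only.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy bag-of-words 'embedding' (assumption: replaces a real embedding model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


class VectorDB:
    """Stand-in for the slow vector database; counts how often it is queried."""

    def __init__(self, docs):
        self.docs = list(docs)
        self.lookups = 0

    def search(self, query, k=1):
        self.lookups += 1
        q = embed(query)
        ranked = sorted(self.docs, key=lambda d: cosine(q, embed(d)), reverse=True)
        return ranked[:k]


class SemanticCache:
    """In-memory cache keyed by similarity (brute force here, not FAISS)."""

    def __init__(self, threshold=0.3):  # threshold value is an assumption
        self.threshold = threshold
        self.entries = []  # list of (embedding, chunk)

    def add(self, chunk):
        self.entries.append((embed(chunk), chunk))

    def lookup(self, query):
        q = embed(query)
        best = max(self.entries, key=lambda e: cosine(q, e[0]), default=None)
        if best is not None and cosine(q, best[0]) >= self.threshold:
            return best[1]
        return None


class SlowThinker:
    """Background agent: predicts follow-up topics and pre-fetches chunks."""

    def __init__(self, db, cache, predict_topics):
        self.db, self.cache, self.predict_topics = db, cache, predict_topics

    def observe(self, utterance):
        for topic in self.predict_topics(utterance):
            for chunk in self.db.search(topic):
                self.cache.add(chunk)


class FastTalker:
    """Foreground agent: answers from the cache, skipping the DB on a hit."""

    def __init__(self, db, cache):
        self.db, self.cache = db, cache

    def retrieve(self, query):
        hit = self.cache.lookup(query)
        if hit is not None:
            return hit, "cache"
        # Assumed fallback: query the vector database on a cache miss.
        return self.db.search(query)[0], "db"


docs = [
    "Pricing: the pro plan costs 20 dollars per month",
    "Refunds are issued within 14 days of purchase",
    "Support is available by email around the clock",
]
db = VectorDB(docs)
cache = SemanticCache()
# Hypothetical rule-based predictor standing in for the LLM.
thinker = SlowThinker(db, cache, lambda u: ["refunds purchase"] if "cost" in u.lower() else [])
talker = FastTalker(db, cache)

first = "How much does the pro plan cost?"
print(talker.retrieve(first))   # cold cache, so this goes to the vector DB
thinker.observe(first)          # pre-fetch for the predicted follow-up
print(talker.retrieve("Can I get a refund within 14 days?"))  # served from cache
```

In a real deployment the Slow Thinker would run concurrently with the conversation rather than being called inline as it is here; the sketch only shows which agent touches the vector database and which one reads the cache.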

Recommendation & Information Retrieval | Speech & Audio | Tool Use & Agents

Citation Metrics

Citations: 0

Influential citations: 0

References: 19

Year: 2026

Venue: N/A
