The paper introduces VoiceAgentRAG, a dual-agent architecture designed to reduce latency in real-time voice agents that use Retrieval-Augmented Generation (RAG). The system decouples retrieval from response generation: a background "Slow Thinker" agent predicts likely follow-up topics and pre-fetches relevant documents into a fast semantic cache. According to the reported experiments, this reduces response latency because a foreground "Fast Talker" agent can answer from the pre-fetched cache alone whenever it holds a relevant document, without querying the vector database.
Real-time voice agents can bypass slow vector DB lookups with a dual-agent architecture that pre-fetches relevant documents into a sub-millisecond semantic cache.
We present VoiceAgentRAG, an open-source dual-agent memory router that decouples retrieval from response generation. A background Slow Thinker agent continuously monitors the conversation stream, predicts likely follow-up topics using an LLM, and pre-fetches relevant document chunks into a FAISS-backed semantic cache. A foreground Fast Talker agent reads only from this sub-millisecond cache, bypassing the vector database entirely on cache hits.
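The division of labor described above can be illustrated with a minimal, dependency-free sketch. It is not the VoiceAgentRAG implementation. The names `SemanticCache`, `VectorDB`, `slow_thinker`, `fast_talker`, and `predict_topics` are hypothetical. A toy bag-of-words similarity stands in for both the embedding model and the FAISS index, the LLM topic predictor is passed in as a plain function, and the Slow Thinker is called synchronously rather than run in the background. The fallback to the vector database on a cache miss, and the 0.25 similarity threshold, are assumptions as well.

```python
import math
from collections import Counter


def embed(text):
    """Toy bag-of-words vector; a stand-in for a real embedding model."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(count * b[token] for token, count in a.items() if token in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class SemanticCache:
    """In-memory cache keyed by similarity (stand-in for the FAISS-backed cache)."""

    def __init__(self, threshold=0.25):
        self.threshold = threshold  # assumed value; below this counts as a miss
        self.entries = []           # list of (vector, chunk) pairs

    def put(self, chunk):
        self.entries.append((embed(chunk), chunk))

    def get(self, query):
        """Return the most similar cached chunk, or None on a miss."""
        q = embed(query)
        best_score, best_chunk = 0.0, None
        for vec, chunk in self.entries:
            score = cosine(q, vec)
            if score > best_score:
                best_score, best_chunk = score, chunk
        return best_chunk if best_score >= self.threshold else None


class VectorDB:
    """Stand-in for the slow vector database; counts lookups for inspection."""

    def __init__(self, chunks):
        self.chunks = list(chunks)
        self.lookups = 0

    def search(self, query, k=1):
        self.lookups += 1
        q = embed(query)
        ranked = sorted(self.chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
        return ranked[:k]


def slow_thinker(utterance, predict_topics, db, cache):
    """Predict follow-up topics and pre-fetch matching chunks into the cache."""
    for topic in predict_topics(utterance):
        for chunk in db.search(topic):
            cache.put(chunk)


def fast_talker(query, db, cache):
    """Answer from the cache on a hit; on a miss, fall back to the DB (assumed)."""
    chunk = cache.get(query)
    if chunk is not None:
        return chunk, "cache"
    chunk = db.search(query)[0]
    cache.put(chunk)
    return chunk, "db"
```

In use, `slow_thinker` would be invoked after each user turn with a predictor that guesses what the user will ask next, so that a later `fast_talker` call for that topic is served from the cache and leaves `db.lookups` unchanged. A real deployment would run the Slow Thinker concurrently with the conversation, which this sketch does not attempt.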