This paper compares the accuracy and cost of long-context LLMs against a fact-based memory system built on Mem0 for persistent conversational agents, using three memory-centric benchmarks. Long-context LLMs achieve higher factual recall on LongMemEval and LoCoMo, while the memory system is competitive on PersonaMemv2. A cost model that incorporates prompt caching shows that the memory system becomes cheaper than long-context inference after approximately ten turns at a 100k-token context, with the break-even point falling as context length grows. This gives a practical criterion for choosing between the two.
Fact-based memory systems can become more cost-effective than long-context LLMs for persistent agents after approximately ten interaction turns at a 100k-token context, challenging the assumption that simply scaling context length is always the best approach.
Persistent conversational AI systems face a choice between passing full conversation histories to a long-context large language model (LLM) and maintaining a dedicated memory system that extracts and retrieves structured facts. We compare a fact-based memory system built on the Mem0 framework against long-context LLM inference on three memory-centric benchmarks (LongMemEval, LoCoMo, and PersonaMemv2) and evaluate both architectures on accuracy and cumulative API cost. Long-context GPT-5-mini achieves higher factual recall on LongMemEval and LoCoMo, while the memory system is competitive on PersonaMemv2, where persona consistency depends on stable, factual attributes suited to flat-typed extraction. We construct a cost model that incorporates prompt caching and show that the two architectures have structurally different cost profiles: long-context inference incurs a per-turn charge that grows with context length even under caching, while the memory system's per-turn read cost remains roughly fixed after a one-time write phase. At a context length of 100k tokens, the memory system becomes cheaper after approximately ten interaction turns, with the break-even point decreasing as context length grows. These results characterize the accuracy-cost trade-off between the two approaches and provide a concrete criterion for selecting between them in production deployments.
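The shape of the cost argument in the abstract can be illustrated with a minimal sketch. It is not the paper's cost model: the abstract does not give its formulas or prices, so the functional forms and every parameter value here (`cached_fraction`, `write_multiplier`, `read_tokens`) are assumptions chosen for illustration. Costs are expressed in units of one uncached input token, so no real price list is implied.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class CostParams:
    """Hypothetical parameters; none of these values come from the paper."""
    cached_fraction: float = 0.1   # assumed price of a cached token relative to an uncached one
    write_multiplier: float = 1.5  # assumed one-time extraction cost per context token
    read_tokens: int = 2_000       # assumed tokens retrieved from memory on each turn


def long_context_cost(turns: int, context_tokens: int, p: CostParams) -> float:
    """Cumulative cost of resending the full history on every turn.

    The first turn pays full price; later turns pay the cached rate,
    so the per-turn charge still scales with context length.
    """
    if turns <= 0:
        return 0.0
    return context_tokens * (1.0 + (turns - 1) * p.cached_fraction)


def memory_cost(turns: int, context_tokens: int, p: CostParams) -> float:
    """Cumulative cost of a one-time write phase plus a fixed read per turn."""
    write_phase = context_tokens * p.write_multiplier
    return write_phase + turns * p.read_tokens


def break_even_turn(context_tokens: int, p: CostParams,
                    max_turns: int = 1_000) -> Optional[int]:
    """First turn at which the memory system is strictly cheaper, or None."""
    for t in range(1, max_turns + 1):
        if memory_cost(t, context_tokens, p) < long_context_cost(t, context_tokens, p):
            return t
    return None


if __name__ == "__main__":
    params = CostParams()
    for ctx in (20_000, 100_000, 200_000):
        print(ctx, break_even_turn(ctx, params))
```

With these placeholder values the break-even turn falls from 8 to 7 as the context doubles from 100k to 200k tokens, and at 20k tokens the memory system never becomes cheaper. The specific numbers are artifacts of the assumed parameters and should not be read as the paper's results. Only the qualitative pattern, a break-even point that moves earlier as context grows, mirrors the abstract.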