Google Research, Technion · Feb 15, 2026 · arXiv:2602.14080

Empty Shelves or Lost Keys? Recall Is the Bottleneck for Parametric Factuality

Nitay Calderon, Eyal Ben-David, Zorik Gekhman, Eran Ofek, Gal Yona

AI Summary

This paper introduces a framework that decomposes factual errors in LLMs into encoding failures (empty shelves) and recall failures (lost keys), arguing that current evaluations conflate the two. The authors introduce WikiProfile, a benchmark built by an automated pipeline that uses a prompted LLM grounded in web search, to profile factual knowledge at the fact level. Experiments on 13 LLMs show that encoding is nearly saturated in frontier models such as GPT-5 and Gemini-3, while recall remains a significant bottleneck, particularly for long-tail facts and reverse questions.

Key Contribution

On the WikiProfile benchmark, frontier LLMs such as GPT-5 and Gemini-3 already encode 95-98% of facts but often fail to recall them, suggesting that future gains in factuality may depend less on scaling and more on methods that improve how models access what they already encode.

Abstract

Standard factuality evaluations of LLMs treat all errors alike, obscuring whether failures arise from missing knowledge (empty shelves) or from limited access to encoded facts (lost keys). We propose a behavioral framework that profiles factual knowledge at the level of facts rather than questions, characterizing each fact by whether it is encoded, and then by how accessible it is: cannot be recalled, can be directly recalled, or can only be recalled with inference-time computation (thinking). To support such profiling, we introduce WikiProfile, a new benchmark constructed via an automated pipeline with a prompted LLM grounded in web search. Across 4 million responses from 13 LLMs, we find that encoding is nearly saturated in frontier models on our benchmark, with GPT-5 and Gemini-3 encoding 95-98% of facts. However, recall remains a major bottleneck: many errors previously attributed to missing knowledge instead stem from failures to access it. These failures are systematic and disproportionately affect long-tail facts and reverse questions. Finally, we show that thinking improves recall and can recover a substantial fraction of failures, indicating that future gains may rely less on scaling and more on methods that improve how models utilize what they already encode.
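The fact-level profile described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the names `FactProfile`, `profile_fact`, and `summarize` are hypothetical, and the three boolean inputs stand in for whatever tests the paper actually uses to decide whether a fact is encoded, directly recalled, or recalled with thinking. Treating an unencoded fact as "not encoded" regardless of the recall flags is a simplifying assumption made here.

```python
from collections import Counter
from enum import Enum


class FactProfile(Enum):
    """Hypothetical labels for the four outcomes named in the abstract."""
    NOT_ENCODED = "not encoded"
    ENCODED_NOT_RECALLED = "encoded, cannot be recalled"
    RECALLED_WITH_THINKING = "encoded, recalled only with thinking"
    DIRECTLY_RECALLED = "encoded, directly recalled"


def profile_fact(encoded: bool, direct_recall: bool, thinking_recall: bool) -> FactProfile:
    """Assign one profile to a fact from three boolean observations.

    Assumption: accessibility is only considered for encoded facts,
    so an unencoded fact is always labeled NOT_ENCODED.
    """
    if not encoded:
        return FactProfile.NOT_ENCODED
    if direct_recall:
        return FactProfile.DIRECTLY_RECALLED
    if thinking_recall:
        return FactProfile.RECALLED_WITH_THINKING
    return FactProfile.ENCODED_NOT_RECALLED


def summarize(observations):
    """Count how many facts fall into each profile."""
    return Counter(profile_fact(*obs) for obs in observations)


# Toy observations (encoded, direct_recall, thinking_recall); not data from the paper.
toy_observations = [
    (True, True, True),
    (True, False, True),
    (True, False, False),
    (False, False, False),
]
counts = summarize(toy_observations)
for profile in FactProfile:
    print(f"{profile.value}: {counts[profile]}")
```

Profiling at the fact level, rather than scoring individual questions, is what lets an error be attributed either to missing knowledge or to failed access.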

Eval Frameworks & Benchmarks · Natural Language Processing

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
