The paper introduces AfriEconQA, a new benchmark dataset for African economic analysis constructed from 236 World Bank reports, designed to evaluate the numerical reasoning and temporal disambiguation capabilities of large language models. The dataset comprises 8,937 question-answer pairs, filtered from a larger synthetic pool to ensure high-quality evidence-answer alignment and temporal provenance. Benchmarking experiments using GPT-5 Mini, GPT-4o, and Qwen 32B in zero-shot and RAG configurations reveal a significant performance gap, highlighting the dataset's difficulty for current LLMs and the need for advances in domain-specific IR and RAG.
LLMs' impressive general knowledge evaporates when faced with African economic data, as even advanced RAG pipelines struggle to answer questions based on World Bank reports, revealing a stark domain-specific knowledge gap.
We introduce AfriEconQA, a specialized benchmark dataset for African economic analysis grounded in a comprehensive corpus of 236 World Bank reports. The task of AfriEconQA is to answer complex economic queries that require high-precision numerical reasoning and temporal disambiguation over specialized institutional documents. The dataset consists of 8,937 curated QA instances, rigorously filtered from a pool of 10,018 synthetic questions to ensure high-quality evidence-answer alignment. Each instance is composed of: (1) a question requiring reasoning over economic indicators, (2) the corresponding evidence retrieved from the corpus, (3) a verified ground-truth answer, and (4) source metadata (e.g., URL and publication date) to ensure temporal provenance. AfriEconQA is the first benchmark focused specifically on African economic analysis, and it poses a unique challenge for Information Retrieval (IR) systems because the underlying data is largely absent from the pretraining corpora of current Large Language Models (LLMs). We operationalize this dataset through an 11-experiment matrix, benchmarking a zero-shot baseline (GPT-5 Mini) against RAG configurations using GPT-4o and Qwen 32B across five distinct embedding and ranking strategies. Our results demonstrate a severe parametric knowledge gap: zero-shot models fail to answer over 90 percent of queries, and even state-of-the-art RAG pipelines struggle to achieve high precision. This confirms AfriEconQA as a robust and challenging benchmark for the next generation of domain-specific IR and RAG systems. The AfriEconQA dataset and code will be made publicly available upon publication.
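The four-part instance structure described above can be sketched as a simple record type. This is a minimal illustration only: the field names, types, and the example values are assumptions for exposition, not the dataset's actual schema or contents.

```python
from dataclasses import dataclass


@dataclass
class QAInstance:
    """Hypothetical sketch of one AfriEconQA-style instance:
    question, supporting evidence, verified answer, and source metadata."""
    question: str          # economic query requiring numerical/temporal reasoning
    evidence: str          # passage drawn from the report corpus
    answer: str            # verified ground-truth answer
    source_url: str        # provenance: where the evidence came from
    publication_date: str  # provenance: when the source report was published


# Illustrative (fabricated) instance showing how temporal metadata
# disambiguates an indicator that changes year to year.
example = QAInstance(
    question="What annual inflation rate does the report cite for the country?",
    evidence="Headline inflation averaged 6.1 percent over the year.",
    answer="6.1 percent",
    source_url="https://documents.worldbank.org/",
    publication_date="2021-06-30",
)

print(example.answer)
```

A RAG evaluation over such records would retrieve candidate evidence for `question`, generate an answer, and score it against `answer`, with `publication_date` available to resolve which reporting period a numeric claim refers to.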