The CARTE benchmark evaluates large language models (LLMs) on their ability to reason about regionally differentiated knowledge within France, addressing a gap in existing benchmarks, which focus largely on national-level cultural understanding. Its 2,431 questions span 13 metropolitan regions and 14 thematic domains, targeting the cultural and linguistic variation between closely related regions. Results show that performance varies across regions and model sizes, suggesting systematic gaps in pretraining coverage and limited robustness to intra-national differences.
LLMs show performance disparities across French regions, suggesting gaps in pretraining coverage that could hinder their use in culturally diverse contexts.
We introduce CARTE (Culturally Anchored Regional-Territorial Evaluation), a multiple-choice benchmark for evaluating the ability of large language models (LLMs) to perform fine-grained reasoning over geographically grounded and regionally differentiated knowledge within France. Prior benchmarks focus on national-level cultural understanding and largely overlook intra-country variation and the need to distinguish between closely related regional contexts. CARTE addresses this gap with 2,431 questions spanning the 13 metropolitan regions of France and covering 14 thematic domains, including culture, language, demographics, economy, environment, and mobility. We further introduce CARTE-LV, a subset targeting Linguistic Variation across French regions, enabling focused evaluation of language-related differences. We evaluate 27 LLMs ranging from 1B to 12B parameters under few-shot settings. Our experiments reveal performance disparities across regions and model scales, suggesting systematic gaps in pretraining coverage and limited robustness to intra-national variation.
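As an illustration of how regional disparities on a multiple-choice benchmark might be quantified, the following minimal sketch computes per-region accuracy and the spread between the best and worst region. It is not the authors' evaluation code: the record keys (`region`, `gold`, `pred`), the function names, and the toy data are all assumptions made for this example.

```python
from collections import defaultdict


def accuracy_by_region(records):
    """Return {region: accuracy} for multiple-choice predictions.

    Each record is a dict with hypothetical keys:
    'region' (str), 'gold' (correct option), 'pred' (model's option).
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for rec in records:
        total[rec["region"]] += 1
        if rec["pred"] == rec["gold"]:
            correct[rec["region"]] += 1
    return {region: correct[region] / total[region] for region in total}


def regional_gap(accuracies):
    """Difference between the best and worst per-region accuracy."""
    return max(accuracies.values()) - min(accuracies.values())


# Toy records, invented for illustration only (not benchmark data).
records = [
    {"region": "Bretagne", "gold": "A", "pred": "A"},
    {"region": "Bretagne", "gold": "C", "pred": "B"},
    {"region": "Occitanie", "gold": "D", "pred": "D"},
    {"region": "Occitanie", "gold": "B", "pred": "B"},
]

acc = accuracy_by_region(records)
gap = regional_gap(acc)
```

A max-minus-min gap is only one possible summary; a variance or a per-domain breakdown could be computed from the same records.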