Mar 16, 2026 · arXiv:2603.14782

Information Asymmetry across Language Varieties: A Case Study on Cantonese-Mandarin and Bavarian-German QA

Renhao Pei, Siyao Peng, Verena Blaschke, Robert Litschko, Barbara Plank

AI Summary

This paper introduces a new QA dataset designed to evaluate LLM performance on knowledge present in lower-resource language varieties (Cantonese and Bavarian) but absent in their higher-resource counterparts (Mandarin and German). Experiments reveal that LLMs struggle to answer questions relying on information solely available in the local Wikipedia editions. Providing context from the local Wikipedia lead sections significantly improves performance, highlighting the potential of these resources for enhancing LLM knowledge.

Key Contribution

LLMs often fail to access knowledge uniquely available in lower-resource language varieties, even when closely related to high-resource languages, revealing a significant information asymmetry.

Abstract

Large Language Models (LLMs) are becoming a common way for humans to seek knowledge, yet their coverage and reliability vary widely. Especially for local language varieties, there are large asymmetries, e.g., information in local Wikipedia that is absent from the standard variant. However, little is known about how well LLMs perform under such information asymmetry, especially on closely related languages. We manually construct a novel challenge question-answering (QA) dataset that captures knowledge conveyed on local Wikipedia pages but absent from their higher-resource counterparts, covering Mandarin Chinese vs. Cantonese and German vs. Bavarian. Our experiments show that LLMs fail to answer questions about information only in local editions of Wikipedia. Providing context from lead sections substantially improves performance, with further gains possible via translation. Our topical and geographic annotations, together with stratified evaluations, reveal the usefulness of local Wikipedia editions as sources of both regional and global information. These findings raise critical questions about inclusivity and cultural coverage of LLMs.
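The comparison described in the abstract (answering without context versus with the local lead section, stratified by variety) can be illustrated with a minimal sketch. Everything here is an assumption for illustration: the record fields, the placeholder examples, the exact-match scoring, and the `toy_model` stand-in are hypothetical and are not taken from the paper's dataset, metric, or code.

```python
from collections import defaultdict
from typing import Callable, Optional

# Hypothetical placeholder records; field names and contents are illustrative only.
EXAMPLES = [
    {"question": "placeholder question A", "gold": "alpha",
     "context": "The answer is alpha.", "variety": "Cantonese"},
    {"question": "placeholder question B", "gold": "beta",
     "context": "The answer is beta.", "variety": "Bavarian"},
]


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace before comparing answers."""
    return " ".join(text.lower().split())


def exact_match(prediction: str, gold: str) -> bool:
    """Assumed metric: normalized exact match (the paper's metric may differ)."""
    return normalize(prediction) == normalize(gold)


def evaluate(
    examples: list[dict],
    answer_fn: Callable[[str, Optional[str]], str],
    use_context: bool,
) -> dict[str, float]:
    """Return accuracy per language variety, with or without the context passage."""
    hits: dict[str, int] = defaultdict(int)
    totals: dict[str, int] = defaultdict(int)
    for ex in examples:
        context = ex["context"] if use_context else None
        prediction = answer_fn(ex["question"], context)
        totals[ex["variety"]] += 1
        hits[ex["variety"]] += int(exact_match(prediction, ex["gold"]))
    return {variety: hits[variety] / totals[variety] for variety in totals}


def toy_model(question: str, context: Optional[str]) -> str:
    """Stand-in for an LLM call: answers only when a context passage is given."""
    if context is None:
        return "unknown"
    # Take the last word of the context as the answer (toy behaviour only).
    return context.rstrip(".").split()[-1]


closed_book = evaluate(EXAMPLES, toy_model, use_context=False)
with_context = evaluate(EXAMPLES, toy_model, use_context=True)
print("closed-book:", closed_book)
print("with context:", with_context)
```

In practice `toy_model` would be replaced by a real model call, and the stratification key could be a topical or geographic label instead of the variety.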

Data Curation & Synthetic Data · Eval Frameworks & Benchmarks · Natural Language Processing

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
