DAMO, Beijing Language and Culture University, ELLIS, HIT, IBM Research, LMU, MBZUAI, SUSTech, University of Turku

Apr 21, 2026

arXiv:2604.19262

CulturALL: Benchmarking Multilingual and Multicultural Competence of LLMs on Grounded Tasks

Wenjiang Luo, Haotian Ye, Md Mehrab Hossain, Chunlan Ma, Shaoxiong Ji, Younes Samih, Dilda Duisenbek, Adrian Neo Sau Xun, Daria Pozdniakova, Liubou Misevich, Nevena Marinković, Ngoc Gia Linh Nguyen, Thi Khanh Linh Do, Sarakmatak Sophy, Baotian Hu, Guanhua Chen, Gongbo Tang, Alham Fikri Aji, Weihua Luo

AI Summary

The paper introduces CulturALL, a new benchmark designed to evaluate LLMs' multilingual and multicultural competence in grounded, real-world scenarios across 14 languages and 51 regions. CulturALL distinguishes itself from existing benchmarks by focusing on context-rich tasks requiring reasoning, rather than generic language understanding or cultural trivia. Experiments reveal that even the best LLMs achieve only 44.48% accuracy on CulturALL, highlighting a significant gap in current models' ability to handle grounded multilingual and multicultural tasks.

Key Contribution

LLMs still struggle to reason in context when cultural and linguistic nuances are involved, achieving only 44% accuracy on a new grounded benchmark spanning 14 languages.

Abstract

Large language models (LLMs) are now deployed worldwide, inspiring a surge of benchmarks that measure their multilingual and multicultural abilities. However, these benchmarks prioritize generic language understanding or superficial cultural trivia, leaving the evaluation of grounded tasks -- where models must reason within real-world, context-rich scenarios -- largely unaddressed. To fill this gap, we present CulturALL, a comprehensive and challenging benchmark to assess LLMs' multilingual and multicultural competence on grounded tasks. CulturALL is built via a human--AI collaborative framework: expert annotators ensure appropriate difficulty and factual accuracy, while LLMs lighten the manual workload. By incorporating diverse sources, CulturALL ensures comprehensive scenario coverage. Each item is carefully designed to present a high level of difficulty, making CulturALL challenging. CulturALL contains 2,610 samples in 14 languages from 51 regions, distributed across 16 topics to capture the full breadth of grounded tasks. Experiments show that the best LLM achieves 44.48% accuracy on CulturALL, underscoring substantial room for improvement.

Eval Frameworks & Benchmarks, Natural Language Processing

Year: 2026

Venue: N/A
