The paper introduces CulT-Eval, a new benchmark for evaluating machine translation (MT) systems on their ability to handle culturally grounded expressions like idioms and slang. CulT-Eval comprises 7,959 instances and a novel error taxonomy to identify failures in capturing culturally induced meaning deviations. Experiments with large language models reveal that current MT systems struggle with cultural nuances, motivating a complementary evaluation metric to address the shortcomings of standard MT metrics.
Current machine translation systems often fail to capture the nuances of culturally loaded expressions, highlighting a critical gap in their ability to truly understand and translate language.
Culture-loaded expressions, such as idioms, slang, and culture-specific items (CSIs), are pervasive in natural language and encode meanings that go beyond literal linguistic form. Accurately translating such expressions remains challenging for machine translation (MT) systems, yet existing benchmarks are fragmented and do not provide a systematic framework for evaluating translation performance on them. To address this gap, we introduce CulT-Eval, a benchmark designed to evaluate how models handle different types of culturally grounded expressions. CulT-Eval comprises 7,959 carefully curated instances spanning multiple expression types, paired with a comprehensive error taxonomy for culturally grounded translation failures. Through extensive evaluation of large language models and detailed analysis, we identify recurring, systematic failure modes that are not adequately captured by existing automatic metrics. Accordingly, we propose a complementary evaluation metric that targets culturally induced meaning deviations overlooked by standard MT metrics. Our results indicate that current models struggle to preserve culturally grounded meaning and to capture the cultural and contextual nuances essential for accurate translation. Our benchmark and code are available at https://anonymous.4open.science/r/CulT-Eval-E75D/.
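To make the metric gap concrete, the following minimal sketch (not the paper's proposed metric) shows how surface-overlap metrics such as BLEU and chrF can under-penalize a literal rendering of an idiom. It assumes the sacrebleu library is installed, and the French source and English hypotheses are invented for illustration only.

```python
# Minimal sketch (not CulT-Eval's metric): surface-overlap MT metrics can
# under-penalize culturally induced meaning errors.
# Assumes `pip install sacrebleu`; all sentences are invented for illustration.
from sacrebleu.metrics import BLEU, CHRF

# French source: "Il a cassé sa pipe l'année dernière." (idiom for "he died")
reference = ["He passed away last year."]  # meaning-preserving reference
hypotheses = {
    "literal":  ["He broke his pipe last year."],  # idiom rendered literally, meaning lost
    "faithful": ["He died last year."],            # meaning preserved, less surface overlap
}

bleu, chrf = BLEU(), CHRF()
for name, hyp in hypotheses.items():
    print(f"{name:8s} "
          f"BLEU={bleu.corpus_score(hyp, [reference]).score:5.1f} "
          f"chrF={chrf.corpus_score(hyp, [reference]).score:5.1f}")

# Because the literal hypothesis shares more tokens with the reference, it can
# score on par with (or above) the faithful one despite conveying the wrong
# meaning -- the kind of deviation a culture-aware metric must penalize.
```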