The paper argues that current multilingual benchmarks primarily measure reasoning and factual recall, not true multilingual proficiency: "thinking" variants outperform "instruct" variants on these benchmarks even though instruct variants often do better on real-world multilingual tasks. To address this, the authors propose round-trip translation (RTT) as a more direct measure of multilingual capability: text is translated to a target language and back, and semantic gaps between the original and the result are assessed. They introduce the Lost in Translation (LiT) benchmark, demonstrating that RTT correlates strongly with real-world multilingual task performance and requires neither human reference translations nor a judge more capable than the tested models.
Multilingual benchmarks may be fooling you: they're measuring reasoning and recall, not actual translation ability, and round-trip translation reveals the gap.
Multilingual benchmarks guide the development of frontier models. Yet the multilingual evaluations reported for frontier models are structured similarly to popular reasoning and knowledge benchmarks, merely repeated across many languages. We show that such benchmarks, and consequently multilingual evaluations, measure mathematical reasoning and factual recall, not multilingual proficiency. For example, thinking variants dramatically outperform instruct variants on these benchmarks, yet often perform worse on real-world multilingual tasks, such as LMArena. We propose a simple alternative: evaluate multilingual capability via round-trip translation. Given text in a source language, translate it to a target language and back; semantic gaps between the original and the result expose failures in multilingual generation capabilities. On our benchmark, round-trip translation correlates almost perfectly (ρ = 0.94) with user ratings on LMArena, requires no human reference translations, and does not require a multilingual judge more capable than the tested models. Lastly, we introduce Lost in Translation (LiT), a challenging round-trip translation benchmark spanning widely spoken languages worldwide, for realistic evaluation of multilingual frontier models.
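The round-trip protocol described in the abstract can be sketched as follows. This is a minimal illustration, not the paper's implementation: `translate` and `semantic_score` are hypothetical stand-ins for a model's translation call and a judge's semantic-gap assessment, neither of which is specified here.

```python
def translate(text: str, source: str, target: str) -> str:
    # Hypothetical stand-in for a model translation call.
    # A real evaluation would query the model under test; here the
    # identity mapping makes the round trip trivially lossless.
    return text

def semantic_score(original: str, recovered: str) -> float:
    # Hypothetical judge: token-overlap Jaccard similarity as a crude
    # proxy for the semantic-gap assessment the paper describes.
    a = set(original.lower().split())
    b = set(recovered.lower().split())
    return len(a & b) / len(a | b) if a | b else 1.0

def round_trip_score(text: str, source: str, target: str) -> float:
    """Translate source -> target -> source and score semantic retention."""
    forward = translate(text, source, target)
    back = translate(forward, target, source)
    return semantic_score(text, back)
```

A degraded score on the round trip signals a multilingual generation failure, with no human reference translation needed since the original text serves as its own reference.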