The paper argues that current multilingual benchmarks primarily measure reasoning and factual recall, not true multilingual proficiency: "thinking" variants outperform "instruct" variants on these benchmarks even though instruct variants often do better on real-world multilingual tasks. To address this, the authors propose round-trip translation (RTT) as a more direct measure of multilingual capability: text is translated to a target language and back, and semantic gaps between the original and the result are assessed. They introduce the Lost in Translation (LiT) benchmark, demonstrating that RTT correlates strongly with real-world multilingual task performance and requires neither human reference translations nor a judge more capable than the tested models.
Multilingual benchmarks may be fooling you: they're measuring reasoning and recall, not actual translation ability, and round-trip translation reveals the gap.
Multilingual benchmarks guide the development of frontier models. Yet the multilingual evaluations reported for frontier models are structured similarly to popular reasoning and knowledge benchmarks, merely repeated across many languages. We show that such benchmarks, and consequently multilingual evaluations, measure mathematical reasoning and factual recall, not multilingual proficiency. For example, thinking variants dramatically outperform instruct variants on these benchmarks, yet often perform worse on real-world multilingual tasks, such as LMArena. We propose a simple alternative: evaluate multilingual capability via round-trip translation. Given text in a source language, translate it to a target language and back; semantic gaps between the original and the result expose failures in multilingual generation capabilities. On our benchmark, round-trip translation correlates almost perfectly (ρ = 0.94) with user ratings on LMArena, requires no human reference translations, and does not require a multilingual judge more capable than the tested models. Lastly, we introduce Lost in Translation (LiT), a challenging round-trip translation benchmark spanning widely spoken languages worldwide, for realistic evaluation of multilingual frontier models.
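The round-trip protocol described in the abstract can be sketched as follows. This is a minimal illustration, not the paper's implementation: `translate` and `semantic_score` are hypothetical stand-ins for a model's translation call and a judge's semantic-gap assessment, neither of which is specified here.

```python
def translate(text: str, source: str, target: str) -> str:
    # Hypothetical stand-in for a model translation call.
    # A real evaluation would query the model under test; here the
    # identity mapping makes the round trip trivially lossless.
    return text

def semantic_score(original: str, recovered: str) -> float:
    # Hypothetical judge: token-overlap Jaccard similarity as a crude
    # proxy for the semantic-gap assessment the paper describes.
    a = set(original.lower().split())
    b = set(recovered.lower().split())
    return len(a & b) / len(a | b) if a | b else 1.0

def round_trip_score(text: str, source: str, target: str) -> float:
    """Translate source -> target -> source and score semantic retention."""
    forward = translate(text, source, target)
    back = translate(forward, target, source)
    return semantic_score(text, back)
```

A degraded score on the round trip signals a multilingual generation failure, with no human reference translation needed since the original text serves as its own reference.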