This paper investigates the mathematical reasoning capabilities of LLMs in low-resource languages (Sinhala and Tamil) compared to English, using a newly constructed parallel dataset of math problems. The study assesses whether LLMs genuinely reason in these languages or rely on implicit translation to English-like representations across six math problem types. Results show that while basic arithmetic reasoning transfers well, performance on complex reasoning tasks degrades significantly in Tamil and Sinhala, indicating non-uniform reasoning capabilities across languages.
LLMs that ace math in English stumble badly in Sinhala and Tamil, revealing that multilingual competence doesn't guarantee uniform reasoning across languages.
Large language models (LLMs) demonstrate strong mathematical reasoning in English, but whether these capabilities reflect genuine multilingual reasoning or reliance on translation-based processing in low-resource languages like Sinhala and Tamil remains unclear. We examine this fundamental question by evaluating whether LLMs genuinely reason mathematically in these languages or depend on implicit translation to English-like representations. Using a taxonomy of six math problem types, from basic arithmetic to complex unit conflict and optimization problems, we evaluate four prominent large language models. To avoid translation artifacts that confound language ability with translation quality, we construct a parallel dataset in which each problem is natively authored in all three languages (English, Sinhala, and Tamil) by fluent speakers with mathematical training. Our analysis demonstrates that while basic arithmetic reasoning transfers robustly across languages, complex reasoning tasks show significant degradation in Tamil and Sinhala. The pattern of failures varies by model and problem type, suggesting that apparent multilingual competence may not reflect uniform reasoning capabilities across languages. These findings challenge the common assumption that models exhibiting strong multilingual performance can reason equally effectively across languages, and highlight the need for fine-grained, type-aware evaluation in multilingual settings.