This paper compares BLEU and ChrF++ for evaluating machine translation quality in extremely low-resource language (ELRL) settings, using outputs from LLMs and NMT systems. The study analyzes how each metric responds to common translation artifacts, including hallucinations, repetition, source-text copying, and diacritic (matra) variations, across three ELRLs: Magahi, Bhojpuri, and Chhattisgarhi. The results indicate that although recent work often relies solely on ChrF++, BLEU provides complementary lexical-precision insights that improve the interpretability of MT evaluation in ELRL scenarios.
Don't ditch BLEU for ChrF++ just yet: in extremely low-resource MT, BLEU's lexical precision offers crucial insights that ChrF++ misses.
Evaluating machine translation (MT) quality in extremely low-resource language (ELRL) scenarios poses unique challenges, as widely used metrics such as BLEU, effective in high-resource settings, often misrepresent quality in data-scarce contexts. This work presents a comparative analysis of BLEU, an n-gram-based metric, and ChrF++, a character-based metric, for MT evaluation in ELRL settings. We examine how each metric responds to translation artifacts, including hallucinations, repetition, source-text copying, and diacritic (matra) variations across three ELRLs: Magahi, Bhojpuri, and Chhattisgarhi, with a focus on outputs from large language models (LLMs) and neural MT (NMT) systems. While recent work often relies solely on ChrF++, our findings show that BLEU, despite its lower absolute scores, provides complementary lexical-precision insights that improve interpretability.
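To make the contrast between the two metric families concrete, the following is a minimal sketch of their core building blocks. It is not the paper's evaluation code and it is neither full BLEU nor full ChrF++. It assumes a single clipped word-bigram precision as a stand-in for BLEU's lexical precision (no brevity penalty, no averaging over n-gram orders) and a single-order character trigram F-score with beta = 2 as a stand-in for ChrF++ (no averaging over character orders, no word n-grams). The function names and the English toy sentences are hypothetical and chosen only for illustration.

```python
from collections import Counter


def ngrams(items, n):
    """Count all contiguous n-grams in a sequence (list of words or a string)."""
    return Counter(tuple(items[i:i + n]) for i in range(len(items) - n + 1))


def word_ngram_precision(hypothesis, reference, n=2):
    """Clipped word n-gram precision for one order n (a BLEU building block)."""
    hyp = ngrams(hypothesis.split(), n)
    ref = ngrams(reference.split(), n)
    total = sum(hyp.values())
    if total == 0:
        return 0.0
    # Each hypothesis n-gram is credited at most as often as it occurs in the reference.
    matched = sum(min(count, ref[gram]) for gram, count in hyp.items())
    return matched / total


def char_ngram_fscore(hypothesis, reference, n=3, beta=2.0):
    """Character n-gram F-beta score for one order n (a chrF-style building block)."""
    # Assumption: whitespace is dropped before extracting character n-grams.
    hyp = ngrams(hypothesis.replace(" ", ""), n)
    ref = ngrams(reference.replace(" ", ""), n)
    matched = sum(min(count, ref[gram]) for gram, count in hyp.items())
    if matched == 0:
        return 0.0
    precision = matched / sum(hyp.values())
    recall = matched / sum(ref.values())
    b2 = beta ** 2
    # beta > 1 weights recall more heavily than precision.
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Hypothetical toy example: a correct translation followed by a repetition loop.
reference = "the cat sat on the mat"
repeated = "the cat sat on the mat the mat the mat"

print(word_ngram_precision(repeated, reference))  # 5/9, about 0.56
print(char_ngram_fscore(repeated, reference))     # 25/29, about 0.86
```

In this toy case the repeated output keeps full character-level recall, so the recall-weighted F-score stays near 0.86, while the precision-only word-bigram score drops to about 0.56. This illustrates one mechanism by which a precision-oriented signal can flag repetition more sharply. It does not reproduce the paper's measurements on Magahi, Bhojpuri, or Chhattisgarhi.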