This paper addresses the challenge of machine translation for low-resource Turkic languages by generating and cleaning parallel corpora, then fine-tuning the NLLB-200 1.3B model. The authors created parallel corpora for six Turkic languages, ranging from 300,000 sentences per language pair to a combined multilingual corpus of 3,885,542 sentences, and demonstrated that cleaning and fine-tuning significantly improve translation quality. Results show substantial improvements in BLEU, chrF, WER, and TER scores compared to the baseline, validated by external and human evaluations.
Fine-tuning open-source models on synthetically generated and cleaned parallel corpora boosts machine translation quality for low-resource Turkic languages by up to 24 BLEU points.
This study presents the application of free, open-source artificial intelligence (AI) techniques to advance machine translation for low-resource Turkic languages: Kazakh, Azerbaijani, Kyrgyz, Turkish, Turkmen, and Uzbek. This machine translation problem for Turkic languages is part of a project to generate meeting minutes from speech transcripts. Due to limited parallel corpora and underdeveloped linguistic tools for these languages, traditional machine translation approaches often underperform. The goal is to reduce digital inequality for these languages and to support scalability. We investigate the effectiveness of free, open-source, pre-trained specialized and general-purpose AI models for morphologically rich Turkic languages. This research includes developing parallel corpora for six Turkic languages, fine-tuning, and performance evaluation using the BLEU, WER, TER, and chrF metrics. Parallel corpora for five language pairs, with 300,000 and 500,000 sentences each, were generated and cleaned. The results for the 500,000-sentence corpora show significant improvements over the baseline NLLB-200 1.3B on average: BLEU increased by 23.81 points, chrF increased by 26.05 points, and WER and TER decreased by 0.36 and 33.95, respectively, after cleaning and fine-tuning. Multilingual parallel corpora covering six Turkic languages, totaling 3,885,542 sentences, were then developed, and fine-tuning NLLB-200 1.3B on them shows the following, compared with the results for the 500,000-sentence cleaned corpus: BLEU increased by 4.3 points, chrF increased by 1.7 points, and WER and TER decreased by 0.1 and 4.75, respectively. These results demonstrate the high efficiency of corpus cleaning and synthetic data generation for improving the quality of machine translation for low-resource Turkic languages using AI models. These results were confirmed by external evaluation on the FLORES-200 dataset and by human evaluation.
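As an illustration of one of the reported metrics, word error rate (WER) is the word-level edit distance between a hypothesis translation and a reference, normalized by the reference length. A minimal sketch in plain Python (illustrative only; the paper's evaluation tooling is not specified here):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word-level edit distance divided by reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    # Dynamic-programming edit distance over words
    # (insertions, deletions, substitutions).
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution or match
    return d[len(ref)][len(hyp)] / max(len(ref), 1)
```

TER is computed similarly but additionally allows block shifts of word sequences, which is why its scores can diverge from WER on reordered translations.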
The scientific contribution of this article is threefold: a methodology for generating parallel corpora using a specialized machine-translation AI model and fine-tuning that model on the resulting corpora; new multilingual parallel corpora for the Azerbaijani–Kazakh, Kyrgyz–Kazakh, Turkish–Kazakh, Turkmen–Kazakh, and Uzbek–Kazakh pairs, created with the proposed methodology and then cleaned; and fine-tuning experiments conducted on these corpora.
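Corpus cleaning of the kind described above is commonly implemented with simple heuristic filters applied to each sentence pair. A minimal sketch (the paper's exact cleaning rules are not given here, so the deduplication, length-bound, and length-ratio filters below are assumptions about typical practice):

```python
def clean_parallel_corpus(pairs, min_len=1, max_len=200, max_ratio=3.0):
    """Filter a parallel corpus with common heuristics:
    exact deduplication, length bounds, and a source/target
    length-ratio check. Illustrative; actual filters may differ."""
    seen = set()
    cleaned = []
    for src, tgt in pairs:
        src, tgt = src.strip(), tgt.strip()
        ns, nt = len(src.split()), len(tgt.split())
        if not (min_len <= ns <= max_len and min_len <= nt <= max_len):
            continue  # drop empty or overly long sentences
        if max(ns, nt) / max(min(ns, nt), 1) > max_ratio:
            continue  # drop pairs with an implausible length ratio
        if (src, tgt) in seen:
            continue  # drop exact duplicate pairs
        seen.add((src, tgt))
        cleaned.append((src, tgt))
    return cleaned
```

Filters like these are cheap to run over millions of pairs, which matters at the multi-million-sentence scale of the corpora described in the abstract.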