Jan 6, 2026 | arXiv:2601.03135

Improving Indigenous Language Machine Translation with Synthetic Data and Language-Specific Preprocessing

Aashish Dhawan, Christopher Driggers-Ellis, Christan Grant, Daisy Zhe Wang

AI Summary

This paper investigates synthetic data augmentation and language-specific preprocessing as ways to improve neural machine translation (NMT) for low-resource indigenous languages. The authors generate synthetic parallel data with a multilingual translation model and fine-tune an mBART model on curated-only and synthetically augmented datasets. Results on Guarani-Spanish and Quechua-Spanish translation show consistent chrF++ improvements from synthetic data augmentation and language-specific preprocessing.

Key Contribution

Synthetic data augmentation and tailored preprocessing yield consistent chrF++ gains for Guarani-Spanish and Quechua-Spanish translation, while diagnostic experiments on Aymara show that generic preprocessing falls short for highly agglutinative languages.

Abstract

Low-resource indigenous languages often lack the parallel corpora required for effective neural machine translation (NMT). Synthetic data generation offers a practical strategy for mitigating this limitation in data-scarce settings. In this work, we augment curated parallel datasets for indigenous languages of the Americas with synthetic sentence pairs generated using a high-capacity multilingual translation model. We fine-tune a multilingual mBART model on curated-only and synthetically augmented data and evaluate translation quality using chrF++, the primary metric used in recent AmericasNLP shared tasks for agglutinative languages. We further apply language-specific preprocessing, including orthographic normalization and noise-aware filtering, to reduce corpus artifacts. Experiments on Guarani-Spanish and Quechua-Spanish translation show consistent chrF++ improvements from synthetic data augmentation, while diagnostic experiments on Aymara highlight the limitations of generic preprocessing for highly agglutinative languages.
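The preprocessing step mentioned in the abstract (orthographic normalization followed by noise-aware filtering) can be illustrated with a minimal sketch. This is not the authors' pipeline: the function names `normalize`, `keep_pair`, and `preprocess`, the apostrophe mapping, and the thresholds `max_ratio` and `min_alpha` are all assumptions chosen for illustration, and real rules would need to be tuned for each language.

```python
import re
import unicodedata

# Hypothetical mapping: apostrophe-like characters folded to a plain ASCII apostrophe.
# Which variants actually occur in a given corpus is an assumption here.
APOSTROPHE_VARIANTS = {
    "\u2019": "'",  # right single quotation mark
    "\u2018": "'",  # left single quotation mark
    "\u02bc": "'",  # modifier letter apostrophe
    "\u00b4": "'",  # acute accent used as an apostrophe
    "`": "'",       # backtick used as an apostrophe
}


def normalize(text: str) -> str:
    """Orthographic normalization: NFC form, one apostrophe, single spaces."""
    text = unicodedata.normalize("NFC", text)
    for variant, canonical in APOSTROPHE_VARIANTS.items():
        text = text.replace(variant, canonical)
    return re.sub(r"\s+", " ", text).strip()


def keep_pair(src: str, tgt: str, max_ratio: float = 2.5, min_alpha: float = 0.5) -> bool:
    """Heuristic noise filter (illustrative thresholds, not from the paper)."""
    if not src or not tgt:
        return False  # one side is empty
    if src.lower() == tgt.lower():
        return False  # source copied into target
    if max(len(src), len(tgt)) / min(len(src), len(tgt)) > max_ratio:
        return False  # implausible length mismatch
    for side in (src, tgt):
        alpha = sum(ch.isalpha() for ch in side)
        if alpha / len(side) < min_alpha:
            return False  # mostly digits, punctuation, or markup
    return True


def preprocess(pairs):
    """Normalize both sides, drop noisy pairs, and remove exact duplicates."""
    cleaned, seen = [], set()
    for src, tgt in pairs:
        src, tgt = normalize(src), normalize(tgt)
        if keep_pair(src, tgt) and (src, tgt) not in seen:
            seen.add((src, tgt))
            cleaned.append((src, tgt))
    return cleaned
```

Normalizing before filtering means that pairs differing only in apostrophe style or spacing collapse into a single entry during deduplication. The same filter could in principle be applied to synthetic pairs before they are mixed with curated data, although the abstract does not say whether the authors do so.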

Citation Metrics

Citations: 0

Influential citations: 0

References: 16

Year: 2026

Venue: N/A
