The authors investigate continued pretraining (CPT) of the Llama 3.1 8B model to improve its Estonian language capabilities without sacrificing English performance. They perform CPT on a data mixture that increases Estonian exposure while approximating the original training distribution through English replay and the inclusion of code, mathematics, and instruction-like data. Subsequent supervised fine-tuning, preference optimization, and chat vector merging add instruction-following behavior. On Estonian benchmarks, the resulting models show consistent gains over the original base model and its instruction-tuned variant, which supports the approach for improving single-language capabilities in multilingual LLMs.
Continued pretraining with a carefully balanced data mixture can substantially improve a multilingual LLM's performance on a low-resource language like Estonian without compromising its English capabilities.
Large language models (LLMs) are predominantly trained on English-centric data, resulting in uneven performance for smaller languages. We study whether continued pretraining (CPT) can substantially improve Estonian capabilities in a pretrained multilingual LLM while preserving its English and general reasoning performance. Using Llama 3.1 8B as the main base model, we perform CPT on a mixture that increases Estonian exposure while approximating the original training distribution through English replay and the inclusion of code, mathematics, and instruction-like data. We subsequently apply supervised fine-tuning, preference optimization, and chat vector merging to introduce robust instruction-following behavior. Evaluation on a comprehensive suite of Estonian benchmarks shows consistent gains in linguistic competence, knowledge, reasoning, translation quality, and instruction-following compared to the original base model and its instruction-tuned variant, while maintaining competitive performance on English benchmarks. These findings indicate that CPT, with an appropriately balanced data mixture, together with post-training alignment, can substantially improve single-language capabilities in pretrained multilingual LLMs.
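The abstract names chat vector merging but does not say how it is carried out. In the formulation commonly described for this technique, the "chat vector" is the per-parameter difference between an instruction-tuned model and its base model, and it is added to the continued-pretrained weights. The following is a minimal sketch under that assumption, not the authors' implementation: plain lists of floats stand in for weight tensors, and the function names and the `scale` parameter are hypothetical.

```python
from typing import Dict, List

# Toy stand-in for a model state dict: parameter name -> flat list of weights.
Weights = Dict[str, List[float]]


def chat_vector(base: Weights, instruct: Weights) -> Weights:
    """Return the per-parameter difference (instruct - base)."""
    if base.keys() != instruct.keys():
        raise ValueError("base and instruct must have the same parameter names")
    return {
        name: [i - b for i, b in zip(instruct[name], base[name])]
        for name in base
    }


def merge_chat_vector(cpt: Weights, delta: Weights, scale: float = 1.0) -> Weights:
    """Add a (scaled) chat vector to continued-pretrained weights.

    `scale` is an assumed knob for illustration; the source does not mention one.
    """
    if cpt.keys() != delta.keys():
        raise ValueError("cpt and delta must have the same parameter names")
    return {
        name: [w + scale * d for w, d in zip(cpt[name], delta[name])]
        for name in cpt
    }


# Hypothetical toy weights, chosen so the arithmetic is exact in floating point.
base = {"layer.weight": [1.0, 2.0]}      # original pretrained model
instruct = {"layer.weight": [1.5, 1.0]}  # its instruction-tuned variant
cpt = {"layer.weight": [2.0, 2.0]}       # model after continued pretraining

merged = merge_chat_vector(cpt, chat_vector(base, instruct))
print(merged)  # {'layer.weight': [2.5, 1.0]}
```

With real checkpoints the same arithmetic would be applied tensor by tensor, and it only makes sense when all three models share an architecture and parameter names.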