University of Ljubljana

Mar 2, 2026

arXiv:2603.01691

Building a Strong Instruction Language Model for a Less-Resourced Language

Domen Vreš, Tjaša Arčon, Timotej Petrič, Dario Vajda, Marko Robnik-Šikonja, Iztok Lebar Bajec

AI Summary

The authors adapted the Gemma 3 model to Slovene by using a three-stage continual pre-training approach, followed by a two-stage supervised fine-tuning (SFT) process. This involved training on a mixed dataset of Slovene, English, Bosnian, Serbian, and Croatian tokens for pre-training, and a combination of English and Slovene examples for SFT. The resulting model, GaMS3-12B, demonstrates superior performance compared to the original Gemma 3 and achieves competitive results against GPT-4o in Slovene language tasks.

Key Contribution

A carefully adapted 12B model can perform comparably to the much larger GPT-4o in a less-resourced language, which suggests that targeted continual pre-training and fine-tuning can offset a gap in scale.

Abstract

Large language models (LLMs) have become an essential tool for natural language processing and artificial intelligence in general. Current open-source models are primarily trained on English texts, resulting in poorer performance on less-resourced languages and cultures. We present a set of methodological approaches necessary for the successful adaptation of an LLM to a less-resourced language, and demonstrate them using the Slovene language. We present GaMS3-12B, a generative model for Slovene with 12 billion parameters, and demonstrate that it is the best-performing open-source model for Slovene within its parameter range. We adapted the model to the Slovene language using three-stage continual pre-training of the Gemma 3 model, followed by two-stage supervised fine-tuning (SFT). We trained the model on a combination of 140B Slovene, English, Bosnian, Serbian, and Croatian pre-training tokens, and over 200 thousand English and Slovene SFT examples. We evaluate GaMS3-12B on the Slovenian-LLM-Eval datasets, English-to-Slovene translation, and the Slovene LLM arena. We show that the described model outperforms the 12B Gemma 3 model across all three scenarios and performs comparably to the much larger commercial GPT-4o model in the Slovene LLM arena, achieving a win rate of over 60%.

Data Curation & Synthetic Data, Natural Language Processing, Open-Source Models & Weights

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
