This paper introduces the Tajik Web Corpus, the largest open-access corpus of Tajik text (1.11B characters), and benchmarks parameter-efficient fine-tuning (PEFT) methods for Tajik text generation across several LLMs. The study compares full fine-tuning, LoRA, and QLoRA on autoregressive, encoder-decoder, and encoder-only models, evaluating perplexity, cross-entropy loss, peak GPU memory, and training time. Mistral 7B with QLoRA (r=16) is reported as the best configuration (perplexity 5.03). Full fine-tuning of the smaller GPT-2 models reached lower perplexity than LoRA on the same models (3.48 for GPT-2 Medium versus 7.60-8.42) but induced catastrophic forgetting.
Forget scaling laws: QLoRA-tuned Mistral 7B crushes other architectures for low-resource Tajik text generation, highlighting the importance of architecture choice in PEFT.
This paper addresses the adaptation of generative large language models to Tajik, a low-resource language written in Cyrillic script. To overcome the shortage of digital text resources, the author created and publicly released the Tajik Web Corpus, the largest open-access corpus of Tajik, comprising 319,298 documents (~1.11 billion characters). On a subsample of 10,000 documents, 17 configurations were benchmarked, covering autoregressive, encoder-decoder, and encoder-only models with three fine-tuning strategies: full fine-tuning, LoRA, and QLoRA (ranks 8 and 16). Quality was assessed via perplexity and cross-entropy loss; peak GPU memory and training time were also recorded. The best results were achieved by Mistral 7B with QLoRA (r=16): mean perplexity 5.03, standard deviation 0.03. Increasing the rank from 8 to 16 gave a statistically insignificant improvement while raising memory consumption. For small GPT-2 family models, full fine-tuning yielded lower perplexity (3.48 for GPT-2 Medium) than LoRA (7.60-8.42), but induced catastrophic forgetting. The encoder-only XLM-RoBERTa showed the worst results (perplexity 59.3). The novelty lies in creating the largest verified Tajik corpus and in the first systematic analysis of PEFT effectiveness for Tajik text generation. The practical value lies in recommendations for choosing an architecture and fine-tuning strategy that reduce computational costs without substantial quality loss.
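The abstract reports both perplexity and cross-entropy loss, and the two are directly related. The sketch below assumes the standard definition (perplexity is the exponential of the mean per-token cross-entropy in nats); the abstract does not state which definition was used. It also shows how a mean and standard deviation over repeated runs could be computed. The function names and the loss values are hypothetical illustrations, not figures from the paper.

```python
import math
import statistics


def perplexity(cross_entropy_nats: float) -> float:
    """Convert a mean per-token cross-entropy (in nats) to perplexity."""
    return math.exp(cross_entropy_nats)


def summarize(run_losses: list[float]) -> tuple[float, float]:
    """Return (mean, sample standard deviation) of perplexity over runs."""
    ppls = [perplexity(loss) for loss in run_losses]
    return statistics.mean(ppls), statistics.stdev(ppls)


if __name__ == "__main__":
    # Hypothetical evaluation losses from three runs with different seeds.
    losses = [1.610, 1.615, 1.620]
    mean_ppl, std_ppl = summarize(losses)
    print(f"perplexity: {mean_ppl:.2f} +/- {std_ppl:.2f}")
```

Under this definition, a perplexity of 5.03 corresponds to a cross-entropy of ln(5.03), roughly 1.62 nats per token, so small differences in loss translate into small differences in perplexity near that range.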