May 5, 2026arXiv:2605.03742

Benchmarking Parameter-Efficient Fine-Tuning of Large Language Models for Low-Resource Tajik Text Generation with the Tajik Web Corpus

AI Summary

This paper introduces the Tajik Web Corpus, the largest open-access corpus of Tajik text (1.11B characters), and benchmarks parameter-efficient fine-tuning (PEFT) methods for Tajik text generation using various LLMs. The study compares full fine-tuning, LoRA, and QLoRA across autoregressive, encoder-decoder, and encoder-only models, evaluating perplexity, cross-entropy loss, memory usage, and training time. Mistral 7B with QLoRA (r=16) achieved the best perplexity (5.03), while full fine-tuning of smaller GPT-2 models resulted in lower perplexity but induced catastrophic forgetting.

Key Contribution

Forget scaling laws: QLoRA-tuned Mistral 7B crushes other architectures for low-resource Tajik text generation, highlighting the importance of architecture choice in PEFT.

Abstract

This paper is devoted to the adaptation of generative large language models for the Tajik language, a low-resource language with Cyrillic script. To overcome the shortage of digital text resources, the author created and publicly released the Tajik Web Corpus, the largest open-access corpus of Tajik, comprising 319,298 documents (~1.11 billion characters). On a subsample of 10,000 documents, 17 configurations were benchmarked, covering autoregressive, encoder-decoder, and encoder-only models with three fine-tuning strategies: full fine-tuning, LoRA, and QLoRA (ranks 8 and 16). Quality was assessed via perplexity and cross-entropy loss; peak GPU memory and training time were also recorded. Best results were achieved by Mistral 7B with QLoRA (r=16): mean perplexity 5.03, standard deviation 0.03. Increasing rank from 8 to 16 gave statistically insignificant improvement while raising memory consumption. For small GPT-2 family models, full fine-tuning yielded lower perplexity (3.48 for GPT-2 Medium) than LoRA (7.60-8.42), but induced catastrophic forgetting. The encoder-only XLM-RoBERTa showed the worst results (perplexity 59.3). The novelty lies in creating the largest verified Tajik corpus and the first systematic analysis of PEFT effectiveness for Tajik text generation. Practical value lies in recommendations for architecture and fine-tuning strategy selection, optimizing computational costs without substantial quality loss.

Data Curation & Synthetic Data Eval Frameworks & Benchmarks Natural Language Processing

Citation Metrics

Citations0

Influential citations0

References18

Year2026

VenueN/A

Related Papers

Finding related papers...

Search

Benchmarking Parameter-Efficient Fine-Tuning of Large Language Models for Low-Resource Tajik Text Generation with the Tajik Web Corpus

Related Papers