Bonn-Aachen International Center for Information, Helmholtz, Lamarr Institute for Machine Learning
Mar 3, 2026
arXiv:2603.03508

Raising Bars, Not Parameters: LilMoo Compact Language Model for Hindi

Shiza Fatimah, Aniket Sen, Sophia Falk, Florian Mai, Lucie Flek, Nicholas Kluge Corrêa

AI Summary

The paper introduces LilMoo, a 0.6B-parameter language model for Hindi, trained from scratch using a transparent and reproducible pipeline. The authors construct a high-quality Hindi corpus (GigaLekh) using heuristic and LLM-as-a-judge filtering, along with bilingual augmentation using curated English data. LilMoo outperforms comparably sized multilingual models, demonstrating the effectiveness of language-specific pretraining for low-resource languages.

Key Contribution

LilMoo shows that a carefully trained 0.6B-parameter Hindi model can outperform comparably sized multilingual models such as Qwen2.5-0.5B and Qwen3-0.6B.

Abstract

The dominance of large multilingual foundation models has widened linguistic inequalities in Natural Language Processing (NLP), often leaving low-resource languages underrepresented. This paper introduces LilMoo, a 0.6-billion-parameter Hindi language model trained entirely from scratch to address this gap. Unlike prior Hindi models that rely on continual pretraining from opaque multilingual foundations, LilMoo is developed through a fully transparent and reproducible pipeline optimized for limited compute environments. We construct a high-quality Hindi corpus (GigaLekh) filtered through both heuristic and learned (LLM-as-a-judge) methods, complemented by bilingual augmentation with curated English data. Using this dataset, we explore various training recipes for small-scale language models. Across comprehensive evaluation suites, LilMoo consistently outperforms comparably sized multilingual baselines such as Qwen2.5-0.5B and Qwen3-0.6B, demonstrating that well-designed language-specific pretraining can rival large multilingual models at the sub-billion-parameter range.
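The abstract mentions heuristic filtering as one stage of corpus construction but does not specify the rules. As a rough illustration only, the sketch below shows one common kind of heuristic for a Hindi corpus: keeping documents that are long enough and written mostly in Devanagari script. The function names and both thresholds (`min_chars`, `min_ratio`) are hypothetical and are not taken from the paper.

```python
# Illustrative sketch only: NOT the filter used to build GigaLekh.
# Function names and thresholds are assumptions made for this example.

DEVANAGARI_START = 0x0900  # first code point of the Unicode Devanagari block
DEVANAGARI_END = 0x097F    # last code point of the Unicode Devanagari block


def devanagari_ratio(text: str) -> float:
    """Return the fraction of non-whitespace characters in the Devanagari block."""
    chars = [ch for ch in text if not ch.isspace()]
    if not chars:
        return 0.0
    in_block = sum(1 for ch in chars if DEVANAGARI_START <= ord(ch) <= DEVANAGARI_END)
    return in_block / len(chars)


def keep_document(text: str, min_chars: int = 20, min_ratio: float = 0.5) -> bool:
    """Keep a document if it is long enough and mostly Devanagari.

    min_chars and min_ratio are hypothetical defaults chosen for illustration.
    """
    stripped = text.strip()
    if len(stripped) < min_chars:
        return False
    return devanagari_ratio(stripped) >= min_ratio


if __name__ == "__main__":
    samples = [
        "यह एक सरल हिंदी वाक्य है जो फ़िल्टर से गुजरना चाहिए।",
        "This sentence is entirely in English and should be dropped.",
    ]
    for doc in samples:
        print(keep_document(doc), doc[:30])
```

A learned stage such as LLM-as-a-judge scoring would typically run after a cheap pass like this, so that the more expensive model only sees documents that already meet basic script and length criteria.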

Natural Language Processing, Open-Source Models & Weights, Training Efficiency & Optimization

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
