The paper introduces LilMoo, a 0.6B-parameter Hindi language model trained from scratch with a transparent, reproducible pipeline. The authors construct a high-quality Hindi corpus (GigaLekh) using heuristic and LLM-as-a-judge filtering, along with bilingual augmentation from curated English data. LilMoo outperforms comparably sized multilingual models such as Qwen2.5-0.5B and Qwen3-0.6B, demonstrating the effectiveness of language-specific pretraining for low-resource languages.
LilMoo shows that a carefully trained 0.6B Hindi model can outperform comparably sized multilingual models.
The dominance of large multilingual foundation models has widened linguistic inequalities in Natural Language Processing (NLP), often leaving low-resource languages underrepresented. This paper introduces LilMoo, a 0.6-billion-parameter Hindi language model trained entirely from scratch to address this gap. Unlike prior Hindi models that rely on continual pretraining from opaque multilingual foundations, LilMoo is developed through a fully transparent and reproducible pipeline optimized for limited compute environments. We construct a high-quality Hindi corpus (GigaLekh) filtered through both heuristic and learned (LLM-as-a-judge) methods, complemented by bilingual augmentation with curated English data. Using this dataset, we explore various training recipes for small-scale language models. Across comprehensive evaluation suites, LilMoo consistently outperforms comparably sized multilingual baselines such as Qwen2.5-0.5B and Qwen3-0.6B, demonstrating that well-designed language-specific pretraining can rival large multilingual models in the sub-billion-parameter range.
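The abstract describes a two-stage corpus filter: cheap heuristics first, then a learned LLM-as-a-judge score. The sketch below illustrates that general idea only; it is not the GigaLekh pipeline. The function names, the Devanagari-ratio heuristic, the thresholds, and the `judge` callable (a stand-in for whatever model assigns a quality score) are all assumptions made for illustration.

```python
import unicodedata
from typing import Callable, Iterable, List

# Unicode block for Devanagari, the script used to write Hindi.
DEVANAGARI_START, DEVANAGARI_END = 0x0900, 0x097F


def devanagari_ratio(text: str) -> float:
    """Fraction of letters and combining marks that fall in the Devanagari block."""
    # Vowel signs and the virama are combining marks (category M*), not letters,
    # so they are counted explicitly alongside alphabetic characters.
    chars = [
        ch for ch in text
        if ch.isalpha() or unicodedata.category(ch).startswith("M")
    ]
    if not chars:
        return 0.0
    deva = sum(1 for ch in chars if DEVANAGARI_START <= ord(ch) <= DEVANAGARI_END)
    return deva / len(chars)


def heuristic_pass(text: str, min_chars: int = 50, min_ratio: float = 0.7) -> bool:
    """Cheap first-stage check: long enough and mostly Devanagari (assumed thresholds)."""
    return len(text) >= min_chars and devanagari_ratio(text) >= min_ratio


def filter_corpus(
    docs: Iterable[str],
    judge: Callable[[str], float],  # hypothetical scorer returning a value in [0, 1]
    judge_threshold: float = 0.5,   # assumed cutoff, not taken from the paper
    min_chars: int = 50,
    min_ratio: float = 0.7,
) -> List[str]:
    """Keep documents that pass the heuristics and then the judge score."""
    kept = []
    for doc in docs:
        # Run the cheap check first so the expensive judge sees fewer documents.
        if not heuristic_pass(doc, min_chars, min_ratio):
            continue
        if judge(doc) >= judge_threshold:
            kept.append(doc)
    return kept
```

Running the heuristics before the judge is a cost choice: the judge is presumably an LLM call, so it is only applied to documents that already look like usable Hindi text.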