This paper describes the fragmented state of natural language processing (NLP) for Indic languages and proposes a unified computational architecture based on Pānini's grammar, which formalizes the morphosyntactic structure these languages share. By introducing a four-part benchmark suite grounded in this Pāninian framework, the authors aim to improve the accuracy, data efficiency, and transferability of NLP systems across Indic languages. The central argument is that building on this shared linguistic structure could pool sparse per-language resources and support more robust language models.
A unified Pāninian framework could revolutionize NLP for over a billion speakers by merging disparate language resources into a single, high-performance system.
More than a billion people communicate in Indic languages, yet the natural language processing infrastructure serving them remains fragmented and underdeveloped. The cause is structural: the field organizes its tools and benchmarks around individual languages or small subsets of genealogical language families, building separate analyzers, parsers, and datasets for each language and starting over for the next. This overlooks a deep regularity. Through more than two millennia of convergence around Sanskrit, Indic languages came to share a morphosyntactic architecture formalized in Pānini's grammar, the Astādhyāyī. This shared architecture cuts across genealogical lines, uniting languages from different families under a common framework. We argue that this Pāninian framework supplies the unifying computational architecture the field has lacked, and that benchmarks grounded explicitly in it would make Indic language systems more accurate, more data-efficient, and more transferable, effectively pooling many apparently disparate and sparse Indic language resources into a single high-resource metalanguage foundation. We propose a four-part benchmark suite to render this shared architecture explicit, measurable, and ready to be leveraged for practical applications. We also highlight a question it raises for interpretability research: whether neural models trained on these languages come to represent Pānini's categories on their own.