Stony Brook | Jun 23, 2026 | arXiv:2606.24172

A Pāṇinian Foundation for Indic Language Processing

AI Summary

This paper argues that natural language processing (NLP) for Indic languages is fragmented and proposes a unified computational architecture based on Pāṇini's grammar, which formalizes the morphosyntactic architecture these languages share. The authors propose a four-part benchmark suite grounded in this Pāṇinian framework, with the aim of improving the accuracy, data efficiency, and transferability of NLP systems across Indic languages. Their central claim is that making this shared linguistic structure explicit would let sparse per-language resources be pooled and would support more robust language models.

Key Contribution

A unified Pāṇinian framework, with a benchmark suite built on it, intended to let NLP systems serving over a billion speakers pool disparate, sparse language resources into a single high-resource foundation.

Abstract

More than a billion people communicate in Indic languages, yet the natural language processing infrastructure serving them remains fragmented and underdeveloped. The cause is structural: the field organizes its tools and benchmarks around individual languages or small subsets of genealogical language families, building separate analyzers, parsers, and datasets for each language and starting over for the next. This overlooks a deep regularity. Through more than two millennia of convergence around Sanskrit, Indic languages came to share a morphosyntactic architecture formalized in Pāṇini's grammar, the Aṣṭādhyāyī. This shared architecture cuts across genealogical lines, uniting languages through a common framework. We argue that this Pāṇinian framework supplies a unifying computational architecture the field has lacked, and that benchmarks grounded explicitly in it would make Indic language systems more accurate, more data-efficient, and more transferable, effectively merging many apparently disparate and sparse Indic language resources into a single high-resource metalanguage bedrock. We propose a four-part benchmark suite to render this shared architecture explicit, measurable, and ready to be leveraged for practical applications. Moreover, we underscore the question it raises for interpretability research: whether neural models trained on these languages come to represent Pāṇini's categories on their own.

Natural Language Processing

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
