The paper introduces TurkicNLP, an open-source Python library designed to provide a unified NLP pipeline for Turkic languages, addressing the current fragmentation in tooling and resources. It offers functionalities including tokenization, morphological analysis, POS tagging, dependency parsing, NER, transliteration, sentence embeddings, and machine translation through a language-agnostic API. The library's modular architecture integrates rule-based and neural models, supports multiple script families, and ensures interoperability through CoNLL-U standard outputs.
A single Python library provides a consistent NLP pipeline for the Turkic language family, which is spoken by over 200 million people and written in four script families.
Natural language processing for the Turkic language family, spoken by over 200 million people across Eurasia, remains fragmented, with most languages lacking unified tooling and resources. We present TurkicNLP, an open-source Python library providing a single, consistent NLP pipeline for Turkic languages across four script families: Latin, Cyrillic, Perso-Arabic, and Old Turkic Runic. The library covers tokenization, morphological analysis, part-of-speech tagging, dependency parsing, named entity recognition, bidirectional script transliteration, cross-lingual sentence embeddings, and machine translation through one language-agnostic API. A modular multi-backend architecture integrates rule-based finite-state transducers and neural models transparently, with automatic script detection and routing between script variants. Outputs follow the CoNLL-U standard for full interoperability and extension. Code and documentation are hosted at https://github.com/turkic-nlp/turkicnlp .
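To make the idea of automatic script detection concrete, the following is a minimal sketch of how text could be assigned to one of the four script families by counting characters per Unicode block. It is not taken from TurkicNLP: the names `SCRIPT_RANGES`, `char_script`, and `detect_script` are hypothetical, the ranges are illustrative rather than exhaustive, and the library's actual detection and routing logic may work differently.

```python
from collections import Counter

# Illustrative Unicode ranges per script family (an assumption for this
# sketch, not the library's own tables).
SCRIPT_RANGES = {
    "Latin": [(0x0041, 0x005A), (0x0061, 0x007A), (0x00C0, 0x024F)],
    "Cyrillic": [(0x0400, 0x052F)],
    "Perso-Arabic": [
        (0x0600, 0x06FF),
        (0x0750, 0x077F),
        (0xFB50, 0xFDFF),
        (0xFE70, 0xFEFF),
    ],
    "Old Turkic": [(0x10C00, 0x10C4F)],
}


def char_script(ch):
    """Return the script family of a single letter, or None if unknown."""
    if not ch.isalpha():
        return None  # ignore digits, punctuation, whitespace, symbols
    cp = ord(ch)
    for script, ranges in SCRIPT_RANGES.items():
        if any(lo <= cp <= hi for lo, hi in ranges):
            return script
    return None


def detect_script(text):
    """Return the most frequent script family in text, or None if no letters match."""
    counts = Counter(s for s in map(char_script, text) if s is not None)
    if not counts:
        return None
    return counts.most_common(1)[0][0]


if __name__ == "__main__":
    for sample in ["Türkçe", "Қазақ тілі", "ئۇيغۇرچە", "\U00010C00\U00010C01"]:
        print(detect_script(sample))
```

A majority vote over letters keeps the sketch robust to mixed input such as a Cyrillic sentence containing a Latin brand name. A real pipeline would likely also need the language code, since script alone does not identify the language.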