This paper introduces RUMLEM, a dictionary-based lemmatizer for the Romansh language, covering its five main varieties and the supra-regional standard variety Rumantsch Grischun. RUMLEM draws on comprehensive, community-driven morphological databases to cover 77-84% of the words in a typical Romansh text. Because each variety has its own database, the system also supports variety-aware language classification, identifying the correct variety in 95% of cases. A proof of concept further shows that the lemmatizer can be used to separate Romansh from non-Romansh text.
A single lemmatizer now handles five Romansh varieties plus the standard form, covering 77-84% of words in typical texts and identifying the variety in 95% of cases.
Lemmatization -- the task of mapping an inflected word form to its dictionary form -- is a crucial component of many NLP applications. In this paper, we present RUMLEM, a lemmatizer that covers the five main varieties of Romansh as well as the supra-regional standard variety Rumantsch Grischun. It is based on comprehensive, community-driven morphological databases for Romansh, enabling RUMLEM to cover 77-84% of the words in a typical Romansh text. Since there is a dedicated database for each Romansh variety, an additional application of RUMLEM is variety-aware language classification. Evaluation on 30'000 Romansh texts of varying lengths shows that RUMLEM correctly identifies the variety in 95% of cases. In addition, a proof of concept demonstrates the feasibility of Romansh vs. non-Romansh language classification based on the lemmatizer.
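The general idea of dictionary-based lemmatization with one database per variety, and of classifying a text by how well each database covers it, can be illustrated with a minimal Python sketch. This is not the RUMLEM implementation: the function names, the coverage-based decision rule, and the toy form-to-lemma entries are all assumptions made for illustration, and the entries are not drawn from the actual morphological databases.

```python
# Toy form -> lemma tables, one per variety.
# Hypothetical entries for illustration only; not taken from the RUMLEM databases.
DATABASES = {
    "sursilvan": {"casa": "casa", "casas": "casa"},
    "vallader": {"chasa": "chasa", "chasas": "chasa"},
}


def lemmatize(token, variety):
    """Return the lemma of `token` in `variety`, or None if the form is not covered."""
    return DATABASES[variety].get(token.lower())


def coverage(tokens, variety):
    """Fraction of tokens whose form is found in the variety's database."""
    if not tokens:
        return 0.0
    hits = sum(1 for t in tokens if lemmatize(t, variety) is not None)
    return hits / len(tokens)


def classify_variety(tokens):
    """Pick the variety whose database covers the largest share of tokens.

    Assumed decision rule for this sketch; ties go to the first variety
    in dictionary order.
    """
    return max(DATABASES, key=lambda v: coverage(tokens, v))
```

For example, `classify_variety(["chasas", "chasa", "casa"])` compares the share of covered tokens per variety and returns the variety with the highest share. The same coverage score could, in principle, be thresholded to flag text that no Romansh database covers well, which is the intuition behind the Romansh vs. non-Romansh proof of concept.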