UZH | Apr 13, 2026 | arXiv:2604.11233

RUMLEM: A Dictionary-Based Lemmatizer for Romansh

Dominic P. Fischer, Zachary Hopton, Jannis Vamvas

AI Summary

This paper introduces RUMLEM, a dictionary-based lemmatizer for Romansh that covers the five main varieties and the supra-regional standard variety, Rumantsch Grischun. RUMLEM draws on comprehensive, community-driven morphological databases and covers 77-84% of the words in a typical Romansh text. Because each variety has its own database, the system can also identify the variety of a text, which it does correctly in 95% of cases on 30'000 texts of varying lengths. A proof of concept further indicates that the lemmatizer can be used to separate Romansh from non-Romansh text.

Key Contribution

A single lemmatizer handles the five main Romansh varieties plus the standard variety Rumantsch Grischun, covering 77-84% of the words in a typical text and identifying the variety correctly in 95% of cases.

Abstract

Lemmatization -- the task of mapping an inflected word form to its dictionary form -- is a crucial component of many NLP applications. In this paper, we present RUMLEM, a lemmatizer that covers the five main varieties of Romansh as well as the supra-regional standard variety Rumantsch Grischun. It is based on comprehensive, community-driven morphological databases for Romansh, enabling RUMLEM to cover 77-84% of the words in a typical Romansh text. Since there is a dedicated database for each Romansh variety, an additional application of RUMLEM is variety-aware language classification. Evaluation on 30'000 Romansh texts of varying lengths shows that RUMLEM correctly identifies the variety in 95% of cases. In addition, a proof of concept demonstrates the feasibility of Romansh vs. non-Romansh language classification based on the lemmatizer.
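The dictionary-based approach described in the abstract can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the toy entries are placeholders and are not taken from the RUMLEM databases, the function names are hypothetical, and the abstract does not state how RUMLEM scores varieties or separates Romansh from other languages. The sketch simply picks the variety whose lookup table covers the largest share of tokens and applies an arbitrary coverage threshold for the Romansh vs. non-Romansh decision.

```python
import re

# Toy per-variety lookup tables: inflected form -> set of candidate lemmas.
# Placeholder entries for illustration only, not real database content.
TOY_DATABASES = {
    "rumantsch_grischun": {
        "chasa": {"chasa"},
        "chasas": {"chasa"},
        "chantan": {"chantar"},
    },
    "sursilvan": {
        "casa": {"casa"},
        "casas": {"casa"},
        "cantan": {"cantar"},
    },
}


def tokenize(text):
    """Lowercase the text and keep runs of letters (Unicode-aware)."""
    return re.findall(r"[^\W\d_]+", text.lower())


def lemmatize(tokens, database):
    """Pair each token with its sorted candidate lemmas (empty if unknown)."""
    return [(tok, sorted(database.get(tok, ()))) for tok in tokens]


def coverage(tokens, database):
    """Share of tokens that have an entry in the given database."""
    if not tokens:
        return 0.0
    return sum(tok in database for tok in tokens) / len(tokens)


def classify_variety(tokens, databases):
    """Return the variety with the highest coverage, plus all scores.

    Assumed scoring rule; ties go to the first variety in dict order.
    """
    scores = {name: coverage(tokens, db) for name, db in databases.items()}
    best = max(scores, key=scores.get)
    return best, scores


def looks_romansh(tokens, databases, threshold=0.5):
    """Assumed heuristic: Romansh if any variety covers enough tokens."""
    _, scores = classify_variety(tokens, databases)
    return max(scores.values()) >= threshold


tokens = tokenize("Las casas cantan")
variety, scores = classify_variety(tokens, TOY_DATABASES)
lemmas = lemmatize(tokens, TOY_DATABASES[variety])
```

In this toy run, two of the three tokens are found in the placeholder Sursilvan table and none in the Rumantsch Grischun one, so the sketch selects Sursilvan. The unknown token is returned with an empty lemma list rather than a guess, which mirrors the fact that a dictionary-based system only covers the forms its database contains.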

Natural Language Processing

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
