TAU | UZH | Mar 16, 2026 | arXiv:2603.15969

Robust Language Identification for Romansh Varieties

Charlotte Model, Sina Ahmadi, Jannis Vamvas

AI Summary

This paper introduces a Support Vector Machine (SVM) based language identification (LID) system for distinguishing between Romansh idioms, including the supra-regional variety Rumantsch Grischun. The system is evaluated on a newly curated benchmark dataset spanning two domains. It reaches an average in-domain accuracy of 97%, demonstrating that automated Romansh idiom identification is feasible.

Key Contribution

A Romansh idiom classifier with 97% average in-domain accuracy, enabling idiom-aware NLP tools such as spell checking and machine translation for a low-resource language.

Abstract

The Romansh language has several regional varieties, called idioms, which sometimes have limited mutual intelligibility. Despite this linguistic diversity, there has been a lack of documented efforts to build a language identification (LID) system that can distinguish between these idioms. Since Romansh LID should also be able to recognize Rumantsch Grischun, a supra-regional variety that combines elements of several idioms, this makes for a novel and interesting classification problem. In this paper, we present a LID system for Romansh idioms based on an SVM approach. We evaluate our model on a newly curated benchmark across two domains and find that it reaches an average in-domain accuracy of 97%, enabling applications such as idiom-aware spell checking or machine translation. Our classifier is publicly available.
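To make the SVM-based approach concrete, here is a minimal sketch of a linear SVM text classifier. It is not the authors' implementation: the abstract does not specify features or training details, so the character n-gram features, the Pegasos-style subgradient training, the hyperparameters `lam` and `epochs`, and the labels `idiom_a` and `idiom_b` are all assumptions made for illustration. The training strings are synthetic placeholders, not Romansh text.

```python
from collections import Counter


def char_ngrams(text, n_min=1, n_max=3):
    """Return L2-normalised character n-gram counts for one text (assumed features)."""
    padded = f" {text.lower()} "
    counts = Counter(
        padded[i:i + n]
        for n in range(n_min, n_max + 1)
        for i in range(len(padded) - n + 1)
    )
    norm = sum(v * v for v in counts.values()) ** 0.5
    return {gram: v / norm for gram, v in counts.items()}


class LinearSVM:
    """One-vs-rest linear SVM trained with Pegasos-style subgradient steps."""

    def __init__(self, lam=0.01, epochs=50):
        self.lam = lam          # L2 regularisation strength (hypothetical value)
        self.epochs = epochs    # passes over the training data (hypothetical value)
        self.weights = {}       # label -> sparse weight vector

    def _score(self, label, x):
        w = self.weights[label]
        return sum(w.get(gram, 0.0) * v for gram, v in x.items())

    def fit(self, texts, labels):
        data = [(char_ngrams(t), y) for t, y in zip(texts, labels)]
        self.weights = {label: {} for label in sorted(set(labels))}
        step = 0
        for _ in range(self.epochs):
            for x, y in data:
                step += 1
                eta = 1.0 / (self.lam * step)  # decaying learning rate
                for label, w in self.weights.items():
                    target = 1.0 if label == y else -1.0
                    # Check the hinge-loss margin before touching the weights.
                    violated = target * self._score(label, x) < 1.0
                    # Shrink all weights (gradient of the L2 penalty).
                    for gram in w:
                        w[gram] *= 1.0 - eta * self.lam
                    # Move towards the example only if the margin is violated.
                    if violated:
                        for gram, v in x.items():
                            w[gram] = w.get(gram, 0.0) + eta * target * v
        return self

    def predict(self, text):
        x = char_ngrams(text)
        return max(self.weights, key=lambda label: self._score(label, x))


# Synthetic placeholder data: NOT real Romansh, only two made-up "idioms".
train_texts = ["abab abab", "baba abab", "xyxy xyxy", "yxyx xyxy"]
train_labels = ["idiom_a", "idiom_a", "idiom_b", "idiom_b"]

model = LinearSVM().fit(train_texts, train_labels)
predictions = [model.predict(t) for t in train_texts]
accuracy = sum(p == y for p, y in zip(predictions, train_labels)) / len(train_labels)
print(f"training accuracy: {accuracy:.2f}")
```

In practice one would more likely reach for an established library, for example scikit-learn's `TfidfVectorizer` combined with `LinearSVC`, and would report accuracy on held-out data per domain rather than on the training set as this toy example does.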

Natural Language Processing, Speech & Audio

Citation Metrics

Citations: 0

Influential citations: 0

References: 18

Year: 2026

Venue: N/A
