Luxembourg | Feb 12, 2026 | arXiv:2602.11795

A Subword Embedding Approach for Variation Detection in Luxembourgish User Comments

Anne-Marie Lutgen, Alistair Plum, Christoph Purschke

AI Summary

This paper introduces a subword embedding approach to detect lexical and orthographic variation in user-generated text, specifically addressing the challenges of "noisy" and low-resource settings without relying on normalization or predefined variant lists. The method trains subword embeddings on raw Luxembourgish user comments and clusters related forms using a combination of cosine similarity and n-gram similarity. The results demonstrate the effectiveness of distributional modeling in uncovering meaningful patterns of variation, aligning with existing dialectal and sociolinguistic research.

Key Contribution

A subword embedding approach that uncovers lexical and orthographic variation in "noisy" user comments without prior normalization or predefined variant lists.

Abstract

This paper presents an embedding-based approach to detecting variation without relying on prior normalisation or predefined variant lists. The method trains subword embeddings on raw text and groups related forms through combined cosine and n-gram similarity. This allows spelling and morphological diversity to be examined and analysed as linguistic structure rather than treated as noise. Using a large corpus of Luxembourgish user comments, the approach uncovers extensive lexical and orthographic variation that aligns with patterns described in dialectal and sociolinguistic research. The induced families capture systematic correspondences and highlight areas of regional and stylistic differentiation. The procedure does not strictly require manual annotation, but does produce transparent clusters that support both quantitative and qualitative analysis. The results demonstrate that distributional modelling can reveal meaningful patterns of variation even in "noisy" or low-resource settings, offering a reproducible methodological framework for studying language variety in multilingual and small-language contexts.
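The grouping step described in the abstract, combining cosine similarity over embeddings with character n-gram similarity, might look roughly like the sketch below. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not say how the two scores are combined, so the weighted average (`alpha`), the Jaccard overlap of character trigrams, the `threshold`, and the union-find grouping are all hypothetical choices. The vectors are hand-written toy values standing in for trained subword embeddings, and the word forms are picked only for illustration.

```python
import math
from itertools import combinations


def char_ngrams(word, n=3):
    """Set of character n-grams of a word, padded with boundary markers."""
    padded = f"<{word}>"
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def ngram_similarity(a, b, n=3):
    """Jaccard overlap of character n-grams (an assumed choice of measure)."""
    ga, gb = char_ngrams(a, n), char_ngrams(b, n)
    union = ga | gb
    return len(ga & gb) / len(union) if union else 0.0


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def combined_similarity(a, b, vectors, alpha=0.5):
    """Weighted mix of embedding and surface similarity (alpha is hypothetical)."""
    return alpha * cosine(vectors[a], vectors[b]) + (1 - alpha) * ngram_similarity(a, b)


def variant_families(vectors, threshold=0.6, alpha=0.5):
    """Group words whose combined similarity reaches the threshold (union-find)."""
    parent = {w: w for w in vectors}

    def find(w):
        while parent[w] != w:
            parent[w] = parent[parent[w]]  # path compression
            w = parent[w]
        return w

    for a, b in combinations(vectors, 2):
        if combined_similarity(a, b, vectors, alpha) >= threshold:
            parent[find(a)] = find(b)

    families = {}
    for w in vectors:
        families.setdefault(find(w), set()).add(w)
    return list(families.values())


# Toy vectors for illustration only; real ones would come from a subword
# embedding model trained on the raw comments.
toy_vectors = {
    "vläicht": [0.9, 0.1, 0.0],
    "villäicht": [0.85, 0.15, 0.05],
    "haus": [0.0, 0.2, 0.95],
}

families = variant_families(toy_vectors)
for family in families:
    print(sorted(family))
```

With these toy inputs, the two spelling variants fall into one family and the unrelated form stays on its own. Requiring both distributional and surface similarity is one way to keep synonyms with unrelated spellings, or look-alike words with different usage, out of the same family.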

Data Curation & Synthetic Data, Natural Language Processing, Recommendation & Information Retrieval

Citation Metrics

Citations: 0

Influential citations: 0

References: 35

Year: 2026

Venue: N/A
