Feb 19, 2026 · arXiv:2602.17655

What Language is This? Ask Your Tokenizer

Clara Meister, Ahmetcan Yavuz, Pietro Lesci, Tiago Pimentel

AI Summary

The paper introduces UniLID, a language identification (LID) method based on the UnigramLM tokenization algorithm, designed to improve performance in low-resource and closely related language settings. UniLID learns language-conditional unigram distributions over a shared tokenizer vocabulary, treating segmentation as language-specific. Experiments demonstrate that UniLID achieves competitive performance on standard benchmarks, significantly improves sample efficiency in low-resource scenarios, and delivers substantial gains in fine-grained dialect identification compared to baselines like fastText, GlotLID, and CLD3.

Key Contribution

UniLID, a tokenization-based language identification method built on UnigramLM, surpasses 70% accuracy with as few as five labeled samples per language.

Abstract

Language Identification (LID) is an important component of many multilingual natural language processing pipelines, where it facilitates corpus curation, training data analysis, and cross-lingual evaluation of large language models. Despite near-perfect performance on high-resource languages, existing systems remain brittle in low-resource and closely related language settings. We introduce UniLID, a simple and efficient LID method based on the UnigramLM tokenization algorithm, leveraging its probabilistic framing, parameter estimation technique, and inference strategy. In short, we learn language-conditional unigram distributions over a shared tokenizer vocabulary but treat segmentation as a language-specific phenomenon. Our formulation is data- and compute-efficient, supports incremental addition of new languages without retraining existing models, and can naturally be integrated into existing language model tokenization pipelines. Empirical evaluations against widely used baselines, including fastText, GlotLID, and CLD3, show that UniLID achieves competitive performance on standard benchmarks, substantially improves sample efficiency in low-resource settings (surpassing 70% accuracy with as few as five labeled samples per language), and delivers large gains on fine-grained dialect identification.
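The idea in the abstract (one unigram distribution per language over a shared vocabulary, with each language choosing its own best segmentation) can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes a toy vocabulary of all substrings up to three characters in place of a real tokenizer vocabulary, and it uses a few rounds of hard EM (segment, count, re-estimate) as a stand-in for UnigramLM's parameter estimation. All function names, the smoothing constant, and the unknown-character penalty are hypothetical.

```python
import math
from collections import Counter

MAX_LEN = 3        # longest token in the toy vocabulary (assumption)
UNK_LOGP = -20.0   # penalty for a character outside the vocabulary (assumption)

def build_vocab(texts, max_len=MAX_LEN):
    """Toy shared vocabulary: every substring of up to max_len characters."""
    return {t[i:i + k] for t in texts for k in range(1, max_len + 1)
            for i in range(len(t) - k + 1)}

def viterbi(text, logp, max_len=MAX_LEN):
    """Return (log-prob, pieces) of the best segmentation under one unigram model."""
    n = len(text)
    best = [-math.inf] * (n + 1)
    back = [0] * (n + 1)
    best[0] = 0.0
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            piece = text[start:end]
            # Unknown single characters stay reachable at a fixed penalty.
            lp = logp.get(piece, UNK_LOGP if len(piece) == 1 else None)
            if lp is not None and best[start] + lp > best[end]:
                best[end] = best[start] + lp
                back[end] = start
    pieces, end = [], n
    while end > 0:
        pieces.append(text[back[end]:end])
        end = back[end]
    return best[n], pieces[::-1]

def fit_language(texts, vocab, iters=3, alpha=0.01):
    """Hard-EM estimate of one language's unigram distribution (simplification)."""
    logp = {tok: -math.log(len(vocab)) for tok in vocab}  # uniform start
    for _ in range(iters):
        counts = Counter()
        for t in texts:
            counts.update(viterbi(t, logp)[1])
        total = sum(counts.values()) + alpha * len(vocab)
        logp = {tok: math.log((counts[tok] + alpha) / total) for tok in vocab}
    return logp

def identify(text, models):
    """Pick the language whose own best segmentation scores the text highest."""
    scores = {lang: viterbi(text, logp)[0] for lang, logp in models.items()}
    return max(scores, key=scores.get)

# Toy training data, three samples per language (illustrative only).
train = {
    "en": ["the cat sat on the mat", "the dog and the cat", "this is the house"],
    "de": ["der hund und die katze", "das ist das haus", "die katze ist auf der matte"],
}
vocab = build_vocab([t for texts in train.values() for t in texts])
models = {lang: fit_language(texts, vocab) for lang, texts in train.items()}

print(identify("the dog sat on the mat", models))
print(identify("das haus ist auf der matte", models))
```

In this sketch each language's parameters are fit independently, so adding a language amounts to one more `fit_language` call and one more entry in `models`, with the existing entries left untouched. That mirrors the incremental-addition property described above, subject to the shared vocabulary covering the new language's text.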

Data Curation & Synthetic Data · Eval Frameworks & Benchmarks · Natural Language Processing

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
