Feb 16, 2026 | arXiv:2602.14675

Crowdsourcing Piedmontese to Test LLMs on Non-Standard Orthography

AI Summary

The authors introduce a crowdsourced dataset of 145 Italian-Piedmontese parallel sentences, focusing on non-standard Piedmontese orthography, to evaluate LLM performance. They benchmarked LLMs on tokenization, topic classification, and machine translation using this dataset. The results show a tokenization penalty for Piedmontese but near-parity classification performance, and demonstrate asymmetric translation capabilities with better translation *from* Piedmontese than *to* it.

Key Contribution

LLMs can classify the topic of text in Piedmontese, an endangered language, almost as well as text in high-resource languages, but still struggle to generate Piedmontese fluently, even with parallel training data.

Abstract

We present a crowdsourced dataset for Piedmontese, an endangered Romance language of northwestern Italy. The dataset comprises 145 Italian-Piedmontese parallel sentences derived from Flores+, with translations produced by speakers writing in their natural orthographic style rather than adhering to standardized conventions, along with manual word alignment. We use this resource to benchmark several large language models on tokenization parity, topic classification, and machine translation. Our analysis reveals that Piedmontese incurs a tokenization penalty relative to higher-resource Romance languages, yet LLMs achieve classification performance approaching that of Italian, French, and English. Machine translation results are asymmetric: models translate adequately from Piedmontese into high-resource languages, but generation into Piedmontese remains challenging. The dataset and code are publicly released.
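
The tokenization penalty mentioned in the abstract can be illustrated with a minimal sketch. It assumes parity is measured as the ratio of total token counts over parallel sentences, which may differ from the exact definition used in the paper. The names `tokenization_parity` and `toy_tokenize` are hypothetical, the toy tokenizer is a stand-in for a real LLM tokenizer, and the strings are placeholders rather than sentences from the dataset.

```python
from typing import Callable, Sequence


def tokenization_parity(
    reference_sentences: Sequence[str],
    target_sentences: Sequence[str],
    tokenize: Callable[[str], Sequence[str]],
) -> float:
    """Ratio of total target tokens to total reference tokens.

    Assumed definition: a value above 1.0 means the target language
    needs more tokens than the reference for the same content.
    """
    if len(reference_sentences) != len(target_sentences):
        raise ValueError("parallel corpora must have the same number of sentences")
    reference_total = sum(len(tokenize(s)) for s in reference_sentences)
    target_total = sum(len(tokenize(s)) for s in target_sentences)
    if reference_total == 0:
        raise ValueError("reference side has no tokens")
    return target_total / reference_total


def toy_tokenize(text: str) -> list[str]:
    """Hypothetical stand-in tokenizer: split on whitespace, then cut
    each word into chunks of at most 4 characters."""
    tokens = []
    for word in text.split():
        tokens.extend(word[i:i + 4] for i in range(0, len(word), 4))
    return tokens


# Placeholder strings, not real Italian or Piedmontese sentences.
reference = ["aaaa bbbb"]      # 2 tokens under toy_tokenize
target = ["aaaaaa bbbbbb"]     # 4 tokens under toy_tokenize
parity = tokenization_parity(reference, target, toy_tokenize)
print(parity)  # 2.0
```

In practice, `tokenize` would wrap the tokenizer of the model under test, and the two lists would hold the Italian and Piedmontese sides of the parallel corpus.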

Data Curation & Synthetic Data | Eval Frameworks & Benchmarks | Natural Language Processing

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
