This paper investigates how Large Language Models (LLMs) handle non-standard language varieties, specifically South Tyrolean dialects and varieties of Kurdish, from both computational and sociolinguistic perspectives. It examines how LLMs can be adapted to handle linguistic variation and whether such adaptations contribute to more democratic and decolonial digital strategies. The authors combine computational linguistics techniques with critical sociolinguistics to analyze the challenges and opportunities of processing non-standard languages with generative AI (GenAI).
LLMs' struggles with non-standard languages are not just a technical problem: they reflect and reinforce historical power imbalances embedded in linguistic standardization.
The design of Large Language Models and generative artificial intelligence has been shown to be "unfair" to less-spoken languages and to deepen the digital language divide. Critical sociolinguistic work has also argued that these technologies are not only made possible by prior socio-historical processes of linguistic standardisation, often grounded in European nationalist and colonial projects, but also exacerbate epistemologies of language as "monolithic, monolingual, syntactically standardized systems of meaning". In our paper, we draw on earlier work on the intersections of technology and language policy and bring our respective expertise in critical sociolinguistics and computational linguistics to bear on an interrogation of these arguments. We take two different complexes of non-standard linguistic varieties in our respective repertoires (South Tyrolean dialects, which are widely used in informal communication in South Tyrol, Italy, as well as varieties of Kurdish) as starting points for an interdisciplinary exploration of the intersections between GenAI and linguistic variation and standardisation. We discuss both how LLMs can be made to deal with non-standard language from a technical perspective, and whether, when or how this can contribute to "democratic and decolonial digital and machine learning strategies", a question with direct policy implications.