The paper investigates how well seven state-of-the-art language models understand slang in Indian English (en-IN) and Australian English (en-AU). The authors construct two datasets, \textsc{web} (web-sourced slang usages) and \textsc{gen} (synthetically generated slang usages), and evaluate the models on three tasks: target word prediction (TWP), guided target word prediction (TWP$^*$), and target word selection (TWS). Models perform better on TWS than on TWP and TWP$^*$, better on \textsc{web} than on \textsc{gen}, and better on en-IN than on en-AU, highlighting an asymmetry between generative and discriminative performance on variety-specific slang.
LLMs struggle with slang, especially Australian English slang, revealing surprising gaps in their language understanding despite being trained on vast amounts of English text.
Language models exhibit systematic performance gaps when processing text in non-standard language varieties, yet their ability to comprehend variety-specific slang remains underexplored for several languages. We present a comprehensive evaluation of slang awareness in Indian English (en-IN) and Australian English (en-AU) across seven state-of-the-art language models. We construct two complementary datasets: \textsc{web}, containing 377 web-sourced usage examples from Urban Dictionary, and \textsc{gen}, featuring 1,492 synthetically generated usages of these slang terms across diverse scenarios. We assess language models on three tasks: target word prediction (TWP), guided target word prediction (TWP$^*$), and target word selection (TWS). Our results reveal the following key findings: (1) average model performance is higher on TWS than on TWP and TWP$^*$, with the average accuracy score rising from 0.03 to 0.49; (2) average model performance is stronger on \textsc{web} than on \textsc{gen}, with the average similarity score higher by 0.03 on TWP and by 0.05 on TWP$^*$; (3) en-IN tasks outperform en-AU tasks when averaged across all models and datasets, with TWS showing the largest disparity: average accuracy rises from 0.44 on en-AU to 0.54 on en-IN. These findings underscore fundamental asymmetries between generative and discriminative competencies for variety-specific language, particularly for slang expressions, even in a technologically rich language such as English.
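To make the distinction between the generative and discriminative tasks concrete, here is a minimal sketch of how accuracy could be scored for free-form prediction (as in TWP) versus selection from candidates (as in TWS). It is not the authors' evaluation protocol: the `[MASK]` placeholder, the `SlangItem` structure, the exact-match normalisation, the two toy items, and the stub "models" are all assumptions made for illustration, and the similarity score reported for TWP and TWP$^*$ is not modelled here.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class SlangItem:
    """Hypothetical item: a usage with the slang term masked out."""
    context: str             # usage sentence with the term replaced by [MASK]
    target: str              # the masked slang term
    options: Sequence[str]   # candidates shown only in the selection task


def normalise(word: str) -> str:
    # Assumed normalisation: case-insensitive exact match.
    return word.strip().lower()


def twp_accuracy(items: Sequence[SlangItem],
                 predict: Callable[[str], str]) -> float:
    """Generative setting: the model must produce the term unaided."""
    hits = sum(normalise(predict(it.context)) == normalise(it.target)
               for it in items)
    return hits / len(items)


def tws_accuracy(items: Sequence[SlangItem],
                 select: Callable[[str, Sequence[str]], str]) -> float:
    """Discriminative setting: the model picks the term from given options."""
    hits = sum(normalise(select(it.context, it.options)) == normalise(it.target)
               for it in items)
    return hits / len(items)


# Toy items written for this sketch; they are not drawn from the paper's datasets.
items = [
    SlangItem("See you this [MASK], mate.", "arvo",
              ("arvo", "brekkie", "servo")),
    SlangItem("Can we [MASK] the meeting from Wednesday to Monday?", "prepone",
              ("timepass", "prepone", "yaar")),
]

# Stub "models": one answers with a standard-English paraphrase,
# the other always picks the first candidate.
paraphrases = {items[0].context: "afternoon", items[1].context: "move"}


def free_form_stub(context: str) -> str:
    return paraphrases[context]


def first_option_stub(context: str, options: Sequence[str]) -> str:
    return options[0]


twp = twp_accuracy(items, free_form_stub)
tws = tws_accuracy(items, first_option_stub)
print(f"TWP accuracy: {twp:.2f}, TWS accuracy: {tws:.2f}")
```

The free-form stub gives a sensible standard-English answer both times yet scores zero under exact match, while even a trivial selection strategy gets partial credit. This is one reason accuracy on a selection task can sit well above accuracy on a free-form prediction task.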