This paper introduces two new datasets, WinoMTeus and FLORES+Gender, to evaluate gender bias in machine translation involving Basque, a genderless language. WinoMTeus adapts the WinoMT benchmark to assess how gender-neutral Basque occupations are translated into gendered languages such as Spanish and French, while FLORES+Gender extends FLORES+ to examine whether translation quality into Basque varies with the gender of the referent. Experiments with LLMs and MT systems reveal a systematic preference for masculine forms and, in some models, slightly higher quality for masculine referents, which points to persistent gender bias.
Even when translating to and from a genderless language like Basque, machine translation models systematically favor masculine forms, showing that gender bias remains deeply rooted in these models.
Large language models (LLMs) and machine translation (MT) systems are increasingly used in our daily lives, but their outputs can reproduce gender bias present in the training data. Most resources for evaluating such biases are designed for English and reflect its sociocultural context, which limits their applicability to other languages. This work addresses this gap by introducing two new datasets to evaluate gender bias in translations involving Basque, a low-resource and genderless language. WinoMTeus adapts the WinoMT benchmark to examine how gender-neutral Basque occupations are translated into gendered languages such as Spanish and French. FLORES+Gender, in turn, extends the FLORES+ benchmark to assess whether translation quality varies when translating from gendered languages (Spanish and English) into Basque depending on the gender of the referent. We evaluate several general-purpose LLMs and open and proprietary MT systems. The results reveal a systematic preference for masculine forms and, in some models, a slightly higher quality for masculine referents. Overall, these findings show that gender bias is still deeply rooted in these models, and highlight the need to develop evaluation methods that consider both linguistic features and cultural context.