Hamburg | JobMatchMe GmbH | May 28, 2026 | arXiv:2605.30214

GRUFF: LLM Pronoun Fidelity, Reasoning, and Biases in German

AI Summary

The paper introduces GRUFF, a large-scale dataset for evaluating pronoun fidelity in German that covers four gender agreement systems in nouns and four sets of pronouns. Experiments with GRUFF show that LLMs exhibit strong grammatical agreement for masculine and feminine entities when no explicit context is given, but not for the neopronouns xier and en. The study also finds that occupational stereotypes are poorly correlated across grammatical cases and across most models, except models with closely related architectures.

Key Contribution

GRUFF shows that LLMs agree grammatically with masculine and feminine entities in German but not with the neopronouns xier and en, revealing a gap in their ability to handle gender-inclusive language.

Abstract

Third-person singular pronouns have long been used to study stereotypical biases in language models and to test their abilities to reason about reference. More recently, the interplay between reasoning and bias has been investigated with the task of pronoun fidelity, which assesses models' abilities to correctly reuse a previously specified pronoun for a discourse entity, independent of other potentially distracting discourse entities mentioned in between. However, such research focuses on English, which is a language with limited grammatical gender and almost no gender agreement. In this paper we contribute a novel, large-scale dataset, GRUFF, to measure pronoun fidelity in German, covering four different gender agreement systems in nouns, and four sets of pronouns. With this dataset, we show that LLMs show strong grammatical agreement for masculine and feminine entities in the absence of explicit context, but not for neopronouns xier and en. Models are generally not robust to distractors, but encoder-only models are more robust in German than in English, reflecting the importance of grammatical gender. Finally, we show that occupational stereotypes in this context are poorly correlated across grammatical cases, and across most models, except ones with closely related architectures. We release all code and data to encourage further work on gender-inclusive language and referential reasoning in German.
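
The pronoun fidelity task described in the abstract can be illustrated with a minimal scoring sketch. It assumes each test item records the pronoun introduced for the target entity, the pronoun a model chose when referring back to that entity, and the number of distractor sentences in between. The `Item` class, the `fidelity_by_group` helper, and the toy predictions are all hypothetical: they are not taken from the GRUFF code release and are not results from the paper.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Item:
    """Hypothetical record for one pronoun fidelity test item."""
    expected: str       # pronoun introduced for the target entity
    predicted: str      # pronoun the model chose when referring back
    n_distractors: int  # distractor sentences between introduction and reuse


def fidelity_by_group(items, key):
    """Return accuracy (share of items where predicted == expected) per group."""
    hits, totals = defaultdict(int), defaultdict(int)
    for item in items:
        group = key(item)
        totals[group] += 1
        hits[group] += int(item.predicted == item.expected)
    return {group: hits[group] / totals[group] for group in totals}


# Toy, made-up predictions purely to exercise the function.
items = [
    Item("er", "er", 0),
    Item("sie", "sie", 0),
    Item("xier", "er", 0),
    Item("en", "en", 1),
    Item("sie", "er", 1),
    Item("xier", "xier", 1),
]

# Accuracy grouped by the pronoun that should have been reused.
by_pronoun = fidelity_by_group(items, key=lambda item: item.expected)
# Accuracy grouped by how many distractor sentences intervened.
by_distractors = fidelity_by_group(items, key=lambda item: item.n_distractors)

print(by_pronoun)
print(by_distractors)
```

Grouping by the expected pronoun separates performance on masculine, feminine, and neopronoun sets, while grouping by distractor count gives a rough view of robustness to intervening entities.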

Constitutional AI & AI Ethics | Eval Frameworks & Benchmarks | Reasoning & Chain-of-Thought

Citation Metrics

Citations: 0

Influential citations: 0

References: 59

Year: 2026

Venue: N/A
