Mar 29, 2026

arXiv:2603.27838

ProText: A benchmark dataset for measuring (mis)gendering in long-form texts

Hadas Kotek, Margit Bowler, Patrick Sonnenberg, Yu'an Yang

AI Summary

ProText is introduced as a new dataset designed to evaluate gendering and misgendering biases in long-form text transformations performed by LLMs. The dataset covers theme nouns, theme categories (stereotypically male/female/neutral), and pronoun categories, enabling nuanced analysis of gender bias beyond pronoun resolution. A case study using summarization and rewriting tasks revealed systematic gender biases in LLMs, especially in the absence of explicit gender cues or when models default to heteronormative assumptions.

Key Contribution

LLMs exhibit systematic gender bias and heteronormative assumptions when processing long-form text, even in the absence of explicit gender cues.

Abstract

We introduce ProText, a dataset for measuring gendering and misgendering in stylistically diverse long-form English texts. ProText spans three dimensions: Theme nouns (names, occupations, titles, kinship terms), Theme category (stereotypically male, stereotypically female, gender-neutral/non-gendered), and Pronoun category (masculine, feminine, gender-neutral, none). The dataset is designed to probe (mis)gendering in text transformations such as summarization and rewrites using state-of-the-art Large Language Models, extending beyond traditional pronoun resolution benchmarks and beyond the gender binary. We validate ProText through a mini case study, showing that even with just two prompts and two models, we can draw nuanced insights regarding gender bias, stereotyping, misgendering, and gendering. The case study reveals systematic gender bias, particularly when inputs contain no explicit gender cues or when models default to heteronormative assumptions.
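As a rough illustration of the three dimensions named in the abstract, the sketch below enumerates them as a condition grid and adds a crude check for gendered pronouns that a transformed text introduces. Everything here is an assumption made for illustration: the names `Condition`, `all_conditions`, `PRONOUNS`, `pronoun_categories_in`, and `introduced_categories` are hypothetical, the full crossing of the three dimensions is assumed rather than confirmed by the abstract, and the word-list check is not the authors' evaluation method.

```python
import re
from dataclasses import dataclass
from itertools import product

# Dimension values as listed in the abstract.
THEME_NOUN_TYPES = ("name", "occupation", "title", "kinship term")
THEME_CATEGORIES = ("stereotypically male", "stereotypically female", "gender-neutral")
PRONOUN_CATEGORIES = ("masculine", "feminine", "gender-neutral", "none")


@dataclass(frozen=True)
class Condition:
    """One cell of the (assumed) fully crossed design."""
    noun_type: str
    theme_category: str
    pronoun_category: str


def all_conditions():
    # Assumption: every combination of the three dimensions is a valid cell.
    return [Condition(*combo) for combo in
            product(THEME_NOUN_TYPES, THEME_CATEGORIES, PRONOUN_CATEGORIES)]


# Hypothetical, deliberately small pronoun lexicon. Plural "they" is not
# distinguished from singular "they", so this is only a first-pass signal.
PRONOUNS = {
    "masculine": {"he", "him", "his", "himself"},
    "feminine": {"she", "her", "hers", "herself"},
    "gender-neutral": {"they", "them", "their", "theirs", "themself", "themselves"},
}


def pronoun_categories_in(text):
    """Return the pronoun categories that occur as whole words in text."""
    tokens = set(re.findall(r"[a-z]+", text.lower()))
    return {cat for cat, words in PRONOUNS.items() if tokens & words}


def introduced_categories(source, output):
    """Pronoun categories present in the output but absent from the source."""
    return pronoun_categories_in(output) - pronoun_categories_in(source)


if __name__ == "__main__":
    print(len(all_conditions()))
    print(introduced_categories(
        "The nurse finished the shift and went home.",
        "The nurse finished her shift.",
    ))
```

In this toy usage the source contains no pronouns, so any gendered pronoun in the output counts as introduced gendering; a source with gender-neutral pronouns and an output with masculine ones would instead flag a candidate misgendering. A real evaluation would need coreference information to tie each pronoun to the theme noun.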

Constitutional AI & AI Ethics, Eval Frameworks & Benchmarks, Natural Language Processing

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
