This paper audits four pretraining filters and three inference-time guardrails used with language models and finds that they produce significant epistemic erasure, particularly of marginalized groups. The systems' decisions track blocklist-based lexical cues: they over-flag content mentioning transgender people, women, and Central Americans while frequently missing private information and explicit hate speech. Human annotators would retain 88.5% of filter-flagged and 91.3% of guardrail-flagged content, which points to a gap between automated decisions and human judgment about representational harms.
Pretraining filters and inference-time guardrails disproportionately flag content mentioning marginalized groups, especially transgender people and women, systematically erasing those mentions from what language models learn and produce.
Modern language models rely on pretraining filters to remove undesirable content from training corpora and on inference-time guardrails to suppress undesirable outputs during deployment. In this paper, we examine how these filtering and moderation decisions produce forms of epistemic erasure and reveal tensions both across automated systems and between these systems and human judgment. We audit four pretraining filters and three inference-time guardrails on Common Crawl sentences containing gender and regional-origin mentions, together with a manually annotated subset of 500 sentences. Our analysis shows that filtering and guardrail decisions are strongly associated with blocklist-based lexical cues, while the systems frequently fail to flag content containing private information or explicit hate speech. At the same time, marginalized groups, particularly transgender people, women, and Central Americans, are significantly over-flagged across systems. Human annotators, by contrast, would retain 88.5% of filter-flagged and 91.3% of guardrail-flagged content, often recognizing representational harms of content removal that current systems fail to capture. Taken together, our findings document a form of epistemic erasure in which mentions of marginalized groups are disproportionately removed before pretraining and then suppressed again at inference time.
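The two quantities the abstract reports, per-group flag rates and the share of flagged content that human annotators would retain, can be illustrated with a minimal Python sketch. It is not the paper's code: the function names, the record layout, and the toy data are all hypothetical, and the paper's actual statistical tests are not reproduced here.

```python
from collections import defaultdict


def flag_rates(records):
    """Return {group: share of sentences flagged} from (group, flagged) pairs."""
    totals = defaultdict(int)
    flagged = defaultdict(int)
    for group, is_flagged in records:
        totals[group] += 1
        flagged[group] += int(is_flagged)
    return {group: flagged[group] / totals[group] for group in totals}


def human_retention_rate(annotations):
    """Share of system-flagged sentences that annotators would keep.

    `annotations` holds (system_flagged, human_would_keep) pairs.
    """
    kept = [keep for system_flagged, keep in annotations if system_flagged]
    return sum(kept) / len(kept) if kept else float("nan")


# Toy, made-up data for illustration only (not from the paper).
records = [
    ("group_a", True), ("group_a", True), ("group_a", False),
    ("group_b", False), ("group_b", True), ("group_b", False), ("group_b", False),
]
annotations = [(True, True), (True, True), (True, False), (False, True)]

rates = flag_rates(records)
retention = human_retention_rate(annotations)
```

In this framing, a group is over-flagged when its flag rate is markedly higher than that of comparable groups, and a high retention rate means annotators disagree with most removals.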