KU, NYU, Oxford, Princeton, Weizenbaum Institute, World Bank | Apr 14, 2026 | arXiv:2604.12289

The Enforcement and Feasibility of Hate Speech Moderation on Twitter

Manuel Tonneau, Dylan Thurgood, Diyi Liu, Niyati Malhotra, Victor Orozco-Olvera, Ralph Schroeder, Scott A. Hale, Manoel Horta Ribeiro, Paul Röttger, Samuel P. Fraiberger

AI Summary

This paper presents a large-scale audit of hate speech moderation on Twitter (now X) across eight languages, finding that 80% of hateful tweets remain online after five months. Surprisingly, neither the severity nor visibility of hateful tweets significantly increased their likelihood of removal. Through simulations of human-AI moderation pipelines, the authors demonstrate that reducing user exposure to hate speech is economically feasible, suggesting that the persistence of hate speech is due to resource allocation choices rather than technical limitations.

Key Contribution

Twitter's hate speech policies are failing, with hateful content no more likely to be removed than innocuous tweets, even when explicitly violent.

Abstract

Online hate speech is associated with substantial social harms, yet it remains unclear how consistently platforms enforce hate speech policies or whether enforcement is feasible at scale. We address these questions through a global audit of hate speech moderation on Twitter (now X). Using a complete 24-hour snapshot of public tweets, we construct representative samples comprising 540,000 tweets annotated for hate speech by trained annotators across eight major languages. Five months after posting, 80% of hateful tweets remain online, including explicitly violent hate speech. Such tweets are no more likely to be removed than non-hateful tweets, with neither severity nor visibility increasing the likelihood of removal. We then examine whether these enforcement gaps reflect technical limits of large-scale moderation systems. While fully automated detection systems cannot reliably identify hate speech without generating large numbers of false positives, they effectively prioritize likely violations for human review. Simulations of a human-AI moderation pipeline indicate that substantially reducing user exposure to hate speech is economically feasible at a cost below existing regulatory penalties. These results suggest that the persistence of online hate cannot be explained by technical constraints alone but also reflects institutional choices in the allocation of moderation resources.
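The human-AI pipeline described in the abstract can be illustrated with a minimal sketch: a classifier scores tweets, the highest-scoring ones go to human reviewers up to a fixed budget, and only tweets confirmed as hateful are removed. Everything below is an assumption made for illustration, not the authors' simulation: the `Tweet` fields, the `simulate_review` function, the toy data, the per-review cost, and the simplification that human reviewers label every tweet correctly.

```python
from dataclasses import dataclass


@dataclass
class Tweet:
    score: float      # hypothetical classifier score in [0, 1]
    views: int        # hypothetical number of times the tweet was seen
    is_hateful: bool  # label a (here, assumed perfect) human reviewer would assign


def simulate_review(tweets, review_budget, cost_per_review):
    """Rank tweets by classifier score, send the top ones to human review,
    and remove only those the reviewer confirms as hateful."""
    ranked = sorted(tweets, key=lambda t: t.score, reverse=True)
    reviewed = ranked[:review_budget]

    total_hate_views = sum(t.views for t in tweets if t.is_hateful)
    removed_hate_views = sum(t.views for t in reviewed if t.is_hateful)

    # Share of all views of hateful tweets that the removals account for.
    exposure_reduction = (
        removed_hate_views / total_hate_views if total_hate_views else 0.0
    )
    return {
        "reviewed": len(reviewed),
        "cost": len(reviewed) * cost_per_review,
        "exposure_reduction": exposure_reduction,
    }


# Toy data, invented purely for illustration.
toy = [
    Tweet(score=0.95, views=1000, is_hateful=True),
    Tweet(score=0.90, views=50, is_hateful=False),   # false positive, kept online
    Tweet(score=0.80, views=500, is_hateful=True),
    Tweet(score=0.30, views=500, is_hateful=True),   # missed: below the budget cutoff
    Tweet(score=0.10, views=2000, is_hateful=False),
]

result = simulate_review(toy, review_budget=3, cost_per_review=0.5)
print(result)
```

In this sketch the classifier never removes anything on its own, so its false positives cost reviewer time rather than causing wrongful removals. Ranking by score alone is one possible choice; a variant could rank by score weighted by views to target exposure more directly.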

Constitutional AI & AI Ethics, Natural Language Processing

Citation Metrics

Citations: 0

Influential citations: 0

References: 56

Year: 2026

Venue: N/A
