This paper investigates the safety architecture of OpenAI's GPT-4o mini on multimodal hate speech detection using the Hateful Memes Challenge dataset. The study identifies a "Unimodal Bottleneck," in which context-blind safety filters preempt multimodal reasoning; content policy refusals are triggered in equal measure by unimodal visual and textual content. The authors also show that the safety system is brittle, blocking benign, common meme formats and producing predictable false positives, which highlights a fundamental tension between capability and safety in LMMs.
GPT-4o mini's hate speech detection is hamstrung by a "Unimodal Bottleneck": context-blind safety filters preempt its advanced multimodal reasoning, leading to predictable false positives.
As Large Multimodal Models (LMMs) become integral to daily digital life, understanding their safety architectures is a critical problem for AI Alignment. This paper presents a systematic analysis of OpenAI's GPT-4o mini, a globally deployed model, on the difficult task of multimodal hate speech detection. Using the Hateful Memes Challenge dataset, we conduct a multi-phase investigation on 500 samples to probe the model's reasoning and failure modes. Our central finding is the experimental identification of a "Unimodal Bottleneck," an architectural flaw where the model's advanced multimodal reasoning is systematically preempted by context-blind safety filters. A quantitative validation of 144 content policy refusals reveals that these overrides are triggered in equal measure by unimodal visual (50%) and textual (50%) content. We further demonstrate that this safety system is brittle, blocking not only high-risk imagery but also benign, common meme formats, leading to predictable false positives. These findings expose a fundamental tension between capability and safety in state-of-the-art LMMs, highlighting the need for more integrated, context-aware alignment strategies to ensure AI systems can be deployed both safely and effectively.
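The modality-ablation probe the abstract describes can be illustrated with a minimal sketch against the public OpenAI Python SDK (assuming an `OPENAI_API_KEY` in the environment). The model name `gpt-4o-mini` and the chat-completions message format are real API surface; the prompt, the refusal heuristic, and the helper names (`encode_image`, `ask`, `localize_refusal`) are illustrative assumptions, not the authors' published code.

```python
# Minimal sketch of a modality-ablation probe for content-policy refusals.
# Assumes the OpenAI Python SDK and OPENAI_API_KEY in the environment;
# the prompt, refusal heuristic, and helper names are illustrative only.
import base64
from openai import OpenAI

client = OpenAI()
MODEL = "gpt-4o-mini"
PROMPT = "Is this meme hateful? Answer 'hateful' or 'not hateful' and explain."

def encode_image(path: str) -> str:
    """Base64-encode a local meme image for the vision API."""
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")

def ask(content) -> str:
    """Send one classification request and return the model's reply."""
    resp = client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user", "content": content}],
    )
    return resp.choices[0].message.content or ""

def is_refusal(reply: str) -> bool:
    """Crude heuristic for spotting a content-policy refusal (illustrative)."""
    markers = ("i can't", "i cannot", "unable to assist", "content policy")
    return any(m in reply.lower() for m in markers)

def localize_refusal(image_path: str, caption: str) -> str:
    """Ablate modalities to find which one trips the safety filter."""
    image_part = {
        "type": "image_url",
        "image_url": {"url": f"data:image/png;base64,{encode_image(image_path)}"},
    }
    text_part = {"type": "text", "text": f"{PROMPT}\nMeme text: {caption}"}

    if not is_refusal(ask([text_part, image_part])):
        return "no refusal"           # multimodal reasoning ran normally
    if is_refusal(ask([{"type": "text", "text": PROMPT}, image_part])):
        return "visual trigger"       # image alone is enough to refuse
    if is_refusal(ask([text_part])):
        return "textual trigger"      # caption alone is enough to refuse
    return "multimodal trigger"       # refusal requires both modalities
```

Sending the full meme first and then each modality alone mirrors the logic behind the paper's refusal classification: if either unimodal request already trips the filter, the safety override fired before any cross-modal reasoning could take place.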