This paper investigates the safety architecture of OpenAI's GPT-4o mini on multimodal hate speech detection using the Hateful Memes Challenge dataset. The study identifies a "Unimodal Bottleneck," in which context-blind safety filters preempt multimodal reasoning; content policy refusals are triggered in equal measure by unimodal visual and textual content. The authors also show that the safety system is brittle, blocking benign, common meme formats and producing predictable false positives, which highlights a fundamental tension between capability and safety in LMMs.
GPT-4o mini's hate speech detection is hamstrung by a "Unimodal Bottleneck": context-blind safety filters preempt its advanced multimodal reasoning, leading to predictable false positives.
As Large Multimodal Models (LMMs) become integral to daily digital life, understanding their safety architectures is a critical problem for AI Alignment. This paper presents a systematic analysis of OpenAI's GPT-4o mini, a globally deployed model, on the difficult task of multimodal hate speech detection. Using the Hateful Memes Challenge dataset, we conduct a multi-phase investigation on 500 samples to probe the model's reasoning and failure modes. Our central finding is the experimental identification of a "Unimodal Bottleneck," an architectural flaw where the model's advanced multimodal reasoning is systematically preempted by context-blind safety filters. A quantitative validation of 144 content policy refusals reveals that these overrides are triggered in equal measure by unimodal visual (50%) and textual (50%) content. We further demonstrate that this safety system is brittle, blocking not only high-risk imagery but also benign, common meme formats, leading to predictable false positives. These findings expose a fundamental tension between capability and safety in state-of-the-art LMMs, highlighting the need for more integrated, context-aware alignment strategies to ensure AI systems can be deployed both safely and effectively.
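The modality-ablation probe the abstract describes can be illustrated with a minimal sketch against the public OpenAI Python SDK (assuming an `OPENAI_API_KEY` in the environment). The model name `gpt-4o-mini` and the chat-completions message format are real API surface; the prompt, the refusal heuristic, and the helper names (`encode_image`, `ask`, `localize_refusal`) are illustrative assumptions, not the authors' published code.

```python
# Minimal sketch of a modality-ablation probe for content-policy refusals.
# Assumes the OpenAI Python SDK and OPENAI_API_KEY in the environment;
# the prompt, refusal heuristic, and helper names are illustrative only.
import base64
from openai import OpenAI

client = OpenAI()
MODEL = "gpt-4o-mini"
PROMPT = "Is this meme hateful? Answer 'hateful' or 'not hateful' and explain."

def encode_image(path: str) -> str:
    """Base64-encode a local meme image for the vision API."""
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")

def ask(content) -> str:
    """Send one classification request and return the model's reply."""
    resp = client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user", "content": content}],
    )
    return resp.choices[0].message.content or ""

def is_refusal(reply: str) -> bool:
    """Crude heuristic for spotting a content-policy refusal (illustrative)."""
    markers = ("i can't", "i cannot", "unable to assist", "content policy")
    return any(m in reply.lower() for m in markers)

def localize_refusal(image_path: str, caption: str) -> str:
    """Ablate modalities to find which one trips the safety filter."""
    image_part = {
        "type": "image_url",
        "image_url": {"url": f"data:image/png;base64,{encode_image(image_path)}"},
    }
    text_part = {"type": "text", "text": f"{PROMPT}\nMeme text: {caption}"}

    if not is_refusal(ask([text_part, image_part])):
        return "no refusal"           # multimodal reasoning ran normally
    if is_refusal(ask([{"type": "text", "text": PROMPT}, image_part])):
        return "visual trigger"       # image alone is enough to refuse
    if is_refusal(ask([text_part])):
        return "textual trigger"      # caption alone is enough to refuse
    return "multimodal trigger"       # refusal requires both modalities
```

Sending the full meme first and then each modality alone mirrors the logic behind the paper's refusal classification: if either unimodal request already trips the filter, the safety override fired before any cross-modal reasoning could take place.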