Birmingham, Exeter, University of Liverpool
Dec 2, 2025
arXiv:2512.02743

Reasoning-Aware Multimodal Fusion for Hateful Video Detection

Shuonan Yang, Tailin Chen, Jiangbei Yue, Guangliang Cheng, Jianbo Jiao, Zeyu Fu

AI Summary

The paper introduces Reasoning-Aware Multimodal Fusion (RAMF), a novel framework for detecting hateful content in online videos by explicitly modeling semantic relationships between modalities and reasoning about nuanced hateful intent. RAMF employs Local-Global Context Fusion (LGCF) and Semantic Cross Attention (SCA) to capture local and global contexts and enable fine-grained multimodal interaction. The framework also uses adversarial reasoning to generate objective descriptions, hate-assumed inferences, and non-hate-assumed inferences, enriching contextual understanding.
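To make the idea of cross-modal attention concrete, here is a minimal sketch of generic scaled dot-product cross attention, in which tokens from one modality query features from another. It is not the paper's SCA module, whose internals are not given on this page. The function names, the toy vectors, and the omission of learned projection matrices are all assumptions made for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Generic scaled dot-product cross attention (no learned projections).

    queries: vectors from one modality (e.g. text tokens).
    keys, values: vectors from another modality (e.g. video frames).
    Returns one attended vector per query.
    """
    dim = len(keys[0])
    attended = []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(dim).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim) for k in keys]
        weights = softmax(scores)
        # Weighted average of the value vectors.
        attended.append(
            [sum(w * v[j] for w, v in zip(weights, values)) for j in range(len(values[0]))]
        )
    return attended


# Hypothetical toy features: 2 text tokens attend over 3 video frames.
text_tokens = [[1.0, 0.0], [0.0, 1.0]]
video_frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
fused = cross_attention(text_tokens, video_frames, video_frames)
```

Each output row is a convex combination of the frame vectors, weighted by how strongly the corresponding text token matches each frame. A practical module would add learned query, key, and value projections and multiple heads.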

Key Contribution

By generating contrasting inferences about hatefulness, a new adversarial reasoning approach significantly boosts hateful video detection, outperforming existing methods by up to 7% in hate class recall.

Abstract

Hate speech in online videos poses an increasingly serious threat to digital platforms, especially as video content becomes more multimodal and context-dependent. Existing methods often struggle to effectively fuse the complex semantic relationships between modalities and lack the ability to understand nuanced hateful content. To address these issues, we propose an innovative Reasoning-Aware Multimodal Fusion (RAMF) framework. To tackle the first challenge, we design Local-Global Context Fusion (LGCF) to capture both local salient cues and global temporal structures, and propose Semantic Cross Attention (SCA) to enable fine-grained multimodal semantic interaction. To tackle the second challenge, we introduce adversarial reasoning, a structured three-stage process in which a vision-language model generates (i) objective descriptions, (ii) hate-assumed inferences, and (iii) non-hate-assumed inferences. These outputs provide complementary semantic perspectives that enrich the model's contextual understanding of nuanced hateful intent. Evaluations on two real-world hateful video datasets demonstrate that our method achieves robust generalisation performance, improving upon state-of-the-art methods by 3% and 7% in Macro-F1 and hate class recall, respectively. We will release the code after the anonymity period ends.
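The two reported metrics, Macro-F1 and hate class recall, can be computed as in the sketch below. The labels and predictions are made-up toy values, not results from the paper, and the function names are illustrative; the code assumes a binary setup with 1 for hate and 0 for non-hate.

```python
def class_counts(y_true, y_pred, label):
    """Return (true positives, false positives, false negatives) for one class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    return tp, fp, fn


def recall(y_true, y_pred, label):
    """Fraction of items of the given class that were predicted as that class."""
    tp, _, fn = class_counts(y_true, y_pred, label)
    return tp / (tp + fn) if (tp + fn) else 0.0


def f1(y_true, y_pred, label):
    """Per-class F1, written as 2TP / (2TP + FP + FN)."""
    tp, fp, fn = class_counts(y_true, y_pred, label)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_f1(y_true, y_pred, labels):
    """Unweighted mean of per-class F1, so the minority class counts equally."""
    return sum(f1(y_true, y_pred, label) for label in labels) / len(labels)


# Toy data (hypothetical): 1 = hate, 0 = non-hate.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 1]

hate_recall = recall(y_true, y_pred, 1)
macro = macro_f1(y_true, y_pred, [0, 1])
```

Macro-F1 averages the per-class scores without weighting by class size, which is why it is commonly paired with hate class recall on imbalanced datasets: a model that ignores the minority hate class is penalised on both.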

Computer Vision, Multimodal Models, Reasoning & Chain-of-Thought

Citation Metrics

Citations: 0

Influential citations: 0

References: 44

Year: 2025

Venue: arXiv.org
