The paper addresses hateful meme detection in Bengali, a low-resource language, by augmenting the existing Bengali Hateful Memes (BHM) dataset with semantically similar samples from the MIMOSA dataset to improve class balance and diversity. It introduces the Enhanced Dual Co-attention Framework (xDORA), which integrates vision and multilingual text encoders using weighted attention pooling, and extends it with retrieval-augmented generation (RAG) for contextual reasoning. On the extended dataset, xDORA fused with RAG reaches macro-average F1-scores of 0.79 for hateful meme identification and 0.74 for target entity detection, improving on the DORA baseline and outperforming LLaVA in few-shot settings.
Supervised, retrieval-augmented models outperform the pretrained vision-language model LLaVA on Bengali hateful meme detection, suggesting that fine-tuning and retrieval both matter for low-resource languages.
Hateful content on social media increasingly appears as multimodal memes that combine images and text to convey harmful narratives. In low-resource languages such as Bengali, automated detection remains challenging due to limited annotated data, class imbalance, and pervasive code-mixing. To address these issues, we augment the Bengali Hateful Memes (BHM) dataset with semantically aligned samples from the Multimodal Aggression Dataset in Bengali (MIMOSA), improving both class balance and semantic diversity. We propose the Enhanced Dual Co-attention Framework (xDORA), integrating vision encoders (CLIP, DINOv2) and multilingual text encoders (XGLM, XLM-R) via weighted attention pooling to learn robust cross-modal representations. Building on these embeddings, we develop a FAISS-based k-nearest neighbor classifier for non-parametric inference and introduce RAG-Fused DORA, which incorporates retrieval-driven contextual reasoning. We further evaluate LLaVA under zero-shot, few-shot, and retrieval-augmented prompting settings. Experiments on the extended dataset show that xDORA (CLIP + XLM-R) achieves macro-average F1-scores of 0.78 for hateful meme identification and 0.71 for target entity detection, while RAG-Fused DORA improves performance to 0.79 and 0.74, yielding gains over the DORA baseline. The FAISS-based classifier performs competitively and demonstrates robustness for rare classes through semantic similarity modeling. In contrast, LLaVA exhibits limited effectiveness in few-shot settings, with only modest improvements under retrieval augmentation, highlighting constraints of pretrained vision-language models for code-mixed Bengali content without fine-tuning. These findings demonstrate the effectiveness of supervised, retrieval-augmented, and non-parametric multimodal frameworks for addressing linguistic and cultural complexities in low-resource hate speech detection.
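The non-parametric inference step mentioned in the abstract can be illustrated with a small nearest-neighbor sketch. This is not the authors' implementation: it replaces FAISS with a brute-force pure-Python search so the example has no dependencies, and the cosine similarity metric, the similarity-weighted vote, the value of `k`, the function names, and the two-dimensional toy embeddings are all assumptions made for illustration.

```python
import math
from collections import Counter


def l2_normalize(vec):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def knn_predict(query, index_vectors, index_labels, k=3):
    """Label a query embedding by a similarity-weighted vote of its k nearest neighbors.

    Assumption: brute-force cosine search stands in for a FAISS index here.
    """
    q = l2_normalize(query)
    scored = []
    for vec, label in zip(index_vectors, index_labels):
        v = l2_normalize(vec)
        similarity = sum(a * b for a, b in zip(q, v))
        scored.append((similarity, label))

    # Keep the k most similar stored embeddings.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    neighbors = scored[:k]

    # Each neighbor votes for its label, weighted by its similarity to the query.
    votes = Counter()
    for similarity, label in neighbors:
        votes[label] += similarity
    return votes.most_common(1)[0][0]


# Toy stand-ins for fused image-text embeddings (hypothetical values).
index_vectors = [[1.0, 0.1], [0.9, 0.2], [0.1, 1.0], [0.2, 0.9]]
index_labels = ["hateful", "hateful", "non-hateful", "non-hateful"]

prediction = knn_predict([1.0, 0.0], index_vectors, index_labels, k=3)
```

Because the prediction depends only on stored examples close to the query, a rare class can still be recovered when a few of its samples lie nearby in the embedding space, which is consistent with the robustness for rare classes that the abstract attributes to semantic similarity modeling.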