Mar 5, 2026 · arXiv:2603.05057

MUTEX: Leveraging Multilingual Transformers and Conditional Random Fields for Enhanced Urdu Toxic Span Detection

Inayat Arshad, Fajar Saleem, Ijaz Hussain

AI Summary

The paper introduces MUTEX, a multilingual transformer-CRF framework for Urdu toxic span detection, addressing the limitations of sentence-level classification in identifying specific toxic spans. MUTEX leverages XLM-RoBERTa with a CRF layer for sequence labeling, trained on a newly created, manually annotated token-level dataset. The framework achieves a 60% token-level F1 score, establishing the first supervised baseline for this task and demonstrating the effectiveness of transformers in capturing contextual toxicity in Urdu.
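To illustrate what a CRF layer adds on top of per-token encoder scores, the following is a minimal sketch of Viterbi decoding in plain Python. It is not the authors' implementation: the BIO tag set (`O`, `B-TOX`, `I-TOX`), the `viterbi` function, and all scores are hypothetical, chosen only to show how transition scores can rule out an invalid tag sequence that independent per-token argmax would produce.

```python
from typing import List, Sequence

# Hypothetical BIO tag set; the source does not state the labeling scheme.
LABELS = ["O", "B-TOX", "I-TOX"]


def viterbi(emissions: Sequence[Sequence[float]],
            transitions: Sequence[Sequence[float]]) -> List[int]:
    """Return the highest-scoring sequence of label indices.

    emissions[t][j]   : score of label j at token t (e.g. from an encoder).
    transitions[i][j] : score of moving from label i to label j.
    """
    if not emissions:
        return []
    n_labels = len(emissions[0])
    score = list(emissions[0])          # best score ending in each label
    backpointers: List[List[int]] = []  # best previous label per step
    for t in range(1, len(emissions)):
        new_score, pointers = [], []
        for j in range(n_labels):
            best_i = max(range(n_labels),
                         key=lambda i: score[i] + transitions[i][j])
            new_score.append(score[best_i] + transitions[best_i][j]
                             + emissions[t][j])
            pointers.append(best_i)
        score = new_score
        backpointers.append(pointers)
    # Follow the back-pointers from the best final label.
    best = max(range(n_labels), key=lambda j: score[j])
    path = [best]
    for pointers in reversed(backpointers):
        best = pointers[best]
        path.append(best)
    path.reverse()
    return path


# Illustrative scores only: forbid the invalid transition O -> I-TOX.
NEG = -1e4
transitions = [[0.0, 0.0, NEG],
               [0.0, 0.0, 0.0],
               [0.0, 0.0, 0.0]]
emissions = [[2.0, 0.0, 0.0],
             [0.0, 1.0, 1.5],
             [0.0, 0.0, 2.0]]
decoded = [LABELS[i] for i in viterbi(emissions, transitions)]
```

With these made-up scores, per-token argmax would tag the second token `I-TOX` directly after `O`; the transition penalty steers decoding to a well-formed `B-TOX` start instead.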

Key Contribution

A new model, MUTEX, achieves a 60% token-level F1 score on Urdu toxic span detection, providing the first supervised baseline for this task in a challenging low-resource language.

Abstract

Urdu toxic span detection remains limited because most existing systems rely on sentence-level classification and fail to identify the specific toxic spans within the text. The problem is further exacerbated by several factors: the lack of token-level annotated resources, the linguistic complexity of Urdu, frequent code-switching, informal expressions, and rich morphological variation. In this research, we propose MUTEX, a framework for Urdu toxic span detection that combines a multilingual transformer with conditional random fields (CRF) and uses a manually annotated token-level toxic span dataset to improve performance and interpretability. MUTEX uses XLM-RoBERTa with a CRF layer to perform sequence labeling and is tested on multi-domain data extracted from social media, online news, and YouTube reviews, using token-level F1 to evaluate fine-grained span detection. The results indicate that MUTEX achieves a 60% token-level F1 score, which is the first supervised baseline for Urdu toxic span detection. Further examination reveals that transformer-based models are more effective than other models at implicitly capturing contextual toxicity and are better able to handle code-switching and morphological variation.
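As a rough illustration of the token-level F1 metric mentioned in the abstract, the sketch below scores binary per-token labels (1 = toxic, 0 = non-toxic) for a single sentence. The function name `token_f1`, the binary encoding, and the handling of the empty case are assumptions made for this example; the paper's exact metric definition (for instance, how scores are averaged across sentences) is not given here.

```python
from typing import Sequence


def token_f1(gold: Sequence[int], pred: Sequence[int]) -> float:
    """F1 of the toxic class over aligned token labels (1 = toxic)."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    tp = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 1)
    fp = sum(1 for g, p in zip(gold, pred) if g == 0 and p == 1)
    fn = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 0)
    if tp == 0:
        # Assumed convention: no toxic tokens in either gold or prediction
        # counts as a perfect match; any other zero-overlap case scores 0.
        return 1.0 if fp == 0 and fn == 0 else 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Made-up labels for a five-token sentence: one toxic token is found,
# one is missed, and one clean token is wrongly flagged.
gold = [0, 1, 1, 0, 0]
pred = [0, 1, 0, 1, 0]
score = token_f1(gold, pred)
```

Here precision and recall are both 1/2, so the sentence scores 0.5. Unlike sentence-level accuracy, this measure penalizes both missed toxic tokens and clean tokens that are wrongly flagged.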

Architecture Design (Transformers, SSMs, MoE) · Natural Language Processing · Open-Source Models & Weights

Citation Metrics

Citations: 0

Influential citations: 0

References: 78

Year: 2026

Venue: N/A
