Department of Data Science, Institut Teknologi Sumatera, Lampung

May 6, 2026

arXiv:2605.04885

A Comparative Study of PyCaret AutoML and CNN-BiLSTM for Binary Hate Speech Detection in Indonesian Twitter

Tanty Widiyastuti, Mayada, Adisty Syawalda Ariyanto, Luluk Muthoharoh, Ardika Satria, Martin Clinton Tosima Manullang

AI Summary

This paper benchmarks PyCaret AutoML against a CNN-BiLSTM architecture for binary hate speech detection in Indonesian Twitter, using a shared preprocessing pipeline so that differences in results reflect the models rather than the data preparation. The CNN-BiLSTM model, which learns token embeddings and uses bidirectional context, outperformed the best PyCaret AutoML model (Random Forest) by 6.6 percentage points in accuracy and 4.2 percentage points in F1-score. The analysis indicates that the dataset is difficult because the texts are short, the classes are moderately imbalanced, and many decisions depend on local lexical cues combined with short contextual composition.

Key Contribution

CNN-BiLSTM beats AutoML for Indonesian hate speech detection, but the gains are modest, suggesting the dataset's limitations are a bigger bottleneck than model architecture.

Abstract

This paper compares a PyCaret AutoML branch and a CNN-BiLSTM branch for binary hate speech detection on Indonesian Twitter using the HS label from the corpus of Ibrohim and Budi. Both branches share the same preprocessing pipeline so that the comparison reflects modelling differences rather than inconsistent data preparation. The conventional branch uses TF-IDF with a lexicon-based abusive-word count, whereas the neural branch learns dense token representations and captures both local phrase patterns and bidirectional context. The benchmark is built from the released 13,130-row annotation table, whose HS label yields a 58:42 class ratio. On the held-out split, CNN-BiLSTM achieves the best result with 83.8% accuracy, 79.8% precision, 82.7% recall, and 81.2% F1-score. Within the PyCaret branch, Random Forest is the strongest conventional model with 77.2% accuracy and 77.0% F1-score. The neural branch therefore improves accuracy by 6.6 points and F1-score by 4.2 points. Exploratory corpus analysis, learning curves, and confusion matrices show that the dataset is short-text, moderately imbalanced, and still difficult because many decisions depend on local lexical cues plus short contextual composition. The study concludes that PyCaret AutoML is an effective conventional benchmarking framework, whereas CNN-BiLSTM is the stronger end model for the reported benchmark setting.
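The figures quoted in the abstract can be cross-checked with a few lines of arithmetic. The sketch below is not taken from the paper: the variable names are illustrative, and it assumes that the reported precision, recall, and F1-score all refer to the same class and the same averaging scheme, which the abstract does not state explicitly.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (same units for both inputs)."""
    return 2 * precision * recall / (precision + recall)


# Held-out results as reported in the abstract, in percent.
# Dictionary names are illustrative, not identifiers from the paper.
cnn_bilstm = {"accuracy": 83.8, "precision": 79.8, "recall": 82.7, "f1": 81.2}
random_forest = {"accuracy": 77.2, "f1": 77.0}

# Assumption: precision and recall describe the same class as the reported F1.
recomputed_f1 = f1_score(cnn_bilstm["precision"], cnn_bilstm["recall"])

# Gains are differences between percentages, i.e. percentage points.
accuracy_gain = cnn_bilstm["accuracy"] - random_forest["accuracy"]
f1_gain = cnn_bilstm["f1"] - random_forest["f1"]

print(f"F1 recomputed from precision and recall: {recomputed_f1:.1f}")
print(f"Accuracy gain: {accuracy_gain:.1f} points")
print(f"F1 gain: {f1_gain:.1f} points")
```

Under that assumption the recomputed F1-score rounds to the reported 81.2, and the two gains round to 6.6 and 4.2. Both gains are absolute differences in percentage points, not relative improvements.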

Architecture Design (Transformers, SSMs, MoE)

Natural Language Processing

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
