QUT | Feb 19, 2026 | arXiv:2602.17051

Evaluating Cross-Lingual Classification Approaches Enabling Topic Discovery for Multilingual Social Media Data

Deepak Uniyal, Md Abul Bashar, Richi Nayak

AI Summary

This paper investigates four cross-lingual text classification approaches for filtering relevant content from a multilingual social media dataset of over 9 million tweets in English, Japanese, Hindi, and Korean, collected with hydrogen-energy keywords. The approaches are translating annotated English data into the target languages, translating unlabelled data into English, applying English fine-tuned multilingual transformers directly, and a hybrid strategy. The study evaluates each approach's ability to filter hydrogen-related tweets and then performs topic modeling to extract dominant themes, revealing trade-offs between translation and multilingual methods for cross-lingual social media analysis.

Key Contribution

Multilingual transformers fine-tuned on English data can effectively filter relevant content from noisy, keyword-based social media datasets in other languages, rivaling methods that rely on costly translation.

Abstract

Analysing multilingual social media discourse remains a major challenge in natural language processing, particularly when large-scale public debates span diverse languages. This study investigates how different approaches to cross-lingual text classification can support reliable analysis of global conversations. Using hydrogen energy as a case study, we analyse a decade-long dataset of over nine million tweets in English, Japanese, Hindi, and Korean (2013–2022) for topic discovery. Because the data were collected online by keyword, they contain a significant amount of irrelevant content. We explore four approaches to filtering relevant content: (1) translating English annotated data into the target languages to build a language-specific model for each target language, (2) translating unlabelled data from all languages into English to build a single model based on English annotations, (3) applying multilingual transformers fine-tuned on English directly to each target language's data, and (4) a hybrid strategy that combines translated annotations with multilingual training. Each approach is evaluated for its ability to filter hydrogen-related tweets from noisy keyword-based collections. Subsequently, topic modeling is performed to extract dominant themes within the relevant subsets. The results highlight key trade-offs between translation and multilingual approaches, offering actionable insights into optimising cross-lingual pipelines for large-scale social media analysis.
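The four filtering approaches differ mainly in what the classifier is trained on and in whether a tweet is translated before prediction. The sketch below illustrates only that data flow; it is not the authors' implementation. The strategy names (`translate_train`, `translate_test`, `zero_shot`, `hybrid`), the `translate` callable, and the choice to pool the original English data into the hybrid set are all assumptions made for illustration.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical types: any machine-translation function and any labelled tweet.
Translator = Callable[[str, str, str], str]  # (text, src_lang, tgt_lang) -> text
Example = Tuple[str, int]                    # (tweet text, 1 = relevant, 0 = irrelevant)


def training_sets(
    english_labelled: List[Example],
    target_langs: List[str],
    translate: Translator,
) -> Dict[str, Dict[str, List[Example]]]:
    """Return, per strategy, the labelled data each model would be trained on."""
    # Translated copies of the English annotations, one list per target language.
    translated = {
        lang: [(translate(text, "en", lang), label) for text, label in english_labelled]
        for lang in target_langs
    }
    # Assumption: the hybrid set pools the English originals with all translations.
    pooled = list(english_labelled)
    for lang in target_langs:
        pooled.extend(translated[lang])
    return {
        "translate_train": translated,                     # (1) one model per language
        "translate_test": {"en": english_labelled},        # (2) one English model
        "zero_shot": {"multilingual": english_labelled},   # (3) multilingual model, English data
        "hybrid": {"multilingual": pooled},                # (4) multilingual model, pooled data
    }


def inference_text(strategy: str, tweet: str, lang: str, translate: Translator) -> str:
    """Return the text that would be passed to the classifier at prediction time."""
    # Only approach (2) translates unlabelled tweets into English before prediction.
    if strategy == "translate_test" and lang != "en":
        return translate(tweet, lang, "en")
    return tweet
```

Read this way, approach (2) moves the translation cost to the full unlabelled collection, while approaches (1) and (4) translate only the smaller annotated set and approach (3) translates nothing.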

Data Curation & Synthetic Data, Natural Language Processing

Year: 2026

Venue: N/A
