The paper introduces Exa-PSD, a new Persian sentiment analysis dataset collected from Twitter, comprising 12,000 tweets annotated with positive, neutral, and negative sentiment labels by five native Persian speakers. This dataset addresses the scarcity of general-domain Persian sentiment analysis resources, as existing datasets are often topic-specific. Evaluation using ParsBERT, RoBERTa, and LLMs achieved a macro F-score of 79.87%, demonstrating the dataset's utility for training and evaluating sentiment analysis models.
A new 12,000-tweet Persian sentiment analysis dataset, Exa-PSD, fills a critical gap for NLP research on Persian social media.
Today, social networks such as Twitter are among the most widely used platforms for communication. Analyzing the data generated on these platforms provides valuable insight into the opinions people express in their tweets. Sentiment analysis, a key task in Natural Language Processing (NLP), aims to identify individuals’ sentiments regarding specific topics. Despite recent advances in powerful language models, NLP for the Persian language still faces many challenges. The datasets available in Persian are generally limited to specific topics such as products, foods, and hotels, and overall there are few Persian datasets for sentiment analysis. To overcome these challenges, a Persian dataset for sentiment analysis on Twitter is needed. In this paper, we introduce the Exa Persian Sentiment analysis Dataset (Exa-PSD), which is collected from Persian tweets. The dataset contains 12,000 tweets annotated by 5 native Persian taggers. The data is labeled in 3 classes: positive, neutral, and negative. We present the characteristics and statistics of this dataset and use pre-trained ParsBERT, RoBERTa, and LLMs as base models to evaluate it. Our evaluation reached a macro F-score of 79.87, which suggests that the model and data are suitable for building a sentiment analysis system.
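For readers unfamiliar with the reported metric, the sketch below shows how a macro F-score is typically computed for a three-class task like this one: an F1 score is calculated for each class separately, and the three scores are averaged with equal weight, so a rare class counts as much as a common one. This is a minimal illustration, not the authors' evaluation code. The function name `macro_f1`, the label strings, and the toy predictions are assumptions made for the example and do not come from the paper.

```python
# Hypothetical label names; the paper only states that there are three
# classes: positive, neutral, and negative.
LABELS = ("positive", "neutral", "negative")


def macro_f1(y_true, y_pred, labels=LABELS):
    """Return the unweighted mean of per-class F1 scores."""
    f1_scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        # Count true positives, false positives, and false negatives
        # for the current class, treating it as "one vs. rest".
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)

        precision = tp / (tp + fp) if (tp + fp) else 0.0
        recall = tp / (tp + fn) if (tp + fn) else 0.0
        if precision + recall:
            f1 = 2 * precision * recall / (precision + recall)
        else:
            f1 = 0.0
        f1_scores.append(f1)

    # Every class contributes equally, regardless of how many tweets it has.
    return sum(f1_scores) / len(f1_scores)


# Toy data, invented for illustration only.
gold = ["positive", "positive", "neutral", "negative"]
pred = ["positive", "neutral", "neutral", "negative"]
score = macro_f1(gold, pred)
print(f"macro F1 = {score:.4f}")
```

In the toy data, one positive tweet is misclassified as neutral, which lowers the F1 of both the positive and the neutral class while the negative class stays perfect. A score reported as 79.87 corresponds to 0.7987 on this 0-to-1 scale.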