Credibility Models For Arabic Content On Twitter
- Publisher: Hamad bin Khalifa University Press (HBKU Press)
- Source: Qatar Foundation Annual Research Conference Proceedings, Volume 2014, Issue 1, November 2014, ITSP0773
Abstract
Microblogging websites such as Twitter have gained popularity as an effective and quick means of expressing opinions, sharing news, and promoting information and updates. As a result, data generated on Twitter has become a vital and rich source for tasks such as sentiment mining or newsgathering. However, a significant portion of such data is biased, untruthful, spam, or otherwise non-credible. Consequently, filtering out non-credible tweets becomes crucial when performing data analysis tasks on Twitter.

In this work, we present a credibility model for content on Twitter. Unlike previous work that focused on English content or factual tweets, our work analyzes the credibility of any tweet type and targets Arabic tweets; Arabic remains a challenging language for NLP in general. We focus on Arabic tweets due to the recent popularity of Twitter in the Arab world and the presence of a large portion of non-credible tweets in Arabic. We build a binary credibility classifier that classifies a tweet belonging to a given topic as either credible or non-credible. Our classifier relies on an exhaustive set of features extracted from both the author of the tweet (user-based) and the tweet itself (content-based).

To achieve our objective, we collected 36,155,670 tweets through the Twitter streaming API over a period of two weeks and created an index to search our tweet collection. Three topics about the Syrian revolution were retrieved from the collection and given to annotators. Unlike previous work, we provided annotators with a unique interface that presented real context for each tweet, such as the author profile and a Web search about the content of the tweet, which we deemed useful for judging the credibility of a tweet. Overall, 3,393 tweets were annotated for credibility using this interface. Next, we extracted 22 user-based features, such as the expertise of the author on the topic of the tweet, the time spacing between her previous tweets, and the count of followers. In addition, 22 content-based features were extracted, including sentiment, count of retweets, and count of URLs. Finally, we trained a set of classifiers based on all the features we extracted, using our annotated corpus of tweets as training data.

We evaluated our credibility classifiers using a series of carefully designed experiments. Using cross-validation on our three different topics and on a combined dataset that contains all the tweets from all the topics, our classifiers surpassed the accuracy of a number of baseline approaches by significant margins. We then applied feature reduction and normalization, which resulted in an additional marginal improvement in accuracy. Finally, to test the robustness of our chosen set of features, we evaluated our model using different training and testing sets. Our classifiers continued to consistently surpass the accuracy of the baseline approaches. Furthermore, we analyzed our feature set by comparing the accuracy of the classifier when trained on user-based features only versus content-based features only. Overall, content-based features alone yielded better accuracy than user-based features alone when tested on multiple topics.
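The collection step could be sketched roughly as follows. The abstract does not name a client library, so this assumes the legacy tweepy 3.x streaming interface, placeholder credentials, and a placeholder output path; Arabic tweets are selected client-side via the tweet's lang field.

```python
# A minimal sketch of the tweet-collection step, assuming the legacy
# tweepy 3.x streaming client (an assumption; the paper does not name one).
import json
import tweepy

class ArabicTweetListener(tweepy.StreamListener):
    """Appends each incoming Arabic tweet's JSON to a local file."""
    def __init__(self, out_path):
        super().__init__()
        self.out = open(out_path, "a", encoding="utf-8")

    def on_status(self, status):
        # Keep only Arabic-language tweets, as in the paper's collection.
        if status.lang == "ar":
            self.out.write(json.dumps(status._json, ensure_ascii=False) + "\n")

    def on_error(self, status_code):
        # Disconnect on rate limiting rather than keep reconnecting.
        if status_code == 420:
            return False

# Placeholder credentials: substitute real Twitter API keys.
auth = tweepy.OAuthHandler("CONSUMER_KEY", "CONSUMER_SECRET")
auth.set_access_token("ACCESS_TOKEN", "ACCESS_SECRET")

stream = tweepy.Stream(auth=auth, listener=ArabicTweetListener("tweets.jsonl"))
stream.sample()  # consume the public sample stream
```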
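The feature-extraction and classification steps might look like the scikit-learn sketch below. The specific learner (a random forest), the handful of example features, and the 10-fold cross-validation setting are illustrative assumptions, not the paper's stated choices; the actual model uses the full set of 22 user-based and 22 content-based features.

```python
# A minimal sketch of the binary credibility classifier, assuming
# scikit-learn and a small illustrative subset of the paper's features.
import re
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

URL_RE = re.compile(r"https?://\S+")

def extract_features(tweet):
    """Map one tweet (a dict of Twitter JSON) to a numeric feature vector."""
    user = tweet["user"]
    return [
        # User-based features (a subset, for illustration).
        user["followers_count"],
        user["statuses_count"],
        int(user["verified"]),
        # Content-based features (a subset, for illustration).
        tweet["retweet_count"],
        len(URL_RE.findall(tweet["text"])),
        len(tweet["text"]),
    ]

def train_credibility_classifier(tweets, labels):
    """Train and evaluate a credible (1) / non-credible (0) classifier."""
    X = np.array([extract_features(t) for t in tweets], dtype=float)
    y = np.array(labels)
    # Normalization mirrors the paper's normalization step; the random
    # forest is an assumed stand-in for the paper's classifiers.
    model = make_pipeline(StandardScaler(), RandomForestClassifier(n_estimators=200))
    scores = cross_val_score(model, X, y, cv=10)
    print(f"Mean cross-validation accuracy: {scores.mean():.3f}")
    return model.fit(X, y)
```

Restricting extract_features to only the user-based or only the content-based entries reproduces the kind of feature-set comparison described in the evaluation.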