
Abstract

Annotation Guidelines for Text Analytics in Social Media

A person's language use reveals much about their profile. However, research in author profiling has always been constrained by the limited availability of training data, since collecting textual data with the appropriate metadata requires a large collection and annotation effort (Maamouri et al. 2010; Diab et al. 2008; Hawwari et al. 2013). For every text, the characteristics of the author have to be known in order to successfully profile the author. Moreover, when the text is written in a dialectal variety, such as the Arabic text found online in social media, a representative dataset needs to be available for each dialectal variety (Zaghouani et al. 2012; Zaghouani et al. 2016).

The existing Arabic dialects are historically related to Classical Arabic, and they co-exist with Modern Standard Arabic in a diglossic relation. While Standard Arabic has a clearly defined set of orthographic standards, the various Arabic dialects have no official orthographies, and a given word can be written in multiple ways in different Arabic dialects (Maamouri et al. 2012; Jeblee et al. 2014).

This abstract presents the guidelines and annotation work carried out within the framework of the Arabic Author Profiling project (ARAP), a project that aims at developing author profiling resources and tools for a set of 12 regional Arabic dialects. We harvested our data from social media, which reflects a natural and spontaneous writing style in dialectal Arabic from users in different regions of the Arab world.

For the Arabic language and its dialectal varieties as found in social media, to the best of our knowledge, there is no corpus available for the detection of age, gender, native language and dialectal variety. Most of the existing resources are available for English or other European languages. Having a large amount of annotated data remains the key to reliable results in the task of author profiling.
In order to start the annotation process, we created guidelines for annotating the tweets according to their dialectal variety and the native language, gender and age of the user. Before starting the annotation process, we hired and trained a group of annotators and implemented a smooth annotation pipeline to optimize the annotation task. Finally, we followed a consistent annotation evaluation protocol to ensure a high inter-annotator agreement.

The annotations were done by carefully analyzing each user's profile and tweets and, when possible, we instructed the annotators to use external resources such as LinkedIn or Facebook. We created general profile validation guidelines and task-specific guidelines to annotate the users according to their gender, age, dialect and native language.

For some accounts, the annotators were not able to identify the gender, as this was based in most cases on the name of the person or their profile photo, and in some cases on their biography or profile description. When this information was not available, we instructed the annotators to read the user's posts and find linguistic indicators of the user's gender. Like many other languages, Arabic conjugates verbs through numerous prefixes and suffixes, and the gender is sometimes clearly marked, as in the case of words ending in taa marbuTa, which usually marks the feminine gender.

In order to annotate the users for their age, we used three categories: under 20 years, between 20 and 40 years, and 40 years and up. In our guidelines, we asked our annotators to try their best to annotate the exact age. For example, they could check the education history in a user's LinkedIn or Facebook profile and find when the user graduated from high school in order to estimate the user's age.
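As an illustration only, the three age categories and the graduation-year heuristic described above could be encoded as follows. This is a minimal sketch and not the project's actual tooling: the function names, the label strings and the default graduation age of 18 are assumptions made for the example. The category boundaries as stated overlap at exactly 40 years, so the sketch assumes that a 40-year-old falls in the oldest group.

```python
# Hypothetical label strings for the three age categories named in the text.
AGE_CATEGORIES = ("under 20", "20 to 40", "40 and up")


def age_category(age: int) -> str:
    """Map an age in years to one of the three annotation categories.

    Assumption: an age of exactly 40 is assigned to "40 and up",
    since the text lists 40 under both of the upper categories.
    """
    if age < 0:
        raise ValueError("age must be non-negative")
    if age < 20:
        return AGE_CATEGORIES[0]
    if age < 40:
        return AGE_CATEGORIES[1]
    return AGE_CATEGORIES[2]


def estimate_age(reference_year: int, graduation_year: int,
                 graduation_age: int = 18) -> int:
    """Estimate an age from a high-school graduation year.

    Assumption: users finish high school at about `graduation_age`
    (18 here); the real value varies by country and by person.
    """
    return reference_year - graduation_year + graduation_age


if __name__ == "__main__":
    # A user who graduated from high school in 2010, annotated in 2018.
    estimated = estimate_age(reference_year=2018, graduation_year=2010)
    print(estimated, age_category(estimated))
```

Keeping the estimated age alongside the category, rather than the category alone, would let the boundaries be revised later without re-annotating the profiles.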
As the dialect and the regions are known in advance to the annotators, we instructed them to double-check and mark the cases where the profile appears to be from a different dialect group. Such cases are possible despite our initial filtering based on distinctive regional keywords. We noticed that in more than 90% of the cases, the selected profiles belong to the specified dialect group.

Moreover, we asked the annotators to identify and mark Twitter profiles with a native language other than Arabic, so that these users are considered Arabic L2 speakers. To help the annotators identify them, we instructed them to look for various cues such as the writing style, the sentence structure, the word order and the spelling errors.

Acknowledgements

This publication was made possible by NPRP grant #9-175-1-033 from the Qatar National Research Fund (a member of Qatar Foundation). The statements made herein are solely the responsibility of the authors.

References

Diab, Mona, Aous Mansouri, Martha Palmer, Olga Babko-Malaya, Wajdi Zaghouani, Ann Bies and Mohamed Maamouri. 2008. A Pilot Arabic Propbank. In Proceedings of LREC 2008, Marrakech, Morocco, May 28-30, 2008.

Hawwari, A., W. Zaghouani, T. O'Gorman, A. Badran and M. Diab. 2013. Building a Lexical Semantic Resource for Arabic Morphological Patterns. In Communications, Signal Processing, and their Applications (ICCSPA), 2013, pp. 1-6, 12-14 Feb. 2013.

Jeblee, Serena, Houda Bouamor, Wajdi Zaghouani and Kemal Oflazer. 2014. CMUQ@QALB-2014: An SMT-based System for Automatic Arabic Error Correction. In Proceedings of the EMNLP 2014 Workshop on Arabic Natural Language Processing (ANLP), Doha, Qatar, October 2014.

Maamouri, Mohamed, Ann Bies, Seth Kulick, Wajdi Zaghouani, Dave Graff and Mike Ciul. 2010. From Speech to Trees: Applying Treebank Annotation to Arabic Broadcast News. In Proceedings of LREC 2010, Valletta, Malta, May 17-23, 2010.

Maamouri, Mohamed, Wajdi Zaghouani, Violetta Cavalli-Sforza, Dave Graff and Mike Ciul. 2012. Developing ARET: An NLP-based Educational Tool Set for Arabic Reading Enhancement. In Proceedings of the 7th Workshop on Innovative Use of NLP for Building Educational Applications, NAACL-HLT 2012, Montreal, Canada.

Obeid, Ossama, Wajdi Zaghouani, Behrang Mohit, Nizar Habash, Kemal Oflazer and Nadi Tomeh. 2013. A Web-based Annotation Framework for Large-Scale Text Correction. In Proceedings of IJCNLP'2013, Nagoya, Japan.

Zaghouani, Wajdi, Nizar Habash, Ossama Obeid, Behrang Mohit, Houda Bouamor and Kemal Oflazer. 2016. Building an Arabic Machine Translation Post-Edited Corpus: Guidelines and Annotation. In Proceedings of the International Conference on Language Resources and Evaluation (LREC'2016).

Zaghouani, Wajdi, Abdelati Hawwari and Mona Diab. 2012. A Pilot PropBank Annotation for Quranic Arabic. In Proceedings of the First Workshop on Computational Linguistics for Literature, NAACL-HLT 2012, Montreal, Canada.

/content/papers/10.5339/qfarc.2018.ICTPD879
2018-03-15
2024-11-19