Building a Rich Lexical Resource for Standard Arabic

Wajdi Zaghouani; Sawsan Alqahtani; Mona Diab

doi:10.5339/qfarc.2018.SSAHPD880

Abstract

Ambiguity is an inherent characteristic of natural languages: a given instance can be interpreted in more than one way. It is at the core of the problems faced by natural language processing applications (Obeid et al. 2013). Although humans can usually resolve such ambiguity from prior knowledge and context, some instances (sentences, words, etc.) require multiple readings before they can be resolved in context (Hawwari et al. 2013; Diab et al. 2008). The problem is further exacerbated by conventional orthographic decisions under which not all phonemes are explicitly represented (Maamouri et al. 2010; Maamouri et al. 2012). Standard Arabic orthography is one such case: it is underspecified for short vowels, gemination, glottal stops, etc., which are collectively represented as diacritics (Zaghouani et al. 2012; Zaghouani et al. 2016). Most typical Arabic text is rendered undiacritized, i.e., without explicit short vowels and other diacritics, which compounds the natural linguistic ambiguity of the language. Fully orthographically specified Modern Standard Arabic (MSA) consists of letters (consonants and long vowels) and diacritic marks. The diacritic marks comprise short vowels (u, i, a), gemination marks, nunation, and a mark for the absence of a vowel. These diacritics help denote the pronunciation and meaning of otherwise underspecified words (Jeblee et al. 2014). A resource that lists words in their typical underspecified form together with their possible meanings would be useful for multiple purposes, such as building NLP tools, psycho-linguistic and socio-linguistic studies, and pedagogical applications.
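To make the relation between diacritized and undiacritized forms concrete, the following is a minimal Python sketch, not the procedure used to build the resource. It assumes that the diacritics of interest are the eight Unicode marks U+064B to U+0652 (nunation, short vowels, gemination, and the no-vowel mark); the function names `undiacritize` and `group_by_undiacritized` are hypothetical.

```python
import re

# Assumption: the diacritics are the Unicode marks U+064B..U+0652, i.e.
# nunation (U+064B-U+064D), short vowels a/u/i (U+064E-U+0650),
# gemination (U+0651), and the mark for the absence of a vowel (U+0652).
DIACRITICS = re.compile("[\u064B-\u0652]")


def undiacritize(word: str) -> str:
    """Return the word with all diacritic marks removed."""
    return DIACRITICS.sub("", word)


def group_by_undiacritized(diacritized_words):
    """Map each undiacritized form to the set of diacritized forms that share it."""
    groups = {}
    for word in diacritized_words:
        groups.setdefault(undiacritize(word), set()).add(word)
    return groups


# Two diacritized words, written with escapes to avoid rendering issues:
kataba = "\u0643\u064E\u062A\u064E\u0628\u064E"  # "kataba" (he wrote)
kutub = "\u0643\u064F\u062A\u064F\u0628"         # "kutub" (books)

# Both collapse to the same three-letter undiacritized form.
groups = group_by_undiacritized([kataba, kutub])
```

In this toy example both words map to a single undiacritized key, which is exactly the situation in which a reader or an NLP system has to choose between alternatives.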
In this abstract, we present a monolingual lexical resource for MSA that provides, for each undiacritized word, its possible diacritic alternatives, part-of-speech (POS) information, frequency-of-usage information, and usage examples. It is a large-scale, automatically acquired inventory of words from multiple genres. The main objective of this inventory is to explicitly mark undiacritized forms of Arabic words when they are ambiguous. We use the morphological analyzer and disambiguator MADAMIRA to generate the desired features: POS, diacritic alternatives, and lemmas. Our lexical resource represents different aspects of ambiguity at the word level: POS (syntactic level) and diacritic alternatives (lexical level). At the syntactic level, a word is ambiguous if its undiacritized form can be given multiple possible POS tags; if only one POS tag is possible, the word is syntactically unambiguous. At the lexical level, an undiacritized word may have multiple readings either because several diacritizations are possible or because the same diacritized form has multiple meanings (similar to English "bank", which can mean a financial institution or a river bank). We account for all three ambiguity cases in the resource. The absence of diacritics adds a further layer of ambiguity in MSA: diacritics help specify the exact meaning, or at least reduce the number of possible senses, of a given undiacritized word. Although this sounds appealing and has proven beneficial in some tasks, full diacritization may also degrade the performance of some natural language processing applications and reduce human reading speed. We observed three types of ambiguity caused by diacritics: ambiguity within POS tags, ambiguity for the same grapheme without considering POS tags, and ambiguity related to case and mood information.
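The word-level ambiguity marking described above can be sketched as follows. This is an illustrative Python sketch under stated assumptions, not the authors' implementation: the input is assumed to be a list of already analyzed tokens given as `(undiacritized, diacritized, pos, genre)` tuples, standing in for the output of an analyzer such as MADAMIRA, and the function name `build_entries`, the dictionary keys, the transliterated words, and the POS and genre labels are all hypothetical.

```python
from collections import Counter, defaultdict


def build_entries(records):
    """Aggregate analyzed tokens into one entry per undiacritized word.

    records: iterable of (undiacritized, diacritized, pos, genre) tuples.
    """
    alt_counts = defaultdict(Counter)    # undiacritized -> {(diacritized, pos): count}
    genre_counts = defaultdict(Counter)  # undiacritized -> {genre: count}
    for undiac, diac, pos, genre in records:
        alt_counts[undiac][(diac, pos)] += 1
        genre_counts[undiac][genre] += 1

    entries = {}
    for undiac, counts in alt_counts.items():
        # Collect the distinct diacritized forms seen under each POS tag.
        forms_by_pos = defaultdict(set)
        for diac, pos in counts:
            forms_by_pos[pos].add(diac)
        entries[undiac] = {
            "frequency": sum(counts.values()),
            "alternatives": dict(counts),
            "genre_frequency": dict(genre_counts[undiac]),
            # Syntactic ambiguity: more than one possible POS tag.
            "pos_ambiguous": len(forms_by_pos) > 1,
            # Within-POS ambiguity: several diacritizations under one POS tag.
            "within_pos_ambiguous": any(len(f) > 1 for f in forms_by_pos.values()),
        }
    return entries


# Toy records in Latin transliteration (illustrative only).
records = [
    ("ktb", "kataba", "VERB", "news"),
    ("ktb", "kutiba", "VERB", "news"),
    ("ktb", "kutub", "NOUN", "literature"),
    ("fy", "fiy", "PREP", "news"),
]
entries = build_entries(records)
```

Here the entry for `ktb` is flagged as ambiguous both across POS tags and within the verb POS, while `fy` is flagged as unambiguous on both counts. Ambiguity arising from one diacritized form having several meanings is not captured by this sketch, since it would require sense information that the toy records do not carry.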
Of these three types of ambiguity, the last concerns the structural and grammatical level, whereas the first two are lexical and are the focus of this paper. It has been claimed that frequency may play a significant role in disambiguation: frequently occurring words tend to be less ambiguous, and such frequency varies with genre. The resource provides three types of frequencies (for undiacritized words, for diacritized words, and for diacritized words within a particular POS), along with fine-grained frequencies for each genre so that researchers can select the genres suited to their studies. The resource reveals the gaps in the frequency distributions among the alternatives for each undiacritized word; in some cases, several alternatives for the same undiacritized word have equal or nearly equal frequencies. The main objective of this lexical resource is to support lexical decision making by explicitly marking within-POS ambiguity, that is, cases where the same undiacritized word has multiple diacritic alternatives within a particular POS. It also provides automatically generated lexical information, including diacritic alternatives, POS, word length, and frequencies (within and across corpora of different domains and genres); it explicitly marks undiacritized words that have multiple possible POS tags, and it provides usage examples. This resource will be used in readability experiments in which we evaluate the impact of ambiguity and of the level of diacritization on human reading.

References

Diab Mona, Aous Mansouri, Martha Palmer, Olga Babko-Malaya, Wajdi Zaghouani, Ann Bies, Mohammed Maamouri. 2008. A Pilot Arabic Propbank. In Proceedings of LREC 2008, Marrakech, Morocco, May 28-30, 2008.

Hawwari, A.; Zaghouani, W.; O'Gorman, T.; Badran, A.; Diab, M. 2013. Building a Lexical Semantic Resource for Arabic Morphological Patterns. In Communications, Signal Processing, and their Applications (ICCSPA), 2013, pp. 1-6, 12-14 Feb. 2013.

Jeblee Serena, Houda Bouamor, Wajdi Zaghouani, Kemal Oflazer. 2014. CMUQ@QALB-2014: An SMT-based System for Automatic Arabic Error Correction. In Proceedings of the EMNLP 2014 Workshop on Arabic Natural Language Processing (ANLP), Doha, Qatar, October 2014.

Maamouri Mohamed, Ann Bies, Seth Kulick, Wajdi Zaghouani, Dave Graff and Mike Ciul. 2010. From Speech to Trees: Applying Treebank Annotation to Arabic Broadcast News. In Proceedings of LREC 2010, Valletta, Malta, May 17-23, 2010.

Maamouri Mohammed, Wajdi Zaghouani, Violetta Cavalli-Sforza, Dave Graff and Mike Ciul. 2012. Developing ARET: An NLP-based Educational Tool Set for Arabic Reading Enhancement. In Proceedings of The 7th Workshop on Innovative Use of NLP for Building Educational Applications, NAACL-HLT 2012, Montreal, Canada.

Obeid Ossama, Wajdi Zaghouani, Behrang Mohit, Nizar Habash, Kemal Oflazer and Nadi Tomeh. 2013. A Web-based Annotation Framework For Large-Scale Text Correction. In Proceedings of IJCNLP 2013, Nagoya, Japan.

Zaghouani Wajdi, Nizar Habash, Ossama Obeid, Behrang Mohit, Houda Bouamor, Kemal Oflazer. 2016. Building an Arabic machine translation post-edited corpus: Guidelines and annotation. In Proceedings of the International Conference on Language Resources and Evaluation (LREC 2016).

Zaghouani Wajdi, Abdelati Hawwari and Mona Diab. 2012. A Pilot PropBank Annotation for Quranic Arabic. In Proceedings of the first workshop on Computational Linguistics for Literature, NAACL-HLT 2012, Montreal, Canada.
