QALB: Qatar Arabic Language Bank

Behrang Mohit

doi:10.5339/qfarf.2013.ICTP-032

Authors: Behrang Mohit¹

Affiliations: ¹ Carnegie Mellon University in Qatar, Doha, Qatar
Publisher: Hamad bin Khalifa University Press (HBKU Press)
Source: Qatar Foundation Annual Research Forum Proceedings, Qatar Foundation Annual Research Forum Volume 2013 Issue 1, November 2013, Volume 2013, ICTP-032
DOI https://doi.org/10.5339/qfarf.2013.ICTP-032

Abstract

Automatic text correction has been attracting research attention for English and some other Western languages. Its applications range from improving language learning for humans, to reducing noise in text fed to natural language processing tools, to correcting grammatical and lexical choice errors in machine translation output. Despite the recent focus on some Arabic language technologies, Arabic automatic correction is still a fairly understudied research problem. Modern Standard Arabic (MSA) is a morphologically and syntactically complex language, which poses multiple writing challenges not only to language learners, but also to Arabic speakers, whose dialects differ substantially from MSA. We are currently creating resources to address these challenges. Our project has two components: the first is QALB (Qatar Arabic Language Bank), a large parallel corpus of Arabic sentences and their corrections, and the second is ACLE (Automatic Correction of Language Errors), an Arabic text correction system trained and tested on the QALB corpus. The QALB corpus is unique in that: a) it will be the largest Arabic text correction corpus available, spanning two million words; b) it will cover errors produced by native speakers, non-native speakers, and machine translation systems; and c) it will contain a trace of all the actions performed by the human annotators to achieve the final correction. This presentation describes the creation of two major components of the project: the web-based annotation interface and the annotation guidelines. QAWI (QALB Annotation Web Interface) is our web-based, language-independent annotation framework used for manual correction of the QALB corpus. Our framework provides intuitive interfaces for annotating text, managing a large number of human annotators, and performing quality control.
Our annotation interface, in particular, provides a novel token-based editing model for correcting Arabic text that allows us to reliably track all modifications. We demonstrate details of both the annotation and the administration interfaces as well as the back-end engine. Furthermore, we show how this framework is able to speed up the annotation process by employing automated annotators to correct basic Arabic spelling errors. We also discuss the evolution of our annotation guidelines from their early development through their actual use for group annotation. The guidelines cover a variety of linguistic phenomena, from spelling errors to dialectal variations and grammatical considerations. The guidelines also include a large number of examples to help annotators understand the general principles behind the correction rules and not simply memorize them. The guidelines were written in parallel with the development of our web-based annotation interface and involved several iterations and revisions. We periodically provided new training sessions to the annotators and measured their inter-annotator agreement. Furthermore, the guidelines were updated and extended using feedback from the annotators and the inter-annotator agreement evaluations. This project is supported by the National Priority Research Program (NPRP grant 4-1058-1-168) of the Qatar National Research Fund (a member of the Qatar Foundation). The statements made herein are solely the responsibility of the authors.
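
To make the idea of a token-based editing model with a full action trace more concrete, here is a minimal sketch in Python. It is not QAWI's actual data model, which the abstract does not describe: the names `Action`, `TokenEditor`, and `replay`, and the three action kinds (edit, insert, delete), are assumptions chosen for illustration. The sketch only shows how recording every token-level operation lets the final correction be reproduced from the original text. English tokens are used in the example for readability.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Action:
    """One recorded annotator operation (hypothetical schema)."""
    kind: str                 # "edit", "insert", or "delete"
    index: int                # token position at the time of the action
    old: Optional[str]        # token before the action (None for insert)
    new: Optional[str]        # token after the action (None for delete)


@dataclass
class TokenEditor:
    """Holds a token list and logs every modification made to it."""
    tokens: List[str]
    trace: List[Action] = field(default_factory=list)

    def edit(self, index: int, new: str) -> None:
        old = self.tokens[index]
        self.tokens[index] = new
        self.trace.append(Action("edit", index, old, new))

    def insert(self, index: int, new: str) -> None:
        self.tokens.insert(index, new)
        self.trace.append(Action("insert", index, None, new))

    def delete(self, index: int) -> None:
        old = self.tokens.pop(index)
        self.trace.append(Action("delete", index, old, None))


def replay(original: List[str], trace: List[Action]) -> List[str]:
    """Rebuild the corrected tokens by re-applying a trace to the original."""
    editor = TokenEditor(list(original))
    for action in trace:
        if action.kind == "edit":
            editor.edit(action.index, action.new)
        elif action.kind == "insert":
            editor.insert(action.index, action.new)
        elif action.kind == "delete":
            editor.delete(action.index)
        else:
            raise ValueError(f"unknown action kind: {action.kind}")
    return editor.tokens


if __name__ == "__main__":
    original = "I has a a cat".split()
    editor = TokenEditor(list(original))
    editor.edit(1, "have")    # fix an agreement error
    editor.delete(2)          # drop the duplicated token
    print(editor.tokens)
    print(replay(original, editor.trace) == editor.tokens)
```

Because each action stores the position and the old and new token, a trace of this kind can be replayed or inspected after the fact, which is the property the abstract highlights for the corpus.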


2013-11-20
