Abstract

Automatic text correction has attracted research attention for English and some other Western languages. Its applications range from improving language learning for humans, to reducing noise in text fed to natural language processing tools, to correcting grammatical and lexical choice errors in machine translation output. Despite the recent focus on some Arabic language technologies, Arabic automatic correction is still a fairly understudied research problem. Modern Standard Arabic (MSA) is a morphologically and syntactically complex language, which poses multiple writing challenges not only to language learners but also to Arabic speakers, whose dialects differ substantially from MSA. We are currently creating resources to address these challenges. Our project has two components: the first is QALB (Qatar Arabic Language Bank), a large parallel corpus of Arabic sentences and their corrections, and the second is ACLE (Automatic Correction of Language Errors), an Arabic text correction system trained and tested on the QALB corpus. The QALB corpus is unique in that: a) it will be the largest Arabic text correction corpus available, spanning two million words; b) it will cover errors produced by native speakers, non-native speakers, and machine translation systems; and c) it will contain a trace of all the actions performed by the human annotators to achieve the final correction. This presentation describes the creation of two major components of the project: the web-based annotation interface and the annotation guidelines. QAWI (QALB Annotation Web Interface) is our web-based, language-independent annotation framework used for manual correction of the QALB corpus. Our framework provides intuitive interfaces for annotating text, managing a large number of human annotators, and performing quality control.
Our annotation interface, in particular, provides a novel token-based editing model for correcting Arabic text that allows us to reliably track all modifications. We present details of both the annotation and the administration interfaces as well as the back-end engine. Furthermore, we show how this framework speeds up the annotation process by employing automated annotators to correct basic Arabic spelling errors. We also discuss the evolution of our annotation guidelines from their early development through their actual use in group annotation. The guidelines cover a variety of linguistic phenomena, from spelling errors to dialectal variations and grammatical considerations. They also include a large number of examples to help annotators understand the general principles behind the correction rules rather than simply memorize them. The guidelines were written in parallel with the development of our web-based annotation interface and went through several iterations and revisions. We periodically provided new training sessions to the annotators and measured their inter-annotator agreement. Furthermore, the guidelines were updated and extended using feedback from the annotators and the inter-annotator agreement evaluations. This project is supported by the National Priority Research Program (NPRP grant 4-1058-1-168) of the Qatar National Research Fund (a member of the Qatar Foundation). The statements made herein are solely the responsibility of the authors.

/content/papers/10.5339/qfarf.2013.ICTP-032
2013-11-20
2024-12-24