Toward a Cognitive Evaluation Approach for Machine Translation Post-Editing
- Publisher: Hamad bin Khalifa University Press (HBKU Press)
- Source: Qatar Foundation Annual Research Conference Proceedings, Volume 2018, Issue 3, Mar 2018, ICTPD885
Abstract
Machine Translation (MT) is used more and more by professional translators, including freelancers, companies, and official organisations such as the European Parliament. MT output, however, especially from publicly available MT engines such as Google Translate, is well known to contain errors and to lack fluency from the point of view of human expectations. For this reason, MT-translated texts often need manual (or automatic) correction, known as 'Post-Editing' (PE).

Although there are fast and simple measures of post-editing cost, such as time to post-edit or edit distance, these measures do not reflect the cognitive difficulty involved in correcting the specific errors in the MT output. As MT output texts can be of different quality, and can thus contain errors of varying difficulty to correct, fair compensation for post-editing should take the difficulty of the task into account, and that difficulty should be measured as reliably as possible. The best solution would be an automatic classifier which (a) assigns each MT error to a specific correction class, (b) assigns an effort value reflecting the cognitive effort a post-editor needs in order to make such a correction, and (c) produces a post-editing effort score for a text.

As a step towards building such a classifier, we investigate whether an existing cognitive effort model could provide fairer compensation for the post-editor, by testing it on a new language which differs strongly from the languages on which this methodology was previously tested. The model used the Statistical Machine Translation (SMT) error classification schema, whose error classes were subsequently re-grouped and ranked in increasing order, so as to reflect the cognitive load post-editors experience while correcting the MT output. The re-grouping and ranking were based on the relevant psycholinguistic error correction literature.
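The three-step classifier idea described above (error class, effort value, per-text score) can be sketched as follows. The error classes and effort ranks below are illustrative placeholders, not the ranking actually used in the study:

```python
# (a) illustrative error classes paired with (b) hypothetical effort ranks,
# ordered by assumed increasing cognitive load; these values are examples
# only and do not reproduce the ranking from the psycholinguistic literature.
EFFORT_RANK = {
    "punctuation": 1,
    "wrong_word_form": 2,
    "wrong_lexical_choice": 3,
    "missing_word": 4,
    "word_order": 5,
}

def post_editing_effort(error_counts):
    """(c) Aggregate a per-text effort score from per-class error counts."""
    return sum(EFFORT_RANK[cls] * n for cls, n in error_counts.items())

# Example: a text with 2 word-order errors and 3 punctuation errors
print(post_editing_effort({"word_order": 2, "punctuation": 3}))  # 13
```

Under this sketch, a text dominated by word-order errors scores higher (i.e. is costlier to post-edit) than one with the same number of punctuation errors, which is the intuition behind ranking the classes by cognitive load.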
The aim of this approach is to provide a better metric for the effort a post-editor faces while correcting MT texts, instead of relying on a non-transparent MT evaluation score such as BLEU. The approach does not depend on specific software, in contrast to PE cognitive evaluation approaches based on keystroke logging or eye-tracking. Furthermore, it is more objective than approaches which rely on human judgements of perceived post-editing effort. In its essence, it is similar to other error classification approaches; it is enriched, however, by error ranking, based on information about which errors require more cognitive effort to correct and which less. The approach thus only requires counting the number of errors of each type in the MT output, which allows comparing the post-editing cost of different output texts of the same MT engine, of the same text as output by different MT engines, or of different language pairs.

Temnikova et al. (2010) tested this approach on two emergency instruction texts, one original (called 'Complex') and one manually simplified (called 'Simplified') according to Controlled Language (CL) text simplification rules. Both texts were translated with the web version of Google Translate into three languages: Russian, Spanish, and Bulgarian. The MT output was manually post-edited by 3-5 human translators per language, and the number of errors per category was then counted manually by one annotator per language.

Several researchers have based their work on Temnikova's cognitive evaluation approach. Among them, Koponen et al. (2012) modified the error classification by adding one additional class. Lacruz and Munoz (2014) enriched the original error ranking/classification with numerical weights from 1 to 9, which showed a good correlation with another metric they used (Pause to Word Ratio), but they did not normalize the scores per text length.
The weights were added to form a single score per text, called Mental Load (ML).

Compared to our previous work, the work presented in this abstract makes the following contributions: (1) we separate the Controlled Language (CL) evaluation, present in Temnikova's earlier work, from the MT evaluation, and apply the approach to MT evaluation only; (2) we test the error classification and ranking method on a new (non-Indo-European) language, Modern Standard Arabic (MSA); (3) we increase the number of annotators and the amount of textual data; and (4) we test the approach on new text genres (news articles).

As a step towards building a classifier that would assign post-editing effort scores to new texts, we conducted a new experiment to test whether the previously introduced approach also applies to Arabic, a language different from those for which the cognitive evaluation model was initially developed. The results confirmed once again that MT texts of different translation quality exhibit different distributions of error categories: texts of lower MT quality contain more errors, as well as error categories which are more difficult to correct (e.g. word order errors). The results also showed some variation in the presence of certain error categories, which we consider typical for Arabic. The comparison of texts of better MT quality showed similar results across all four languages (Modern Standard Arabic, Russian, Spanish, and Bulgarian), which indicates that the approach can be applied without modification to non-Indo-European languages in order to distinguish texts of better MT quality from those of worse quality.

In future work, we plan to adapt the error categories to Arabic (e.g., by adding a 'merge tokens' category), in order to test whether such language-specific adaptation leads to better results for Arabic.
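The weighted Mental Load idea mentioned above, together with the per-length normalization it lacked, can be sketched as follows. The weight values here are hypothetical stand-ins in the 1-9 range, not the actual weights from Lacruz and Munoz (2014):

```python
# Hypothetical per-class weights in the 1-9 range, in the spirit of the
# Mental Load (ML) score; the real weights from the literature differ.
ML_WEIGHTS = {
    "punctuation": 1,
    "word_form": 3,
    "lexical_choice": 5,
    "word_order": 9,
}

def mental_load(error_counts, text_length_words, normalize=True):
    """Weighted sum of errors; optionally normalized per word, which
    addresses the missing text-length normalization noted above."""
    ml = sum(ML_WEIGHTS[cls] * n for cls, n in error_counts.items())
    return ml / text_length_words if normalize else ml

# Example: 1 word-order error and 2 punctuation errors in a 100-word text
raw = mental_load({"word_order": 1, "punctuation": 2}, 100, normalize=False)
per_word = mental_load({"word_order": 1, "punctuation": 2}, 100)
```

Normalizing by text length makes ML scores comparable across texts of different sizes, which matters when ranking several MT outputs of one engine or the same text across engines.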
We plan to use a much bigger dataset and to extract most of the categories automatically. We also plan to assign weights and develop a single post-editing cognitive difficulty score for MT output texts. We are confident that this will provide a fair estimation of the cognitive effort required from post-editors to edit such texts, and will help translators receive fair compensation for their work.