Translate or Transliterate? Modeling the Decision For English to Arabic Machine Translation

Mahmoud Azab

doi:10.5339/qfarf.2013.ICTP-066

Authors: Mahmoud Azab¹

¹ Carnegie Mellon University, Doha, Qatar
Publisher: Hamad bin Khalifa University Press (HBKU Press)
Source: Qatar Foundation Annual Research Forum Proceedings, Qatar Foundation Annual Research Forum Volume 2013 Issue 1, Nov 2013, Volume 2013, ICTP-066
DOI https://doi.org/10.5339/qfarf.2013.ICTP-066

Translation of named entities (NEs) is important for NLP applications such as Machine Translation (MT) and Cross-lingual Information Retrieval. For MT, named entities are a major subset of the out-of-vocabulary terms. Due to their diversity, they cannot always be found in parallel corpora, dictionaries or gazetteers. Thus, state-of-the-art MT systems need to handle NEs in specific ways: (i) direct translation, which misses many out-of-vocabulary terms, and (ii) blind transliteration of out-of-vocabulary terms, which does not necessarily contribute to translation adequacy and may actually create noisy contexts for the language model and the decoder. For example, in the sentence "Dudley North visits North London", the MT system is expected to transliterate "North" in the former case and translate "North" in the latter. In this work, we present a classification-based framework that enables an MT system to automate the decision of translation vs. transliteration for different categories of NEs. We model the decision as a binary classification at the token level: each token within a named entity receives a label indicating whether it should be translated or transliterated. Training the classifier requires a set of NEs with token-level decision labels. For this purpose, we automatically construct a bilingual lexicon of NEs paired with translation/transliteration decisions from two different domains: we heuristically extract and label parallel NEs from a large word-aligned parallel news corpus, and we use a lexicon of bilingual NEs collected from Arabic and English Wikipedia titles. We then designed a procedure to clean up the noisy Arabic NE spans by part-of-speech verification and by heuristically filtering out impossible items (e.g. verbs). For training, the data is automatically annotated using a variant of edit distance that measures the similarity between an English word and its Arabic transliteration. For the test set, we manually reviewed the labels and fixed the incorrect ones.
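The automatic labeling step can be illustrated with a minimal sketch. The abstract does not specify the edit-distance variant, so the code below is an assumption throughout: it uses plain Levenshtein distance over consonant skeletons, and the `AR2LAT` map, the `0.5` threshold, and all function names are hypothetical choices made for illustration, not the author's procedure.

```python
# Hypothetical, deliberately incomplete Arabic-to-Latin letter map (illustration only).
AR2LAT = {
    "ا": "a", "ب": "b", "ت": "t", "ث": "th", "ج": "j", "د": "d",
    "ر": "r", "ز": "z", "س": "s", "ش": "sh", "ف": "f", "ق": "q",
    "ك": "k", "ل": "l", "م": "m", "ن": "n", "ه": "h", "و": "w", "ي": "y",
}

# Letters dropped before comparison; w and y are included because
# the Arabic letters mapped to them often stand for long vowels.
VOWEL_LIKE = set("aeiouwy")


def romanize(arabic: str) -> str:
    """Map each Arabic letter to Latin; letters missing from the map are skipped."""
    return "".join(AR2LAT.get(ch, "") for ch in arabic)


def skeleton(latin: str) -> str:
    """Lowercase the word and keep only its consonants."""
    return "".join(
        ch for ch in latin.lower() if ch.isalpha() and ch not in VOWEL_LIKE
    )


def edit_distance(a: str, b: str) -> int:
    """Standard Levenshtein distance with unit costs."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def similarity(english: str, arabic: str) -> float:
    """Normalized similarity in [0, 1] between the two consonant skeletons."""
    a, b = skeleton(english), skeleton(romanize(arabic))
    if not a or not b:
        return 0.0
    return 1.0 - edit_distance(a, b) / max(len(a), len(b))


def label(english: str, arabic: str, threshold: float = 0.5) -> str:
    """Label an aligned token pair; the threshold is an assumed value."""
    if similarity(english, arabic) >= threshold:
        return "transliterate"
    return "translate"


print(label("North", "نورث"))  # aligned with a phonetic rendering
print(label("North", "شمال"))  # aligned with the Arabic word for "north"
```

Under these assumptions, a token aligned with a phonetically similar Arabic word is labeled for transliteration, while a token aligned with a dissimilar word is labeled for translation. A real system would need a far more complete letter mapping and a tuned threshold.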
As part of our project, this bilingual corpus of named entities has been released to the research community. Using Support Vector Machines, we trained the classifier on a set of token-based, contextual and semantic features of the NEs. We evaluated our classifier in both the limited news domain and the diverse Wikipedia domain, and achieved a promising accuracy of 89.1%. To study the utility of our classifier in an English-to-Arabic statistical MT (SMT) system, we deployed it as a pre-translation component of the MT system. We automatically located the NEs in the source-language sentences and used the classifier to find those that should be transliterated. For such terms, we offer the transliterated form as an option to the decoder. Adding the classifier to the SMT pipeline resulted in a major reduction of out-of-vocabulary terms and a modest improvement in BLEU score.

This research is supported by the Qatar National Research Fund (a member of the Qatar Foundation) through grants NPRP-09-1140-1-177 and YSREP-1-018-1-004. The statements made herein are solely the responsibility of the authors.
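The pre-translation step described above might look roughly like the following sketch. Everything here is a stand-in: `toy_classifier` replaces the trained SVM with a hand-written rule, `toy_transliterate` is a lookup table with illustrative Arabic spellings, and the NE spans are supplied by hand rather than found by an NE tagger. The way options are actually passed to the decoder is not described in the abstract and is not modeled.

```python
from typing import Callable, List, Optional, Tuple

# (start index, end index exclusive, NE type); hypothetical span format.
Span = Tuple[int, int, str]


def annotate(
    tokens: List[str],
    spans: List[Span],
    should_transliterate: Callable[[List[str], int, str], bool],
    transliterate: Callable[[str], str],
) -> List[Tuple[str, Optional[str]]]:
    """Attach a transliteration option to every NE token the classifier flags.

    Tokens outside NE spans, and NE tokens labeled "translate", get None,
    meaning the decoder is left to translate them as usual.
    """
    options: List[Optional[str]] = [None] * len(tokens)
    for start, end, ne_type in spans:
        for i in range(start, end):
            if should_transliterate(tokens, i, ne_type):
                options[i] = transliterate(tokens[i])
    return list(zip(tokens, options))


def toy_classifier(tokens: List[str], i: int, ne_type: str) -> bool:
    """Stand-in for the trained classifier (assumption, not the paper's model)."""
    if ne_type == "PER":
        return True  # toy rule: always transliterate tokens of person names
    # toy rule: compass words inside other NEs are translated
    return tokens[i].lower() not in {"north", "south", "east", "west"}


# Illustrative spellings only; a real system would call a transliteration module.
TOY_TRANSLIT = {"Dudley": "دادلي", "North": "نورث", "London": "لندن"}


def toy_transliterate(token: str) -> str:
    return TOY_TRANSLIT.get(token, token)


tokens = "Dudley North visits North London".split()
spans = [(0, 2, "PER"), (3, 5, "LOC")]  # hand-marked NE spans
annotated = annotate(tokens, spans, toy_classifier, toy_transliterate)
for token, option in annotated:
    print(token, option)
```

In this toy run the first "North" (part of a person name) receives a transliteration option, while the second "North" (part of a location) receives none and is left for the decoder to translate.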


2013-11-20

