Unsupervised Arabic word segmentation and statistical machine translation

Hanan Alshikhabobakr

doi:10.5339/qfarf.2013.ICTP-050

Affiliations: ¹ Carnegie Mellon University in Qatar, Doha, QATAR
Publisher: Hamad bin Khalifa University Press (HBKU Press)
Source: Qatar Foundation Annual Research Forum Proceedings, Qatar Foundation Annual Research Forum Volume 2013, Issue 1, Nov 2013, ICTP-050
DOI: https://doi.org/10.5339/qfarf.2013.ICTP-050

Abstract

Word segmentation is a necessary step for natural language processing applications such as machine translation and parsing. In this research we focus on Arabic word segmentation and study its impact on Arabic-to-English translation. Accurate word segmentation systems exist for Arabic, such as MADA (Habash, 2007), but such systems usually require manually built data and rules specific to the Arabic language. In this work we examine how well unsupervised word segmentation systems perform on Arabic without relying on any linguistic information about the language. The methodology can be applied to many other morphologically complex languages. We focus on three leading unsupervised word segmentation systems proposed in the literature: Morfessor (Creutz and Lagus, 2002), ParaMor (Monson, 2007), and Demberg's system (Demberg, 2007). We also use two different segmentation schemes of the state-of-the-art MADA and compare their precision with that of the unsupervised systems. After training the three unsupervised segmentation systems, we apply the resulting models to segment the Arabic side of the parallel data for Arabic-to-English statistical machine translation (SMT) and measure the impact on translation quality. We also build SMT systems using the two MADA segmentation schemes to compare against the baseline system. The 10-fold cross-validation results indicate that the unsupervised segmentation systems are usually inaccurate, with precision below 40%, and hence do not improve SMT quality. Both MADA segmentation schemes, by comparison, have very high precision. Of the two, the scheme with a measured segmentation framework improved translation accuracy, whereas the scheme that performs more aggressive segmentation failed to improve SMT quality. We also provide some rule-based supervision to correct some of the errors in our best unsupervised models. While this framework performs better than the baseline unsupervised systems, it still does not outperform the baseline MT quality. We conclude that, in our unsupervised framework, the noise introduced by unsupervised segmentation offsets the potential gains that segmentation could provide to MT, and that measured supervised word segmentation improves Arabic-to-English translation quality. In contrast, aggressive and exhaustive segmentation introduces new noise into the MT framework and actually harms its quality.

This publication was made possible by the generous support of the Qatar Foundation through Carnegie Mellon University's Seed Research program provided to Kemal Oflazer. The statements made herein are solely the responsibility of the authors.
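
To make the precision figures in the abstract more concrete, here is a minimal Python sketch of one common way to score a segmenter against a reference segmentation. The abstract does not say how precision was computed, so the boundary-level definition used here (correctly proposed morpheme boundaries divided by all proposed boundaries) is an assumption, as are the function names `boundaries` and `boundary_precision` and the toy Buckwalter-style transliterated words, which are illustrative only and not data from the study.

```python
from typing import List, Set


def boundaries(segments: List[str]) -> Set[int]:
    """Return the character offsets of the internal boundaries of one word.

    For ["w", "s", "yktbwn"] the boundaries fall after offsets 1 and 2.
    """
    cuts: Set[int] = set()
    position = 0
    for segment in segments[:-1]:  # the end of the word is not a boundary
        position += len(segment)
        cuts.add(position)
    return cuts


def boundary_precision(predicted: List[List[str]],
                       reference: List[List[str]]) -> float:
    """Fraction of proposed boundaries that also appear in the reference.

    Assumes both lists cover the same words in the same order.
    """
    correct = 0
    proposed = 0
    for pred_word, ref_word in zip(predicted, reference):
        pred_cuts = boundaries(pred_word)
        ref_cuts = boundaries(ref_word)
        proposed += len(pred_cuts)
        correct += len(pred_cuts & ref_cuts)
    # A segmenter that proposes no boundaries gets precision 0.0 by convention.
    return correct / proposed if proposed else 0.0


# Hypothetical toy data: two transliterated words, each as a list of segments.
reference_segmentation = [["w", "s", "yktbwn"], ["Al", "ktAb"]]
predicted_segmentation = [["w", "sy", "ktbwn"], ["Al", "ktAb"]]

precision = boundary_precision(predicted_segmentation, reference_segmentation)
print(f"boundary precision: {precision:.3f}")
```

In this toy case the predicted segmentation proposes three boundaries, two of which match the reference. A cross-validated figure such as the one quoted in the abstract would presumably average a score of this kind over held-out folds, though the exact protocol is not described there.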

2013-11-20
