Automatic long audio alignment for conversational Arabic speech
- Publisher: Hamad bin Khalifa University Press (HBKU Press)
- Source: Qatar Foundation Annual Research Forum Proceedings, Volume 2013, Issue 1, November 2013, ICTP-03
Abstract
Long audio alignment is a well-known problem in speech processing in which the goal is to align a long audio input with its corresponding text. Accurate alignments help in many speech processing tasks such as audio indexing, acoustic model training for speech recognizers, and audio summarization and retrieval. In this work, we have collected more than 1400 hours of conversational Arabic speech extracted from Al-Jazeerah podcasts, together with the corresponding non-aligned text transcriptions. Podcast length varies from 20 to 50 minutes per episode. Five episodes have been manually aligned for use in evaluating alignment accuracy.

For each episode, a split-and-merge segmentation approach is applied to divide the audio file into small segments with an average length of 5 seconds, with filled pauses falling on segment boundaries. A pre-processing stage is applied to the corresponding raw transcriptions to remove titles, headings, images, speaker names, etc. A biased language model (LM) is trained on the fly using the processed text. Conversational Arabic speech is mostly spontaneous and influenced by dialectal Arabic; since phonemic pronunciation modeling is not always possible for non-standard Arabic words, a graphemic pronunciation model (PM) is used to generate one pronunciation variant per word. Unsupervised acoustic model adaptation is applied to a pre-trained Arabic acoustic model using the current podcast's audio. The adapted acoustic model (AM), together with the biased LM and the graphemic PM, is used in a fast speech recognition pass over the current podcast's segments. The recognizer's output is aligned with the processed transcriptions using the Levenshtein distance algorithm. This ensures error recovery, since misalignment of one segment does not affect the alignment of later segments.

The proposed approach resulted in an alignment accuracy of 97% on the evaluation set. Most misalignment errors occurred in segments with significant background noise (music, channel noise, cross-talk, etc.) or significant speech disfluencies (truncated words, repeated words, hesitations, etc.). For some speech processing tasks, such as acoustic model training, misaligned segments must be eliminated from the training data, so a confidence scoring metric is proposed to accept or reject the aligner's output. The score is computed for each segment and is essentially the minimum edit distance between the recognizer's output and the aligned text. Using confidence scores, it was possible to reject the majority of misaligned segments, resulting in 99% alignment accuracy.

This work was funded by a grant from the Qatar National Research Fund under its National Priorities Research Program (NPRP), award number NPRP 09-410-1-069. The reported experimental work was performed at Qatar University in collaboration with the University of Illinois.
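As a rough illustration of the segment-level confidence score described above (the minimum edit distance between the recognizer's output and the aligned text), the sketch below computes a word-level Levenshtein distance and filters segments against a threshold. This is a minimal sketch, not the authors' implementation: the function names, the `asr_output`/`aligned_text` keys, the normalization, and the threshold of 0.5 are illustrative assumptions.

```python
def levenshtein(ref_words, hyp_words):
    """Word-level minimum edit distance between two token sequences."""
    n, m = len(ref_words), len(hyp_words)
    # dp[i][j] = edit distance between ref_words[:i] and hyp_words[:j]
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dp[i][0] = i
    for j in range(m + 1):
        dp[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if ref_words[i - 1] == hyp_words[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost) # substitution
    return dp[n][m]


def segment_confidence(recognized_text, aligned_text):
    """Confidence in [0, 1]: 1.0 means the recognizer output matches the
    aligned transcription exactly; lower values mean more edits."""
    ref = aligned_text.split()
    hyp = recognized_text.split()
    dist = levenshtein(ref, hyp)
    return 1.0 - dist / max(len(ref), len(hyp), 1)


def filter_segments(segments, threshold=0.5):
    """Keep only segments whose confidence meets the (assumed) threshold,
    e.g. before using the aligned data for acoustic model training."""
    return [seg for seg in segments
            if segment_confidence(seg["asr_output"], seg["aligned_text"]) >= threshold]
```

The abstract describes the score simply as the min-edit distance per segment; normalizing by segment length, as done here, is one possible way to make a single accept/reject threshold applicable to segments of different lengths.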