
Abstract

This work develops a contextual spellchecker to improve the correctness of input queries to a multi-lingual, cross-cultural robot receptionist system. Queries with fewer misspellings improve the robot's ability to answer them and, in turn, the effectiveness of the human-robot interaction. We focus on an n-gram-model-based contextual spellchecker that corrects misspellings and increases the robot's query-hit rate. Our test bed is a bi-lingual, cross-cultural robot receptionist, Hala, deployed in the reception area of Carnegie Mellon University in Qatar. Hala accepts typed input queries in Arabic and English and speaks responses in both languages as she interacts with users. All input queries to Hala are logged. These logs allow the study of multi-lingual aspects, the influence of socio-cultural norms, and the nature of human-robot interaction within a multicultural, yet primarily ethnic Arab, setting. A recent statistical analysis has shown that 26.3% of Hala's queries are missed. A query is missed either because Hala does not have the required answer in her knowledge base or because the query is misspelled. We measured that 50% of the missed queries are due to misspellings. We designed, developed, and assessed a custom spellchecker based on an n-gram model, focusing on the English mode of Hala. We trained our system on our existing language corpus of valid input queries, which makes the spellchecker more specific to Hala. Finally, we adjusted the n in the n-gram model and evaluated the correctness of the spellchecker in the context of Hala. Our system makes use of Hunspell, an engine that uses algorithms based on n-gram similarity, rule- and dictionary-based pronunciation data, and morphological analysis. Misspelled words are passed through the Hunspell spellchecker, which outputs a list of possible words.
Using this list of words, we apply our n-gram model to find the word best suited to a particular context. The model calculates the conditional probability P(w|s) of a word w given the previous sequence of words s, that is, it predicts the next word based on the preceding n-1 words. To assess the effectiveness of our system, we evaluate it on 5 different cases of misspelled-word location. The table below lists our results: "correct" indicates that the sentence was correctly spellchecked, and "incorrect" indicates that the sentence did not change after passing through the spellchecker, included transliterated Arabic, or was incorrectly spellchecked in a way that resulted in loss of semantics. Refer to table. We observed that context makes the spellchecking of a sentence more sensible, resulting in a higher hit rate in Hala's knowledge base. For case 5, the hit rate is lower despite there being more context than in the previous cases. This is because other sources of error were introduced, such as users writing in SMS language or a mixture of English and Arabic. In future work, we would like to tackle the above-mentioned problems and also work on a part-of-speech tagging system that would help in correcting real-word mistakes.
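The candidate-ranking step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the toy corpus, the hard-coded candidate list (standing in for Hunspell's suggestions), and the add-one smoothing are all assumptions made for illustration. It estimates P(w|s) from n-gram counts over a corpus of valid queries and picks the candidate with the highest probability given the preceding n-1 words.

```python
from collections import Counter

def train_ngrams(sentences, n=2):
    """Count n-grams and their (n-1)-word contexts in whitespace-tokenized sentences."""
    ngram_counts, context_counts, vocab = Counter(), Counter(), set()
    for sentence in sentences:
        # Pad the start so the first word also has a full context.
        tokens = ["<s>"] * (n - 1) + sentence.lower().split()
        vocab.update(tokens)
        for i in range(n - 1, len(tokens)):
            context = tuple(tokens[i - n + 1:i])
            ngram_counts[context + (tokens[i],)] += 1
            context_counts[context] += 1
    return ngram_counts, context_counts, vocab

def probability(word, context, model):
    """Estimate P(word | context) with add-one smoothing (an assumed choice)."""
    ngram_counts, context_counts, vocab = model
    context = tuple(context)
    return (ngram_counts[context + (word,)] + 1) / (context_counts[context] + len(vocab))

def best_candidate(candidates, preceding_words, model, n=2):
    """Return the candidate most probable given the preceding n-1 words."""
    padded = ["<s>"] * (n - 1) + [w.lower() for w in preceding_words]
    context = padded[len(padded) - (n - 1):]
    return max(candidates, key=lambda w: probability(w.lower(), context, model))

# Hypothetical corpus of valid queries; the real system is trained on Hala's logs.
corpus = [
    "where is the library",
    "where is the dean office",
    "what is the weather today",
    "is the library open",
]
model = train_ngrams(corpus, n=2)

# Hypothetical suggestion list for the misspelling "libary" in "where is the libary".
suggestions = ["liberty", "library", "binary"]
print(best_candidate(suggestions, ["where", "is", "the"], model, n=2))
```

Changing `n` in both `train_ngrams` and `best_candidate` corresponds to adjusting how much preceding context is used when ranking candidates.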

/content/papers/10.5339/qfarf.2013.ICTSP-06
2013-11-20
2024-12-23