Towards Open-Domain Cross-Language Question Answering
- Publisher: Hamad bin Khalifa University Press (HBKU Press)
- Source: Qatar Foundation Annual Research Conference Proceedings, Volume 2018, Issue 3, Mar 2018, ICTPD881
Abstract
We present MATQAM (Multilingual Answer Triggering Question Answering Machine), a multilingual, answer-triggering, open-domain QA system focused on answering questions whose answers may appear in free text in multiple languages within Wikipedia.

Obtaining relevant information from the Web has become more challenging, since online communities and social media tend to confine people to bounded trends and ways of thinking, and the sheer amount of data available makes finding the relevant information harder still. Unlike standard Information Retrieval (IR), Question Answering (QA) systems aim at retrieving the relevant answer(s) to a question expressed in natural language, instead of returning a list of documents. On the one hand, information is dispersed across different languages and needs to be gathered to obtain fuller knowledge. On the other hand, extracting answers from multilingual documents is a complicated task, because natural languages follow diverse syntactic rules, especially Semitic languages such as Arabic.

This project tackles open-domain QA using Wikipedia as the source of knowledge by building a multilingual (Arabic, French, English) QA system. To obtain a collection of Wikipedia articles as well as questions in multiple languages, we extended an existing English dataset, WikiQA (Yang et al., 2015). We used the WikiTailor toolkit (Barrón-Cedeño et al., 2015) to build a comparable corpus from Wikipedia articles and to extract the corresponding articles in Arabic, French, and English, and we used neural machine translation to generate the questions in the three languages as well.

Our QA system consists of the following three modules. (i) Question processing transforms a natural language question into a query and determines the expected answer type, which defines the retrieval mechanism for the extraction step. (ii) Document retrieval retrieves the most relevant documents, in multiple languages, from the search engines given the produced query. The purpose of this module is to identify the documents that may contain an answer to the question. It requires cross-language representations as well as machine translation technology, since the question can be asked in Arabic, French, or English and the answer can be in any of these languages. (iii) Answer identification ranks specific text fragments that are plausible answers to the question. It first ranks the candidate text fragments in the different languages and, if answers are found, combines them into one consolidated answer. This is a variation of the cross-language QA scenario that enables answer triggering: no answer has to be provided if none exists. To build our QA system, we extend an existing framework (Rücklé and Gurevych, 2017) that integrates neural networks for answer selection.
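To make the dataset-construction step concrete, the following is a minimal sketch of how one English WikiQA question/article pair could be extended to Arabic and French. The helpers `comparable_articles` and `translate` are hypothetical stand-ins for the WikiTailor toolkit and the neural machine translation system, whose real interfaces differ; the article-title mapping is illustrative only.

```python
"""Minimal sketch (not the authors' code): extend an English WikiQA pair to ar/fr."""

from dataclasses import dataclass, field


@dataclass
class MultilingualExample:
    questions: dict                                # language code -> question text
    articles: dict                                 # language code -> Wikipedia article title
    answers: dict = field(default_factory=dict)    # language code -> gold sentences (may be empty)


def comparable_articles(en_title: str) -> dict:
    """Hypothetical stand-in for comparable-corpus extraction:
    map an English article title to its Arabic and French counterparts."""
    lookup = {"Question answering": {"ar": "<Arabic article title>", "fr": "<French article title>"}}
    return {"en": en_title, **lookup.get(en_title, {})}


def translate(text: str, target_lang: str) -> str:
    """Hypothetical stand-in for a neural machine translation call."""
    return f"[{target_lang}] {text}"


def extend_example(en_question: str, en_title: str) -> MultilingualExample:
    """Build one trilingual dataset entry from an English WikiQA pair."""
    articles = comparable_articles(en_title)
    questions = {"en": en_question}
    for lang in ("ar", "fr"):
        if lang in articles:                       # translate only when a counterpart article exists
            questions[lang] = translate(en_question, lang)
    return MultilingualExample(questions=questions, articles=articles)


if __name__ == "__main__":
    print(extend_example("What is question answering?", "Question answering"))
```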
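Likewise, here is a highly simplified sketch of the three-module pipeline and the answer-triggering decision. The keyword-overlap retrieval, the toy sentence scorer, and the 0.4 threshold are illustrative assumptions, not the system's actual components, which build on the neural answer-selection framework of Rücklé and Gurevych (2017).

```python
"""Simplified sketch of the three-module cross-language QA pipeline with answer triggering."""

from dataclasses import dataclass


@dataclass
class Candidate:
    text: str
    lang: str
    score: float


def process_question(question: str) -> dict:
    """(i) Question processing: turn the question into a keyword query
    and a coarse expected answer type."""
    tokens = [t.lower().strip("?") for t in question.split()]
    answer_type = "date" if tokens and tokens[0] == "when" else "entity"
    return {"query": tokens, "answer_type": answer_type}


def retrieve(query: dict, corpus: dict) -> list:
    """(ii) Document retrieval across languages: naive keyword overlap as a
    stand-in for cross-language search over Arabic/French/English indices."""
    hits = []
    for lang, docs in corpus.items():
        for doc in docs:
            if any(t in doc.lower() for t in query["query"]):
                hits.append((lang, doc))
    return hits


def identify_answer(query: dict, hits: list, threshold: float = 0.4):
    """(iii) Answer identification with answer triggering: score sentence
    candidates from all languages and return None if no candidate clears
    the threshold (i.e. the question stays unanswered)."""
    candidates = []
    for lang, doc in hits:
        for sent in doc.split(". "):
            overlap = sum(1 for t in query["query"] if t in sent.lower())
            candidates.append(Candidate(sent, lang, overlap / max(len(query["query"]), 1)))
    best = max(candidates, key=lambda c: c.score, default=None)
    return best if best and best.score >= threshold else None


if __name__ == "__main__":
    corpus = {"en": ["Doha is the capital of Qatar. It lies on the Persian Gulf coast."],
              "fr": ["Doha est la capitale du Qatar."]}
    query = process_question("What is the capital of Qatar?")
    best = identify_answer(query, retrieve(query, corpus))
    print(best.text if best else "No answer triggered")
```

The triggering step is what distinguishes this from plain answer selection: when the best cross-language candidate scores below the threshold, the system returns no answer rather than forcing a guess.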
References

Alberto Barrón-Cedeño, Cristina España Bonet, Josu Boldoba Trapote, and Luís Márquez Villodre. A Factory of Comparable Corpora from Wikipedia. In Proceedings of the Eighth Workshop on Building and Using Comparable Corpora, pages 3–13, Beijing, China, 2015. Association for Computational Linguistics.

Andreas Rücklé and Iryna Gurevych. End-to-End Non-Factoid Question Answering with an Interactive Visualization of Neural Attention Weights. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics: System Demonstrations (ACL 2017), pages 19–24, Vancouver, Canada, August 2017. Association for Computational Linguistics. doi:10.18653/v1/P17-4004. URL http://aclweb.org/anthology/P17-4004.

Yi Yang, Wen-tau Yih, and Christopher Meek. WikiQA: A Challenge Dataset for Open-Domain Question Answering. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 2013–2018, Lisbon, Portugal, 2015.