Efficient Sequence Alignment Using MapReduce on the Cloud
- Publisher: Hamad bin Khalifa University Press (HBKU Press)
- Source: Qatar Foundation Annual Research Forum Proceedings, Qatar Foundation Annual Research Forum Volume 2011 Issue 1, November 2011, Volume 2011, CSOS1
Abstract
Over the past few years, advances in molecular biology and genomic technologies have led to explosive growth in digital biological information. Analysis of this large amount of data commonly relies on the extensive and repeated use of conceptually parallel algorithms, most notably in sequence alignment. Cloud computing offers scientists a new model for using computing infrastructure. The cloud computing model is well suited to such bioinformatics applications, which require both the management of very large data sets and heavy computation.
This study aims to port a recently developed bioinformatics sequence alignment tool, BFAST, to the cloud environment. The MapReduce version of BFAST will be used to demonstrate the effectiveness of the MapReduce framework and the cloud computing model in handling the intensive computation and data management that large bioinformatics data sets demand.
A number of existing tools and technologies are used in this study to port BFAST efficiently to the cloud environment. The implementation is based mainly on two core components: BFAST and MapReduce. BFAST is a software package for aligning next-generation genomic reads against a target genome with very high accuracy and reasonable speed. MapReduce, a general-purpose parallelization technology (used here in its open-source implementation, Hadoop), appears to be particularly well adapted to the intensive computation and large-scale data storage tasks involved in BFAST sequence alignment.
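The way read alignment decomposes into map and reduce steps can be illustrated with a minimal sketch. It is not BFAST's algorithm and does not use the Hadoop API: it is an in-process Python simulation that assumes a toy reference string, scores placements by Hamming distance, and uses hypothetical names (`REFERENCE`, `map_read`, `reduce_hits`, `run_job`) chosen only for illustration.

```python
from collections import defaultdict

# Toy reference sequence (hypothetical; a real genome would be far larger).
REFERENCE = "ACGTACGTTAGCCGATAGGCTA"


def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def map_read(read_id, read, reference=REFERENCE):
    """Map step: emit (read_id, (position, mismatches)) for each placement.

    Each read is processed independently, which is what makes the
    problem easy to distribute across workers.
    """
    for pos in range(len(reference) - len(read) + 1):
        window = reference[pos:pos + len(read)]
        yield read_id, (pos, hamming(read, window))


def reduce_hits(read_id, hits):
    """Reduce step: keep the placement with the fewest mismatches.

    Ties are broken in favour of the lowest position.
    """
    pos, mismatches = min(hits, key=lambda h: (h[1], h[0]))
    return read_id, pos, mismatches


def run_job(reads):
    """Simulate map, shuffle (group by key), and reduce in one process."""
    grouped = defaultdict(list)
    for read_id, read in reads.items():
        for key, value in map_read(read_id, read):
            grouped[key].append(value)
    return [reduce_hits(key, values) for key, values in sorted(grouped.items())]


if __name__ == "__main__":
    reads = {"r1": "TAGCCG", "r2": "GGCTA"}
    for read_id, pos, mismatches in run_job(reads):
        print(read_id, pos, mismatches)
```

In a real deployment the grouping step would be performed by the framework's shuffle phase, and the exhaustive scan in the map step would be replaced by an indexed aligner; the sketch only shows why per-read independence makes the workload a natural fit for MapReduce.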
The MapReduce version of BFAST is expected to compare favorably with the original in computational efficiency, accuracy, and scalability, and in deployment and management effort.
The study demonstrates how a general-purpose parallelization technology, i.e., MapReduce running on the cloud, can be tailored to tackle this class of bioinformatics problems with good performance and scalability and, more importantly, how this technology could form the basis of a parallel computing platform for several problems in bioinformatics. Although moving existing bioinformatics algorithms from local compute infrastructure to the cloud is not trivial, the speed and flexibility of cloud computing environments provide a substantial boost at manageable cost.