-
oa Solving All-Pairs Suffix Prefix – Theory and Practice
- الناشر: Hamad bin Khalifa University Press (HBKU Press)
- المصدر: Qatar Foundation Annual Research Conference Proceedings, Qatar Foundation Annual Research Conference Proceedings Volume 2016 Issue 1, مارس ٢٠١٦, المجلد 2016, HBPP2066
ملخص
The overlap stage is one of the most time- and space-consuming steps in de novo genome assembly. The huge size of output of Next Generation Sequencing (NGS), represented by small segments of multiple copies of the original genome, creates a serious computational challenge. The target is to find overlaps between each pair of sequences (reads) in order to build an overlap-based graph which constitutes the input for the assembly stage in which longer sequences (contigs) are sent to output. Finding overlaps between each pair of the input reads can be performed by solving the all-pairs suffix prefix problem (APSP) for these sequences. Such a problem, if not addressed effectively by designing new algorithms, data structures, and parallel processing techniques, can create an obstacle to the target of making genome assembly possible even with limited resources. This work presents a survey for previously presented solutions for APSP. We demonstrate and classify these techniques. Showing recent results regarding the time and space consumption for these solutions, we draw a conclusion regarding the best direction to tackle this critical problem. All-pairs suffix prefix is a well-known computer science problem. For a group of sequences G = {S1, S2, S3, …. Sk), finding all pairs suffix prefix is to find the longest suffix prefix match for each ordered pair in G. For instance, let G = {AACCGT, TAAAC, ACCCTA}. The solution to all-pairs suffix prefix for G is: (1,2) = {T}, {1,3} = {}, (2,1) = {AAC}, (2,3) = {AC}, {3,1} = {A} and (3,2} = {TA}.
The first optimal solution for APSP was introduced by Gusfield et al. (D. Gusfield, 1992). The algorithm was based on the generalized suffix tree and takes O(n+k2) time, where n is the total length of all k strings. The suffix tree data structure is one of the most important tools for string matching and its applications in bioinformatics. A suffix tree of a sequence S is an index structure in which each suffix of S is stored as a path from the root to a leaf. Obviously many suffixes will share partial path before they end in different leaves.
Despite the fact that suffix tree's performance in solving many genome analysis problems is considered optimal, suffix tree is expensive in term of space. A pointer-type suffix tree typically requires O(n log n) space, whereas the original text over an alphabet of size requires only O(n log) bits. In addition, suffix tree has a bad locality of memory reference, which causes a remarkable slowdown in cached processor architectures.
A practically faster and less space-consuming solution for APSP was presented by Ohlebusch and Gog in 2010 (E. Ohlebusch and S. Gog, 2010), using the enhanced suffix array (M. Abouelhoda, 2004). Suffix array has emerged as a substitute to suffix tree in order to avoid its high space consumption, and to achieve better locality. The suffix array SA of a string S is an array of integers including the positions of the lexicographically sorted suffixes of S; i.e., for any two integers 0 i’ < i’’ < n, S[SA[i’]] is lexicographically less than S[SA(i’’)]. Several indices were introduced in the last decade as compressed versions of suffix trees and suffix arrays. Those indices were considered a miracle since they don't just offer indexed searching, but also extract any text substring from the original text. With these self-index data structures, the original text is unnecessary and can be discarded. The term self-index highlights the fact that text is not stored explicitly but it can be derived from the index. FM index and RLCSA are examples for compressed suffix array, while Sadakane tree is a good example for compressed suffix tree.
Simpson and Durbin (J.T. Simpson and R.Durbin, 2012) used the FM index (P. Ferragina, 2004) to solve APSP in an indirect way as follows. The index is constructed for all strings after concatenating them in one string. The index is then queried by the reads, one by one, to find prefix-suffix matches. The time complexity of this algorithm is not as optimal as the one of (D. Gusfield, 1992), because one examines more suffixes than the output size. (This limitation stems also from the fact that the FM index lacks structural information to run the algorithms of (D. Gusfield, 1992) or (E. Ohlebusch and S. Gog, 2010) on it.). However, its space consumption is much less than that of the previous algorithms. The resulting assembler was called SGA.
Rachid et al. (M. H. Rachid Q. M., 2014) presented two solutions to solve APSP using a Sadakane suffix tree. The Sadakane suffix tree is a self-index and fully functional in a similar way to the uncompressed version of suffix tree. It offers the typical suffix tree operations such as checking if a node is a leaf, moving to the next sibling, using a suffix link, or even performing lowest common ancestor queries which can be expensive in term of time with other compressed data structure such as FM. Rachid et al. (M. H. Rachid Q. M., 2014) took advantage of the different components of this tree to solve APSP. The authors also utilized these components in building two different parallel implementations of these solutions. Rachid et al. (M. Haj. Rachid, 2014) also presented a solution to APSP using RLCSA. RLCSA is a compressed suffix array, which is introduced in (J. Sirén, 2008), (J.Sirén, 2010), and (V. Mäkinen, 2009). Both solution achieved considerable space reduction, but with a big slowdown.
Readjoiner (G. Gonnella & S. kurtz, 2012) was recently proposed as an efficient genome assembler that, in the overlap stage, finds suffix-prefix matches with a minimal length ℓ by grouping all relevant suffixes in buckets. Each bucket is identified by a common prefix for all suffixes inside it. Then, after sorting suffixes inside each bucket, it finds suffix-prefix matches using the lcp-intervals concept which is introduced in (M.I. Abouelhoda, 1994). The overlap stage in Readjoiner achieved best time and space consumption over all earlier solutions. Rachid and Malluhi (M. H. Rachid Q. M., 2015) presented a solution, called SOF, for APSP using a compact prefix tree. A prefix tree is a tree in which every read is represented by one path from the root to a leaf. The internal nodes represent common prefixes between reads. SOF could not achieve better results than Readjoiner using a single thread; however it shows better scalability and better time and space consumption in a multithreading environment. Figures 1 and 2 show the time and space consumptions for 7 different solutions to APSP when running on randomly generated data (M. H. Rachid Q. M., 2015).
Figure 1. Time consumptions for 7 solutions running on randomly generated data
Figure 2. Space consumptions for 7 solutions running on randomly generated data
We conclude that using suffix tree/array-based solutions, with standard or compressed forms, is not the best direction to solve APSP, despite the fact that they may possess an optimal time complexity. It turns out that other techniques which are based on LCP interval can achieve practically superior results in term of time without the large space which a suffix tree/array requires.
Bibliography
D. Gusfield, G. L. (1992). An efficient algorithm for the all pairs suffix-prefix problem. Inf. Process. Lett, 41(4):181–185.
E. Ohlebusch and S. Gog. (2010). Efficient algorithms for the all-pairs suffix-prefix problem and the all-pairs substring-prefix problem. Inf. Process. Lett, 110(3):123–128.
G. Gonnella, & S. kurtz. (2012). Readjoiner: a fast and memory efficient string graph-based sequence assembler. BMC Bioinformatics, 13:82.
J. Sirén, N. V. (2008). Run-Length Compressed Indexes Are Superior for Highly Repetitive Sequence Collections. in Amihood Amir; Andrew Turpin & Alistair Moffat, pp. 164–175.
J. Sirén. (2010). Sampled Longest Common Prefix Array. CoRR abs.
J. T. Simpson and R. Durbin. (2012). Efficient de novo assembly of large genomes using compressed data structures. Genome research, 22(3):549–556.
M. Abouelhoda, S. K. (2004). Replacing suffix trees with enhanced suffix arrays. Journal of Discrete Algorithms, 53–86.
M. H. Rachid, Q. M. (2014). Using the sadakane compressed suffix tree to solve the all-pairs
suffix-prefix problem. BioMed research international.
M. H. Rachid, Q. M. (2015). A practical and scalable tool to find overlaps between sequences. BioMed Research International, 12.
M. Haj. Rachid, Q. M. (2014). A space-efficient solution to find the maximum overlap using a compressed suffix array. Biomedical Engineering (MECBME), 329–333.
M. I. Abouelhoda, S. K. (1994). Replacing Suffix Trees with Enhanced Suffix Arrays. Journal of Discrete Algorithms, 2, 53–86.
P. Ferragina, G. M. (2004). An alphabet-friendly fm-index. In SPIRE, 150–160.
V. Mäkinen, G. N. (2009). Storage and Retrieval of Individual Genomes. in Serafim Batzoglou, ed., ‘RECOMB’, Springer., pp. 121–137.