Introduction to Bioinformatics, English-language paper: Practical Suffix Tree Construction

Practical Suffix Tree Construction

Sandeep Tata, Richard A. Hankins, Jignesh M. Patel
University of Michigan

Abstract

Large string datasets are common in a number of emerging text and biological database applications. Common queries over such datasets include both exact and approximate string matches. These queries can be evaluated very efficiently by using a suffix tree index on the string dataset. Although suffix trees can be constructed quickly in memory for small input datasets, constructing persistent trees for large datasets has been challenging.

In this paper, we explore suffix tree construction algorithms over a wide spectrum of data sources and sizes. First, we show that on modern processors, a cache-efficient algorithm with O(n^2) complexity outperforms the popular O(n) Ukkonen algorithm, even for in-memory construction. For larger datasets, the disk I/O requirement quickly becomes the bottleneck in each algorithm's performance. To address this problem, we present a buffer management strategy for the O(n^2) algorithm, creating a new disk-based construction algorithm that scales to sizes much larger than have been previously described in the literature. Our approach far outperforms the best known disk-based construction algorithms.

1 Introduction

Querying large string datasets is becoming increasingly important in a number of emerging text and life sciences applications.
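As a rough illustration of the suffix tree index and the quadratic-time style of construction mentioned in the abstract, the following Python sketch builds an uncompressed suffix trie by inserting every suffix from the root, then answers an exact substring query by walking one node per query character. It is not the algorithm developed in this paper: a real suffix tree compresses single-child paths and uses far less space, and the names build_suffix_trie and contains are hypothetical, chosen only for this example.

```python
def build_suffix_trie(text):
    """Insert every suffix of `text` into a nested-dict trie.

    Illustrative assumption: an uncompressed trie, so construction takes
    O(n^2) time and space. No suffix links are used.
    """
    root = {}
    for start in range(len(text)):
        node = root
        for ch in text[start:]:
            # Reuse the child for `ch` if it exists, otherwise create it.
            node = node.setdefault(ch, {})
    return root


def contains(trie, query):
    """Return True if `query` is a substring of the indexed text.

    The walk follows one child per query character, so its cost depends
    on len(query) and not on the length of the indexed text.
    """
    node = trie
    for ch in query:
        if ch not in node:
            return False
        node = node[ch]
    return True


trie = build_suffix_trie("GATTACA")
print(contains(trie, "TTAC"))  # True
print(contains(trie, "TAG"))   # False
```

The construction loop touches O(n^2) characters in total, while the lookup cost depends only on the query length, which is the property that makes such an index attractive for large string datasets.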
Life science researchers are often interested in explorative querying of large biological sequence databases, such as genomes and large sets of protein sequences. Many of these biological datasets are growing at exponential rates; for example, the sizes of the sequence datasets in GenBank have been doubling every sixteen months [31]. Consequently, methods for efficiently querying large string datasets are critical to the success of these emerging database applications.

Permission to copy without fee all or part of this material is granted provided that the copies are not made or distributed for direct commercial advantage, the VLDB copyright notice and the title of the publication and its date appear, and notice is given that copying is by permission of the Very Large Data Base Endowment. To copy otherwise, or to republish, requires a fee and/or special permission from the Endowment.
Proceedings of the 30th VLDB Conference, Toronto, Canada, 2004

Suffix trees are versatile data structures that can help execute such queries very efficiently. In fact, suffix trees are useful for solving a wide variety of string-based problems [17]. For instance, the exact substring matching problem can be solved in time proportional to the length of the query, once the suffix tree is built on the database string. Suffix trees can also be used to solve approximate string matching problems efficiently. Some bioinformatics applications such as MUMmer [10, 11, 22], REPuter [23], and OASIS [25] exploit suffix trees to efficiently evaluate queries on biological sequence datasets. However, suffix trees are not widely used because of their high cost of construction. As we show in this paper, building a suffix tree on moderately sized datasets, such as a single chromosome of the human genome, takes over 1.5 hours with the best known existing disk-based construction technique [18].
In contrast, the techniques that we develop in this paper reduce the construction time by a factor of 5 on inputs of the same size.

Even though suffix trees are currently not in widespread use, there is a rich history of algorithms for constructing suffix trees. A large focus of previous research has been on linear-time suffix tree construction algorithms [24, 32, 33]. These algorithms are well suited for small input strings where the tree can be constructed entirely in main memory. The growing size of input datasets, however, requires that we construct suffix trees efficiently on disk. The algorithms proposed in [24, 32, 33] cannot be used for disk-based construction as they have poor locality of reference. This poor locality causes a large amount of random disk I/O once the data structures no longer fit in main memory. If we naively use these main-memory algorithms for on-disk suffix tree construction, the process may take well over a day for a single human chromosome.

The large (and rapidly growing) size of many string datasets underscores the need for fast disk-based suffix tree construction algorithms. A few recent research efforts have also considered this problem [4, 18], though neither of these approaches scales well for large datasets (such as a large chromosome, or an entire eukaryotic genome).

In this paper, we present a new approach to efficiently construct suffix trees on disk. We use a philosophy similar to the one in [18]. We forgo the use of suffix links in return for a much better memory reference pattern, which translates to better scalability and performance for