JOURNAL OF COMPUTATIONAL BIOLOGY
Volume 12, Number 8, 2005
Mary Ann Liebert, Inc.
Pp. 1103–1116

Genomic Classification Using an Information-Based Similarity Index: Application to the SARS Coronavirus

ALBERT C.-C. YANG, ARY L. GOLDBERGER, and C.-K. PENG

ABSTRACT

Measures of genetic distance based on alignment methods are confined to studying sequences that are conserved and identifiable in all organisms under study. A number of alignment-free techniques based on either statistical linguistics or information theory have been developed to overcome the limitations of alignment methods. We present a novel alignment-free approach to measuring the similarity among genetic sequences that incorporates elements from both word rank order-frequency statistics and information theory. We first validate this method on the human influenza A viral genomes as well as on the human mitochondrial DNA database. We then apply the method to study the origin of the SARS coronavirus. We find that the majority of the SARS genome is most closely related to group 1 coronaviruses, with smaller regions of matches to sequences from groups 2 and 3. The information-based similarity index provides a new tool to measure the similarity between datasets based on their information content and may have a wide range of applications in the large-scale analysis of genomic databases.

Key words: Shannon entropy, SARS coronavirus.

INTRODUCTION

Genetic distance measures are indicators of similarity among species or populations and are useful for reconstructing phylogenetic relationships (Graur and Li, 1999). Measures of genetic distance are mainly derived from examining each pair of sequences aligned nucleotide-by-nucleotide and estimating the number of substitutions.
Since the mechanism of genome evolution relies not only on point mutations but also on recombination or horizontal gene transfer from other species, the heterogeneity of gene segments will substantially degrade the accuracy of optimal sequence alignment methods, which are based on the estimation of nucleotide substitution. Therefore, alignment methods are confined to studying sequences that are conserved and identifiable in all organisms under study (Vinga and Almeida, 2003).

Cardiovascular Division and Margret and H.A. Rey Institute for Nonlinear Dynamics in Medicine, Beth Israel Deaconess Medical Center/Harvard Medical School, Boston, Massachusetts 02215.

An alternative approach is to develop alignment-free sequence comparison methods. Current alignment-free sequence comparison methods can be classified into two categories (Vinga and Almeida, 2003): information theory-based (Li et al., 2001) and word statistics-based measures (Campbell et al., 1999; Chaudhuri and Das, 2002; Hao et al., 2003; Karlin and Burge, 1995; Qi et al., 2004; Stuart et al., 2002). We have developed a new index adapted from linguistic analysis and information theory to measure the similarity between symbolic sequences (Yang et al., 2003a, 2003b). Our approach is based on the concept that the information content in any symbolic sequence is primarily determined by the repetitive usage of its basic elements. The novelty of this information-based similarity index is that it incorporates elements of both the information-based and the word statistics-based categories: the rank order difference of each n-tuple (word statistics) is weighted by its information content using Shannon entropy (information theory) (Shannon, 1948). Furthermore, the composition of these basic elements captures both global information related to usage of repetitive elements in genetic sequences and local sequence order determined by the n-tuple nucleotides.
Hence, our method provides a complementary approach to overcoming limitations of alignment methods and is capable of exploring genetic sequences with heterogeneic origins. The resulting measurement has been validated with respect to generic information-carrying symbolic sequences (Yang et al., 2003a, 2003b). Here we show the specific application of this method to genomic sequences.

METHODS

We have recently developed and validated a generic information-based similarity index to quantify the similarity between symbolic sequences. This method, which has been used for analysis of complex physiologic signals (Yang et al., 2003a) and literary texts (Yang et al., 2003b), can be readily adapted to genetic sequences by examining usages of n-tuple nucleotides (“words”). We first determine the frequencies for each n-tuple by applying a sliding window (moving one nucleotide/step) across the entire genome, and then rank each n-tuple according to its frequency in descending order. To compare the similarity between genetic sequences, we plot the rank number of each n-tuple in the first sequence against that of the second sequence. Figure 1 shows the comparison of 4-tuple nucleotide frequencies between the complete mitochondrial genome of two human lineages an
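
The counting and ranking steps described in the Methods can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `ntuple_ranks` and `rank_pairs` are hypothetical, ties in frequency are broken alphabetically (the text does not specify a tie rule), and windows containing symbols other than A, C, G, and T are simply ignored.

```python
from collections import Counter
from itertools import product


def ntuple_ranks(sequence, n=4, alphabet="ACGT"):
    """Rank every possible n-tuple by descending frequency (rank 1 = most frequent)."""
    seq = sequence.upper()
    # Sliding window that advances one nucleotide per step.
    counts = Counter(seq[i:i + n] for i in range(len(seq) - n + 1))
    # Enumerate all len(alphabet)**n words so that any two sequences share
    # the same word list; words that never occur get a count of zero.
    words = ["".join(p) for p in product(alphabet, repeat=n)]
    # Assumption: equal frequencies are ordered alphabetically.
    ordered = sorted(words, key=lambda w: (-counts[w], w))
    return {w: rank for rank, w in enumerate(ordered, start=1)}


def rank_pairs(seq_a, seq_b, n=4):
    """Return (word, rank in seq_a, rank in seq_b) triples for a rank-rank plot."""
    ranks_a = ntuple_ranks(seq_a, n)
    ranks_b = ntuple_ranks(seq_b, n)
    return [(w, ranks_a[w], ranks_b[w]) for w in sorted(ranks_a)]
```

Plotting the second element of each triple against the third gives the kind of rank-versus-rank comparison the text describes; words that lie far from the diagonal are used very differently in the two sequences. The entropy weighting of the rank differences is not included here because its formula is not given in this excerpt.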