Using AdaBoostSVM to Predict the GPCR Functional Classes

Wang-Ren Qiu, School of Information Engineering, Jingdezhen Ceramic Institute, Jingdezhen, China, qiuone@163.com
Xuan Xiao, School of Computer Engineering, Jingdezhen Ceramic Institute, Jingdezhen, China, xiaoxuan0326@yahoo.com.cn
Zhen-Yu Zhang, College of Science, Dalian Jiaotong University, Dalian, China

Abstract: AdaBoost incorporating properly designed RBF-kernel SVMs is a popular boosting method and demonstrates better generalization performance than a single SVM on imbalanced classification problems. This paper discusses the application of the AdaBoostSVM algorithm to the prediction of G-protein-coupled receptor (GPCR) classes, in which the pseudo amino acid composition is derived by combining the cellular automaton image and the Ziv-Lempel complexity. In the experiments of classifying GPCRs from non-GPCRs and of classifying GPCRs into their six main families, the jackknife-test overall accuracy rates are 96.8% and 89.04%, respectively. The experimental results suggest that AdaBoostSVM holds potential to be a useful algorithm for understanding the functions of GPCRs and other proteins.

Keywords: cellular automaton image; pseudo amino acid composition; AdaBoostSVM; G-protein-coupled receptor

I. INTRODUCTION

G-protein-coupled receptors (GPCRs) transmit extracellular signals into the intracellular space. Playing a key role in cellular signaling networks and being the largest family of cell-surface receptors, GPCRs are potential drug targets; at present, more than 50% of drugs available on the market act through GPCRs. GPCRs are therefore central to a broad range of physiological events, including heart rate regulation, proliferation, neurotransmission, and hormone secretion [1]. Much effort has been invested in GPCR study by both academic institutions and pharmaceutical industries. For example, the covariant discriminant algorithm was introduced to identify 566 GPCRs [2]. Later, a similar approach was used to study 1,238 GPCRs classified into three main families [3]. Stimulated by these encouraging results, several follow-up studies were conducted [4, 5], and a cellular automaton image approach for predicting GPCR functional classes was developed by Xiao et al. [6].

Boosting is a popular machine learning technique and has been applied to many different problems, such as customer churn prediction [7], protein fold recognition [8], analog circuit fault diagnosis [9], face verification [10], target tracking [11], retaining-wall method selection in construction [12], and synthetic aperture radar automatic target recognition [13]. Li Xuchun et al. [14] developed the AdaBoostSVM algorithm, in which properly designed SVM classifiers with RBF kernels are used as the component classifiers of the AdaBoost scheme. The success of boosting can be attributed to its ability to enlarge the margin, which enhances the generalization capability of AdaBoost. However, although AdaBoostSVM offers a promising route to GPCR classification, no study has yet applied it to this problem.

This paper provides a comprehensive study of applying the AdaBoostSVM algorithm to GPCR prediction and comprises five sections: Section 2 introduces boosting and the AdaBoostSVM algorithm; Section 3 briefly introduces the materials and methods used for GPCR prediction; experimental results and discussion are provided in Section 4; finally, Section 5 concludes the paper.
II. BOOSTING AND ADABOOSTSVM

A. Boosting

The basic idea of boosting is to repeatedly apply a weak learner to modified versions of the training data, thereby producing a sequence of weak classifiers over a predefined number of iterations. In simple terms, boosting algorithms combine weak learning models that are only slightly better than random guessing; the boosting meta-algorithm is an efficient, simple, and easy-to-use machine learning technique. The boosting algorithm can be summarized as follows:

Step 1: Initialize all data points with uniform weights.
Step 2: Compute the error ε of the weak model.
Step 3: If ε equals zero, or ε is no less than 0.5, terminate the model generation. Otherwise, for each instance that is classified correctly by the model, multiply its weight by β = ε/(1 − ε) < 1, so that correctly classified instances are assigned lower weights while incorrectly classified instances end up with relatively higher weights.
Step 4: Normalize the weights of all instances.
Step 5: For each of the t models, add log(1/β) to the weight of the class predicted by that model.

The final model obtained by the boosting algorithm is a linear combination of the weak learning models, each weighted by its own performance. With the boosting method an algorithm becomes stronger, because boosting aims to promote a learning algorithm and improve its learning accuracy. One objective of boosting is that, in each training round, the basic classifier is trained on the sample data that carry the richest information, and the final learning result depends on the outcome of every basic classifier.

B. AdaBoostSVM

The Support Vector Machine (SVM) is a popular classification algorithm based on statistical learning theory developed by Vapnik and his colleagues. It creates a decision boundary by mapping input vectors to a high-dimensional space where a maximum-margin linear hyperplane separates the different classes in the training data. SVM is developed from the theory of Structural Risk Minimization. The generalization performance of an SVM is mainly affected by the kernel parameters and the regularization parameter; they have to be set beforehand, and any variation of either of them changes the classification performance. For more details, interested readers may refer to Vapnik's literature [15, 16].

The AdaBoost method calls a given basic learning algorithm repeatedly in a series of rounds. One main idea of AdaBoost is to maintain a distribution, or a set of weights, over the given training examples. AdaBoost and SVM are essentially the same except for the way they measure the margin and the way they optimize their weight vectors; both focus on the most informative training examples, which gives them good learning performance. Li et al. proposed the AdaBoostSVM algorithm and demonstrated its effectiveness on imbalanced classification problems. To deal with multi-category problems, the one-versus-one scheme is used in this paper to combine multiple binary classifiers (a minimal sketch of the boosting procedure is given below).
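For illustration only, the following sketch shows how the five boosting steps above can be combined with RBF-kernel SVM component classifiers, in the spirit of AdaBoostSVM. It is not the authors' implementation: the data X and y, the number of rounds, and the fixed σ value are placeholder assumptions, and the per-round adjustment of σ used by Li et al. is only indicated in a comment. The scikit-learn classes used (SVC with sample_weight) are standard.

```python
# A minimal sketch of AdaBoost.M1 with RBF-kernel SVMs as component
# classifiers, following Steps 1-5 of Section II.A.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 25))            # placeholder for 25-D PseAA features
y = rng.integers(0, 2, size=200)          # placeholder labels (GPCR / non-GPCR)

n_rounds = 10
sigma = 5.0                               # RBF width; AdaBoostSVM adjusts this per round
weights = np.full(len(y), 1.0 / len(y))   # Step 1: uniform instance weights

learners, alphas = [], []
for t in range(n_rounds):
    clf = SVC(kernel="rbf", gamma=1.0 / (2 * sigma ** 2), C=1.0)
    clf.fit(X, y, sample_weight=weights)
    pred = clf.predict(X)
    err = weights[pred != y].sum()        # Step 2: weighted training error
    if err == 0 or err >= 0.5:            # Step 3: stop (AdaBoostSVM would enlarge sigma and retry)
        break
    beta = err / (1.0 - err)
    weights[pred == y] *= beta            # down-weight correctly classified instances
    weights /= weights.sum()              # Step 4: renormalize
    learners.append(clf)
    alphas.append(np.log(1.0 / beta))     # Step 5: model weight in the final vote

def predict(x_new):
    """Weighted vote of the component SVMs (the final boosted model)."""
    votes = {}
    for clf, a in zip(learners, alphas):
        c = clf.predict(x_new.reshape(1, -1))[0]
        votes[c] = votes.get(c, 0.0) + a
    return max(votes, key=votes.get)
```

For the multi-class experiments in this paper, such a binary booster would be wrapped in a one-versus-one scheme, training one boosted classifier per pair of GPCR families.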
III. MATERIALS AND METHODS

A. Data sets

As a demonstration, we use the same dataset as [6]. The data contain 365 GPCRs, of which 232 belong to the rhodopsin-like family (RL), 44 to the metabotropic/glutamate/pheromone family (MGP), 39 to the secretin-like family (SL), 23 to the fungal pheromone family (FP), 10 to the cAMP receptor family, and 17 to the frizzled/smoothened family (FSF), together with 365 non-GPCRs.

To guarantee quality, the data were screened strictly according to the following criteria. First, all incomplete sequences were removed. Second, to avoid homology bias, a redundancy cutoff was imposed with the program CD-HIT [17] to winnow the sequences that have more than 40% pairwise sequence identity with any other sequence in the same subset, except for the fifth class, because it contains only 10 GPCR proteins.

B. Methods

A feasible approach to address the loss of sequence-order information is to use the pseudo amino acid (PseAA) composition to represent a protein sample [18-21]. The PseAA composition was originally proposed by Chou for predicting protein subcellular localization and membrane protein types [22], and the amphiphilic PseAA composition was later proposed for predicting enzyme functional classes. According to its definition, the PseAA composition of a given protein sample is expressed by a set of 20 + λ discrete numbers, where the first 20 represent the 20 components of the classical amino acid composition, while the additional λ numbers incorporate some of its sequence-order information through various coupling modes. Ever since the concept of the PseAA composition was introduced, many PseAA variants have been developed to deal with different protein problems. In this study, a novel approach combining the cellular automaton (CA) image [23, 24] and the Ziv-Lempel complexity [25-27] is introduced to derive the PseAA components. CA images can reveal many important features of proteins and have been applied to predict the effect of HBV gene missense mutations on the replication ratio [28] and to predict protein subcellular localization [29].

C. Cellular Automaton Image and Ziv-Lempel Complexity

A protein sequence is formed from the 20 native amino acids, whose single-character codes are A, C, D, E, F, G, H, I, K, L, M, N, P, Q, R, S, T, V, W, and Y. It is very difficult to find a characteristic pattern in such a sequence, particularly when the sequence is very long. To cope with this situation, we resort to the Ziv-Lempel complexity derived from the amino acid sequence by means of the cellular automaton image [23, 24]. As a first step, each of the 20 native amino acids is coded in binary according to Table I, which reflects the chemical and physical properties of an amino acid as well as its structure and degeneracy. Through this encoding procedure, a protein sequence is transformed into a series of digital signals; for example, the sequence MASA is transformed into 10011110010100111001 (a short encoding sketch is given below).

TABLE I. THREE DIFFERENT TYPES OF CODES FOR THE AMINO ACIDS

Character: P      L      Q      H      R      S      F
Decimal:   1      3      4      5      6      9      11
Binary:    00001  00011  00100  00101  00110  01001  01011

Character: T      I      M      K      N      A      V
Decimal:   16     18     19     20     21     25     26
Binary:    10000  10010  10011  10100  10101  11001  11010

Character: Y      W      C      D      E      G
Decimal:   12     14     15     28     29     30
Binary:    01100  01110  01111  11100  11101  11110
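As an illustration of this encoding step, the short sketch below maps a sequence to its 5N-digit binary string using the codes of Table I; the function name is ours, and the assertion simply reproduces the MASA example from the text.

```python
# Binary encoding of a protein sequence according to Table I.
# Each residue maps to a 5-bit code, so a protein of N residues
# becomes a string of 5N binary digits.

AA_CODE = {
    'P': '00001', 'L': '00011', 'Q': '00100', 'H': '00101', 'R': '00110',
    'S': '01001', 'F': '01011', 'T': '10000', 'I': '10010', 'M': '10011',
    'K': '10100', 'N': '10101', 'A': '11001', 'V': '11010', 'Y': '01100',
    'W': '01110', 'C': '01111', 'D': '11100', 'E': '11101', 'G': '11110',
}

def encode_protein(sequence: str) -> str:
    """Concatenate the 5-bit Table I code of every residue."""
    return ''.join(AA_CODE[aa] for aa in sequence.upper())

# The example from the text: MASA -> 10011 11001 01001 11001
assert encode_protein("MASA") == "10011110010100111001"
```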
The concept of cellular automata has attracted a great deal of interest in recent years because many extremely complex patterns can be evolved by just repeatedly applying some very simple rules, which is particularly useful for emulating complicated physical, social, and biological systems. In this study, the practical approach to generating the CA image for a given protein sequence can be described as follows. Suppose a protein P consists of N amino acids, i.e.,

P = R_1 R_2 \cdots R_N    (1)

where R_1 represents the first residue of the protein, R_2 the second one, and so forth. According to Table I, the residue chain of (1) is initially converted to a sequence of 5N binary digits,

P(t) = g_1(t) g_2(t) \cdots g_{5N}(t)    (2)

where g_i(t) = 0 or 1 (i = 1, 2, ..., 5N) as defined by Table I, and t denotes the evolution step, with t = 0 corresponding to the initial encoding. The binary codes listed in Table I were derived from a model based on the similarity rule, the complementarity rule, molecular recognition theory, and information theory; such a model can better reflect the chemical and physical properties of the amino acids and their degeneracy. The state of every cell is then updated in parallel according to its own state and the states of its two nearest neighbors:

g_i(t+1) = g^1, if (g_{i-1}(t), g_i(t), g_{i+1}(t)) = (0, 0, 0)
g_i(t+1) = g^2, if (g_{i-1}(t), g_i(t), g_{i+1}(t)) = (0, 0, 1)
g_i(t+1) = g^3, if (g_{i-1}(t), g_i(t), g_{i+1}(t)) = (0, 1, 0)
g_i(t+1) = g^4, if (g_{i-1}(t), g_i(t), g_{i+1}(t)) = (0, 1, 1)
g_i(t+1) = g^5, if (g_{i-1}(t), g_i(t), g_{i+1}(t)) = (1, 0, 0)
g_i(t+1) = g^6, if (g_{i-1}(t), g_i(t), g_{i+1}(t)) = (1, 0, 1)
g_i(t+1) = g^7, if (g_{i-1}(t), g_i(t), g_{i+1}(t)) = (1, 1, 0)
g_i(t+1) = g^8, if (g_{i-1}(t), g_i(t), g_{i+1}(t)) = (1, 1, 1)    (3)

where the periodic boundary conditions g_0(t) = g_{5N}(t) and g_{5N+1}(t) = g_1(t) are used, and g^k = 0 or 1 (k = 1, 2, ..., 8). When (g^8 g^7 \cdots g^1) = (10111000)_2 we obtain rule 184. With rule 184 and t = 100 evolution steps, the image texture is basically steady, and the cellular automaton images of proteins sharing a biochemical attribute generally share a similar texture pattern [6].

According to Wolfram's theory [24], each protein sequence corresponds to a CA image with its own textural features. Thus, proteins belonging to the same GPCR class should share similar textures in their CA images. But how do we extract the textural features that carry the greatest information about each texture? In other words, how to optimally extract these features and formulate them as a set of parameters is an important problem yet to be solved. Among the known measures of complexity, the Ziv-Lempel complexity measure reflects most adequately the repeats occurring in a text. Work on complexity measures for sequences has been motivated by the observation that DNA sequences are known to be of low complexity; as a result, compression based on the Ziv-Lempel complexity does not give a perceptible compression ratio compared with simple encoding procedures, yet it provides an approach to evaluating the complexity of finite sequences, and measures related to the Ziv-Lempel complexity have been used to recognize structural regularities in DNA sequences. In this paper we therefore construct the PseAA composition with the Ziv-Lempel complexity.

For convenience of computation, we let λ = 5 and choose rule 184 in our experiments; in other words, a protein sequence is expressed by a 25-dimensional vector. According to the discrete model of the PseAA composition [29], the protein P can be formulated as

X = (x_1, x_2, x_3, ..., x_{20+5})^T    (4)

with

x_k = f_k / (Σ_{i=1}^{20} f_i + w Σ_{j=1}^{5} p_j),          1 ≤ k ≤ 20
x_k = w p_{k-20} / (Σ_{i=1}^{20} f_i + w Σ_{j=1}^{5} p_j),   21 ≤ k ≤ 25

where T is the transpose operator, f_i (i = 1, 2, ..., 20) are the occurrence frequencies of the 20 native amino acids in the protein, arranged alphabetically according to their single-letter codes, w is the weight coefficient of the sequence-order terms, and p_j (j = 1, 2, ..., 5) are derived from the cellular automaton image of protein P as given by the Ziv-Lempel complexity measures.
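To make the feature-construction pipeline of Section III.C concrete, here is a minimal sketch under our own assumptions about details the paper leaves open: how the five complexity values p_j are read off the CA image (here, the complexity of five evenly spaced rows), the particular Lempel-Ziv phrase-counting variant, and the default weight w. The function names (evolve, ziv_lempel_complexity, pseaa_vector) are ours; encode is expected to be the Table I encoder from the previous sketch.

```python
# A sketch of the PseAA feature pipeline: rule-184 CA evolution of the
# encoded sequence, a Ziv-Lempel-style complexity measure, and the
# 25-dimensional vector of Eq. (4).  The row selection for p_1..p_5 and
# the default weight w are illustrative assumptions only.
from collections import Counter

RULE_184 = {
    (1, 1, 1): 1, (1, 1, 0): 0, (1, 0, 1): 1, (1, 0, 0): 1,
    (0, 1, 1): 1, (0, 1, 0): 0, (0, 0, 1): 0, (0, 0, 0): 0,
}

def evolve(cells, steps=100):
    """Evolve a row of 0/1 cells under rule 184 with periodic boundaries."""
    history = [cells]
    for _ in range(steps):
        n = len(cells)
        cells = [RULE_184[(cells[(i - 1) % n], cells[i], cells[(i + 1) % n])]
                 for i in range(n)]
        history.append(cells)
    return history                        # the CA "image": (steps + 1) rows of 5N cells

def ziv_lempel_complexity(bits):
    """Number of distinct phrases in a simple incremental Lempel-Ziv parsing."""
    s = ''.join(map(str, bits))
    phrases, start, end = set(), 0, 1
    while end <= len(s):
        if s[start:end] not in phrases:
            phrases.add(s[start:end])
            start = end
        end += 1
    return len(phrases)

def pseaa_vector(sequence, encode, w=1.0 / 2700):
    """20 amino acid frequencies plus 5 CA-image complexity terms, as in Eq. (4)."""
    aa = "ACDEFGHIKLMNPQRSTVWY"            # alphabetical single-letter codes
    counts = Counter(sequence.upper())
    f = [counts[a] / len(sequence) for a in aa]
    image = evolve([int(b) for b in encode(sequence)], steps=100)
    rows = [image[i] for i in (20, 40, 60, 80, 100)]   # assumed choice of p_1..p_5
    p = [ziv_lempel_complexity(r) for r in rows]
    denom = sum(f) + w * sum(p)
    return [fi / denom for fi in f] + [w * pj / denom for pj in p]

# Example call, with encode_protein from the earlier encoding sketch:
#   x = pseaa_vector("MASA" * 30, encode_protein)
```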
IV. RESULTS AND DISCUSSION

As is well known, the single-test-set analysis, sub-sampling, and the jackknife analysis are the three methods most often used for cross-validation. However, in the single-test-set examination the rule parameters derived from the training dataset still carry information about the query receptors that are later plugged back in for testing; this underestimates the error and inflates the success rate, because the same receptors are used both to derive the rule parameters and to test them. Compared with the single-test-set examination and sub-sampling analysis, the jackknife test, also called the leave-one-out test, seems to be the most effective. In the jackknife test, each protein in the dataset is singled out in turn as a test sample, and all the rule parameters are determined from the remaining N − 1 samples (see the sketch below).
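The following is a minimal sketch of a jackknife (leave-one-out) evaluation using scikit-learn's LeaveOneOut splitter; the classifier can be any estimator with fit/predict (for instance the boosted SVM sketched earlier), and the placeholder X, y stand in for the PseAA features and family labels.

```python
# Jackknife (leave-one-out) evaluation: each sample is held out in turn
# and the model is refit on the remaining N-1 samples.
import numpy as np
from sklearn.model_selection import LeaveOneOut
from sklearn.svm import SVC

def jackknife_accuracy(X, y, make_model=lambda: SVC(kernel="rbf")):
    correct = 0
    for train_idx, test_idx in LeaveOneOut().split(X):
        model = make_model()                          # fresh model each round
        model.fit(X[train_idx], y[train_idx])
        correct += int(model.predict(X[test_idx])[0] == y[test_idx][0])
    return correct / len(y)

# Example with placeholder data standing in for the 25-D PseAA vectors.
rng = np.random.default_rng(1)
X = rng.normal(size=(60, 25))
y = rng.integers(0, 2, size=60)
print(f"jackknife accuracy: {jackknife_accuracy(X, y):.3f}")
```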
TABLE II. SUCCESS RATE COMPARISON FOR GPCR AND NON-GPCR PROTEINS

Protein type      Xiao et al. 2009 [6]   AdaBoostSVM
GPCR (365)        92.33% (337)           97.81% (357)
Non-GPCR (365)    90.96% (332)           96.16% (351)
Overall (730)     91.64% (669)           96.97% (708)

TABLE III. SUCCESS RATE COMPARISON FOR THE SIX MAIN GPCR FAMILIES

GPCR main family  Xiao et al. 2009 [6]   AdaBoostSVM
RL (232)          96.55% (224)           97.84% (227)
SL (39)           74.36% (29)            74.36% (29)
MGP (44)          81.82% (36)            77.27% (34)
FP (23)           8.70% (2)              65.22% (15)
cAMP (10)         60.00% (6)             90.00% (9)
FSF (17)          47.06% (8)             64.71% (11)
Overall (365)     83.56% (305)           89.04% (326)

The jackknife success rates obtained with AdaBoostSVM in identifying proteins as GPCR or non-GPCR are given in Table II, and those in identifying the GPCR proteins among their six main functional classes are given in Table III. The weight coefficient w for the PseAA components is 1/2700. The results show that the method can distinguish GPCR proteins from non-GPCR proteins with an accuracy of about 97%, whereas the accuracy in identifying GPCR proteins among their six main functional classes is more than 89%, when evaluated by the jackknife test and based on cellular automaton rule 184. The accuracy of our method is not less than 64.7% for any subfamily; in other words, the method performs well for every subfamily, whereas in some earlier papers [6, 30] the accuracies of some subfamilies are much lower than those of the others. According to the literature [6, 30], the success rate anticipated under the assumption that the GPCR samples are completely randomly distributed among the six possible categories is only about 45.7%, which is far below the overall success rate achieved here. These results are all better than those obtained with the cellular automaton image approach based on the gray-level co-occurrence matrix, indicating that the AdaBoostSVM predictor can yield quite reliable results even on this stringent benchmark dataset, in which no protein sample has more than 40% pairwise sequence identity with any other in the same subset.

V. CONCLUSIONS

It is easy to see from the experimental results and analysis that the AdaBoostSVM classifier outperforms the covariant discriminant algorithm in GPCR prediction on the basis of the cellular automaton composition, and the experiments demonstrate that the Ziv-Lempel complexity extracted from the CA images of proteins reflects their overall sequence patterns more effectively, thereby enhancing the power in identifying GPCR functional classes. It is anticipated that this approach can also be used to improve the prediction quality for a series of other protein attributes, such as subcellular localization, membrane type, and enzyme family and subfamily classes, among many others.

Foundation Items: The National Natural Science Foundation of China (No. 60661003 and No. 60961003).

REFERENCES

[1] Spiegel, A. M.; Shenker, A.; Weinstein, L. S. Receptor-effector coupling by G proteins: implications for normal and abnormal signal transduction. Endocr. Rev. 1992, 13(3), pp. 536-565.
[2] Chou, K. C.; Elrod, D. W. Bioinformatical analysis of G-protein-coupled receptors. J. Proteome Res. 2002, 1, pp. 429-433.
[3] Chou, K. C. Prediction of G-protein-coupled receptor classes. J. Proteome Res. 2005, 4, pp. 1413-1418.
[4] Gao, Q. B.; Wang, Z. Z. Classification of G-protein coupled receptors at four levels. Protein Eng. Des. Sel. 2006, 19, pp. 511-516.
[5] Wen, Z.; Li, M. L.; Li, Y. Z.; Guo, Y. Z.; Wang, K. L. Delaunay triangulation with partial least squares projection to latent structures: a model for G-protein coupled receptors classification and fast structure recognition. Amino Acids 2007, 32, pp. 277-283.
[6] Xiao, X.; Wang, P.; Chou, K. C. GPCR-CA: a cellular automaton image approach for predicting G-protein-coupled receptor functional classes. J. Comput. Chem. 2009, 30(9), pp. 1414-1423.
[7] Shao, Jinbo; Li, Xiu; Liu, Wenhuang. The application of AdaBoost in customer churn prediction. Proc. 2007 International Conference on Service Systems and Service Management, pp. 1-6.
[8] Krishnaraj, Yazhene; Reddy, Chandan K. Boosting methods for protein fold recognition: an empirical comparison. Proc. IEEE International Conference on Bioinformatics and Biomedicine (BIBM) 2008, pp. 393-396.
[9] Tang, Jingyuan; Shi, Yibing. Analog circuit fault diagnosis using AdaBoost and SVM. Proc. ICCCAS 2008, pp. 1184-1187.
[10] Zhou, Mian; Wei, Hong. Face verification using Gabor wavelets and AdaBoost. Proc. 18th International Conference on Pattern Recognition (ICPR 2006), Vol. 1, pp. 404-407.
[11] Song, Hua-jun; Shen, Mei-li; Liu, Wei-feng. A specific target track method based on SVM and AdaBoost. Proc. ISCSCT 2008, pp. 360-363.
[12] Shin, Yoonseok; Kim, Dae-Won; Kim, Jae-Yeob; Kang, Kyung-In; Cho, Moon-Young; Cho, Hun-Hee. Application of AdaBoost to the retaining wall method selection in construction. Journal of Computing in Civil Engineering (ASCE) 2009, pp. 188-192.
[13] Wang, Ying; Han, Ping; Lu, Xiaoguang; Wu, Renbiao; Huang, Jingxiong. The performance comparison of AdaBoost and SVM applied to SAR ATR. Proc. CIE International Conference on Radar 2006, pp. 1-4.
[14] Li, Xuchun; Wang, Lei; Sung, Eric. AdaBoost with SVM-based component classifiers. Engineering Applications of Artificial Intelligence 2008, 21(5), pp. 785-795.
[15] Valentini, G.; Dietterich, T. G. Bias-variance analysis of support vector machines for the development of SVM-based ensemble methods. Journal of Machine Learning Research 2004, 5, pp. 725-775.
[16] Vapnik, V. The Nature of Statistical Learning Theory. Springer, New York, 1995.
[17] Li, W.; Godzik, A. Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics 2006, 22, pp. 1658-1659.
[18] Chou, K. C.; Shen, H. B. Cell-PLoc: a package of web servers for predicting subcellular localization of proteins in various organisms. Nature Protocols 2008, 3, pp. 153-162.
[19] Emanuelsson, O.; Brunak, S.; von Heijne, G.; Nielsen, H. Locating proteins in the cell using TargetP, SignalP and related tools. Nature Protocols 2007, 2, pp. 953-971.
[20] Garg, A.; Bhasin, M.; Raghava, G. P. Support vector machine-based method for subcellular localization of human proteins using amino acid compositions, their order, and similarity search. J. Biol. Chem. 2005, 280, pp. 14427-14432.
[21] Nair, R.; Rost, B. Sequence conserved for subcellular localization. Protein Sci. 2002, 11, pp. 2836-2847.
[22] Chou, K. C. Prediction of protein cellular attributes using pseudo amino acid composition. Proteins: Struct. Funct. Genet. 2001, 43, pp. 246-255 (Erratum: ibid., 2001, 44, 60).
[23] Wolfram, S. Cellular automata as models of complexity. Nature 1984, 311, pp. 419-424.
[24] Wolfram, S. A New Kind of Science. Wolfram Media: Champaign, Illinois, 2002.
[25] Diao, Y.; Ma, D.; Wen, Z.; Yin, J.; Xiang, J.; Li, M. Using pseudo amino acid composition to predict transmembrane regions in protein: cellular automata and Lempel-Ziv complexity. Amino Acids 2008, 34, pp. 111-117.
[26] Lempel, A.; Ziv, J. On the complexity of finite sequences. IEEE Trans. Inf. Theory 1976, 22, pp. 75-81.
[27] Gusev, V. D.; Nemytikova, L. A.; Chuzhanova, N. A. On the complexity measures of genetic sequences. Bioinformatics 1999, 15, pp. 994-999.
[28] Xiao, X.; Shao, S.; Ding, Y.; Huang, Z.; Chen, X.; Chou, K. C. An application of gene comparative image for predicting the effect on replication ratio by HBV virus gene missense mutation. J. Theor. Biol. 2005, 235(4), pp. 555-565.
[29] Xiao, X.; Shao, S. H.; Ding, Y. S.; Huang, Z. D.; Huang, Y.; Chou, K. C. Using complexity measure factor to predict protein subcellular location. Amino Acids 2005, 28, pp. 57-61.
[30] Chou, K. C. A novel approach to predicting protein structural classes in a (20-1)-D amino acid composition space. Proteins: Struct. Funct. Genet. 1995, 21, pp. 319-344.