
Automating Ontological Function Annotation: Towards a Common Methodological Framework

Cliff A Joslyn*, Judith D Cohn, Karin M Verspoor, and Susan M Mniszewski
Los Alamos National Laboratory, Los Alamos, NM, USA
* To whom correspondence should be addressed: MS B265 LANL, Los Alamos, NM 87545 USA, joslyn@lanl.gov

ABSTRACT
Motivation: Our work on ontology categorization for functional annotation has motivated us to focus on an overall methodological framework for ontological function annotation (OFA). We draw on our experiences to discuss test set selection, annotation mappings, evaluation metrics, and structural ontology measures for general OFA.

1 INTRODUCTION
A new paradigm for functional protein annotation is the use of automated knowledge discovery algorithms that map sequence, structure, literature, and/or pathway information about proteins whose functions are unknown into a functional ontology, typically (a portion of) the Gene Ontology (GO, GO Consortium 2000; http://www.geneontology.org). For example, our own work (Verspoor et al. 2004, 2005) involves analyzing collections of GO nodes (e.g. the annotations of a protein's BLAST neighborhood) using the POSet Ontology Categorizer (POSOC, Joslyn et al. 2004; http://www.c3.lanl.gov/joslyn/posoc.html) to produce new annotations. Both in executing this work and in examining similar efforts (e.g. Pal and Eisenberg 2005, Martin et al. 2004), we have uncovered a variety of methodological issues which we believe would be valuable for the community to focus on. Here we first set out our sense of a generic architecture for automated ontological function annotation (OFA) into the GO, and then discuss specific methodological issues which are generic to OFA, illustrated by our own experience.
2 GENERIC AUTOMATED OFA
A simple formulation for protein function annotation into the GO assumes a collection of genes or proteins X and a set of GO nodes (perhaps for a particular branch) P. Then in the most general sense, annotation is a function $F: X \to 2^P$ assigning each protein $x \in X$ a collection of GO nodes $F(x) \subseteq P$. So while a known protein x may have a known set of annotations F(x), a new protein y may not have any known annotations, and instead we wish to build some method G returning a predicted set of GO nodes $G(y) \subseteq P$. Typically, we have information about y such as sequence, structure, interactions, pathways, or literature citations, and to build G we exploit knowledge of the proteins "near" y in that space which have known functions. In a testing situation, we take a known protein x and compare its known annotations F(x) against its predicted annotations G(x). Thus to measure the accuracy of our prediction G, we need to compare two different sets of GO nodes, F(x) and G(x), against each other over the set of known proteins X.

3 METHODOLOGICAL ISSUES
We now briefly survey the methodological issues we will explicate completely in the presentation and full paper.

3.1 Protein Test Sets
First we select one or more gold standard test sets X of proteins with trusted annotations in the GO. While any such test set should be shared within the community, requirements for a gold standard will nonetheless vary among research groups. POSOC currently needs a test set containing both sequence and structure data, and so we use Swiss-Prot protein sequences with existing PDB structures (http://www.rcsb.org/pdb). Other groups have used a variety of test sets: for example, Pal and Eisenberg (2005) use a set of protein sequences from the FSSP structure library (http://www.chem.admu.edu.ph/nina/rosby/fssp.htm) to evaluate their ProKnow system, and Martin et al. (2004) use sequence data from seven complete genomes to test GOtcha.
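To make the formulation of Section 2 and the test-set requirement above concrete, the following is a minimal Python sketch under stated assumptions. The protein and node identifiers are placeholders rather than real annotations, the helper names (`select_test_set`, `precision_recall`, `evaluate`) are hypothetical, and the flat set-overlap precision and recall used here is only one simple way to compare F(x) with G(x); it is not presented as the evaluation metric used by POSOC or by any of the systems cited.

```python
from typing import Callable, Dict, Set, Tuple

# Placeholder data (assumption): known annotations F as a mapping protein -> set of GO nodes.
known_annotations: Dict[str, Set[str]] = {
    "protA": {"node1", "node2"},
    "protB": {"node3"},
    "protC": {"node2", "node4"},
}
# Placeholder data (assumption): which proteins have each required data type.
has_sequence: Set[str] = {"protA", "protB", "protC"}
has_structure: Set[str] = {"protA", "protC"}


def select_test_set(F: Dict[str, Set[str]], *required: Set[str]) -> Set[str]:
    """Keep proteins that have at least one known annotation and every required data type."""
    candidates = {x for x, nodes in F.items() if nodes}
    for available in required:
        candidates &= available
    return candidates


def precision_recall(known: Set[str], predicted: Set[str]) -> Tuple[float, float]:
    """Flat set overlap between F(x) and G(x); ignores the ontology's hierarchy."""
    overlap = len(known & predicted)
    precision = overlap / len(predicted) if predicted else 0.0
    recall = overlap / len(known) if known else 0.0
    return precision, recall


def evaluate(
    F: Dict[str, Set[str]], G: Callable[[str], Set[str]], X: Set[str]
) -> Dict[str, Tuple[float, float]]:
    """Compare F(x) against G(x) for every protein x in the test set X."""
    return {x: precision_recall(F[x], G(x)) for x in sorted(X)}


# Toy stand-in for a predictor G (assumption): a fixed lookup table.
toy_predictions: Dict[str, Set[str]] = {
    "protA": {"node1", "node5"},
    "protC": {"node2", "node4"},
}


def G(x: str) -> Set[str]:
    return toy_predictions.get(x, set())


X = select_test_set(known_annotations, has_sequence, has_structure)
scores = evaluate(known_annotations, G, X)
for protein, (precision, recall) in scores.items():
    print(protein, precision, recall)
```

Because this comparison treats GO nodes as unrelated labels, a prediction that lands on a parent or child of a known annotation counts as a complete miss, which is one reason a measure that takes ontology structure into account may be preferable.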
A further consideration is non-redundant test data, sampled to avoid over-representation of any part of the test space. For example, the non-redundant Astral subsets of SCOP domains are designed to cover the variation in SCOP structure space while ensuring that no two SCOP domains in a particular subset have a sequence identity greater than a specified cutoff value (e.g. 95% or 40%) (Chandonia et al. 2004). We propose development of a non-redundant test set covering GO function space.

3.2 Annotation Mappings
The value of any gold standard is very much tied to the accuracy of its known annotations F. POSOC uses the GOA (http://www.ebi.ac.uk/GOA) UniProt (http://www.ebi.ac.uk/uniprot/index.html) annotation set for protein sequences, and it could be useful for this set, or other annotations for other data types, to be regularized as a community standard to provide a means of comparing various studies, including studies attempting to create better annotation sets. Extension to include the source of annotations for a particular type of data and a common ranking for the evidence codes included in GO annotation files (e.g. IC = inferred by curator, IEA = inferred from el