The IGS Annotation Engine pipeline

Gene finding

For the initial identification of protein-coding sequences in prokaryotic genomes, we use the Glimmer3 algorithm (1), which identifies genes using interpolated Markov models. To find tRNAs we use tRNAscan-SE (2). Ribosomal RNA genes and other structural RNAs are identified directly from BLAST (4) search results or from matches to Rfam (3), a database of non-coding RNA families.

Homology searches

Pairwise protein searches. The putative genes identified by Glimmer are translated and run through a series of searches. One of the two main searches uses a pairwise protein search program written at TIGR called BLAST_extend_repraze, or BER. First, BLAST (4) searches are performed against a non-redundant protein database. A modified Smith-Waterman alignment (5) is then built between an extended version of the query and each significant BLAST match. Because BER is given the underlying DNA sequence of the gene in question, it can look in other reading frames and past stop codons for regions of similarity between the two proteins. Therefore, if a sequencing error or a natural mutation has split one gene into two (through a frameshift or an in-frame stop codon), BER creates a single alignment across the two fragments. Many users have reported that BER is an excellent method for detecting regions with frameshifts, which then need to be evaluated as either sequencing errors or authentic mutations. The 40 best-scoring alignments for each protein are retained. The BER tool is open source and available at ber.sourceforge.net.
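To illustrate why access to the underlying DNA matters, the sketch below translates a short sequence in all three forward reading frames. It is not BER code and does not perform any alignment; the function name translate_frames and the toy sequences are assumptions made for this example. After a single inserted base, the downstream part of the protein no longer appears in frame 0 but can still be read in another frame, which is the situation an alignment across frames is meant to recover.

```python
# Standard genetic code (NCBI translation table 1), built in TCAG order.
BASES = "TCAG"
AMINO = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
         "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate_frames(dna: str) -> list[str]:
    """Translate dna in the three forward frames; '*' marks a stop codon.

    Trailing bases that do not fill a complete codon are ignored.
    Hypothetical helper for illustration only.
    """
    dna = dna.upper()
    frames = []
    for offset in range(3):
        protein = []
        for pos in range(offset, len(dna) - 2, 3):
            # Unknown codons (e.g. containing N) are shown as 'X'.
            protein.append(CODON_TABLE.get(dna[pos:pos + 3], "X"))
        frames.append("".join(protein))
    return frames


# Toy gene and a copy with one extra base inserted after the second codon.
intact = "ATGGCCAAAGGTGAATGA"
shifted = intact[:6] + "T" + intact[6:]

intact_frames = translate_frames(intact)
shifted_frames = translate_frames(shifted)
```

In this toy case, frame 0 of the intact sequence reads MAKGE*, whereas frame 0 of the shifted sequence hits a premature stop and the residues KGE reappear in frame 1.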

HMM searches. The second major search performed for each genome is an HMM database search. The HMM database used for these searches contains the TIGRFAM (6) and Pfam (7) datasets, which together total well over 12,000 HMMs. All proteins from the Annotation Engine genomes are searched against this database using HMMER (8). All HMM matches with an expectation (E) value of less than 1 are retained in the annotation database for later evaluation.
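The retention rule described above can be sketched as a simple filter. The HmmHit record, its field names, and the example accessions and values are hypothetical and do not reflect the pipeline's actual data model; only the threshold (E-value strictly less than 1) comes from the text.

```python
from dataclasses import dataclass


@dataclass
class HmmHit:
    """Hypothetical record for one protein-to-HMM match."""
    protein_id: str
    hmm_accession: str
    evalue: float


def retain_hits(hits: list[HmmHit], max_evalue: float = 1.0) -> list[HmmHit]:
    """Keep hits whose E-value is strictly below max_evalue."""
    return [hit for hit in hits if hit.evalue < max_evalue]


# Made-up hits for illustration.
hits = [
    HmmHit("prot_0001", "HMM_A", 1e-30),
    HmmHit("prot_0001", "HMM_B", 0.5),
    HmmHit("prot_0002", "HMM_C", 1.0),   # not retained: not less than 1
    HmmHit("prot_0002", "HMM_D", 3.2),
]
retained = retain_hits(hits)
```

A threshold this permissive keeps weak matches on purpose, so that they remain available for later evaluation rather than being discarded at search time.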

Other search tools. All proteins are searched against the PROSITE database (9), a collection of amino acid sequence signatures that characterize protein families, domains, or functional sites such as binding sites or catalytic sites. This collection is maintained by the Swiss Institute of Bioinformatics and is available for download from their web site. In addition, all of the proteins are searched using two freely available tools developed by the Center for Biological Sequence Analysis: SignalP (10) and TMHMM (11). SignalP predicts the presence of a putative signal sequence and is therefore helpful in predicting secreted proteins. TMHMM predicts membrane-spanning regions and is helpful in identifying potential membrane proteins. Potential lipoproteins are identified with a specific PROSITE motif. All proteins are also searched against the NCBI clusters of orthologous genes (COG) database (12).
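As a rough illustration of how a PROSITE-style pattern can be matched against a protein, the sketch below converts the basic pattern syntax into a Python regular expression. The converter prosite_to_regex is a hypothetical helper covering only the common syntax elements (x, [..], {..}, repeat counts, and the < and > anchors), and it is not the search program used by the pipeline. The example pattern is the PROSITE N-glycosylation site pattern, chosen because it is short; it is not the lipoprotein motif mentioned above.

```python
import re


def prosite_to_regex(pattern: str) -> str:
    """Convert a basic PROSITE pattern to a Python regex string."""
    pattern = pattern.strip().rstrip(".")
    parts = []
    for element in pattern.split("-"):
        prefix = suffix = ""
        if element.startswith("<"):      # anchored at the N-terminus
            prefix, element = "^", element[1:]
        if element.endswith(">"):        # anchored at the C-terminus
            suffix, element = "$", element[:-1]
        # Split an element such as "x(2,4)" into its core and repeat counts.
        match = re.fullmatch(r"(.+?)(?:\((\d+)(?:,(\d+))?\))?", element)
        core, low, high = match.group(1), match.group(2), match.group(3)
        if core == "x":                  # any residue
            core = "."
        elif core.startswith("{") and core.endswith("}"):
            core = "[^" + core[1:-1] + "]"   # any residue except these
        # "[ABC]" and single letters are already valid regex fragments.
        repeat = ""
        if low:
            repeat = "{" + low + ("," + high if high else "") + "}"
        parts.append(prefix + core + repeat + suffix)
    return "".join(parts)


# N-glycosylation site pattern, used here only as a syntax example.
glyco_regex = prosite_to_regex("N-{P}-[ST]-{P}.")
hit = re.search(glyco_regex, "AANGSAA")      # N-G-S-A satisfies the pattern
no_hit = re.search(glyco_regex, "AANPSAA")   # proline after N breaks it
```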

Generation of initial automatic annotation

Once all of the searches are complete, an automated process assigns preliminary annotation to each protein in the genome. This tool weighs evidence from a ranked list of evidence types contained in the BER and HMM search output. Each protein is assigned a descriptive common name taken from either the protein name of a BER match or an HMM name. Proteins predicted to be enzymes are assigned Enzyme Commission (EC) numbers (13). The EC nomenclature system categorizes all known enzymatic reactions with unique ID numbers. Gene symbols (e.g., 'recA') are assigned as appropriate. Gene Ontology (GO) (14) terms are assigned from HMM or BER results where the match has GO term assignments. All evidence is available for viewing by the user.
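One way to picture name assignment from a ranked list of evidence types is sketched below. The evidence type names, their order, the Evidence record, and the fallback name are all assumptions made for this example; the text does not specify the actual ranking or tie-breaking used by the pipeline.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical ranking: earlier entries are treated as stronger evidence.
EVIDENCE_RANK = ["equivalog_hmm", "characterized_ber_match",
                 "subfamily_hmm", "ber_match"]


@dataclass
class Evidence:
    """Hypothetical evidence record attached to one protein."""
    evidence_type: str
    common_name: str
    ec_number: Optional[str] = None


def assign_annotation(evidence: list[Evidence]) -> tuple[str, Optional[str]]:
    """Return (common name, EC number) from the best-ranked evidence.

    Evidence of unknown type is ignored; with no usable evidence the
    protein gets a placeholder name (an assumption of this sketch).
    """
    usable = [e for e in evidence if e.evidence_type in EVIDENCE_RANK]
    if not usable:
        return "hypothetical protein", None
    best = min(usable, key=lambda e: EVIDENCE_RANK.index(e.evidence_type))
    return best.common_name, best.ec_number


# Made-up evidence for one protein.
evidence = [
    Evidence("ber_match", "putative kinase"),
    Evidence("equivalog_hmm", "example kinase", ec_number="2.7.1.1"),
]
name, ec = assign_annotation(evidence)
```

Keeping the ranking in a single list makes the precedence explicit and easy to adjust, at the cost of ignoring match scores within one evidence type.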

Providing the data to the user

Once all of the searches and the automatic annotation are complete, the information in the IGS database is loaded into a MySQL database and placed on a private FTP site along with all associated files needed to run Manatee (such as BER search output files). At this point the Annotation Engine user can, if they so choose, begin manual annotation using Manatee.

References

  1. Delcher AL, Bratke KA, Powers EC, Salzberg SL. Identifying bacterial genes and endosymbiont DNA with Glimmer. Bioinformatics. 2007 Mar 15;23(6):673-9. Epub 2007 Jan 19.
  2. Lowe TM, Eddy SR. tRNAscan-SE: a program for improved detection of transfer RNA genes in genomic sequence. Nucleic Acids Res. 1997 Mar 1;25(5):955-64.
  3. Griffiths-Jones S, Bateman A, Marshall M, Khanna A, Eddy SR. Rfam: an RNA family database. Nucleic Acids Res. 2003 Jan 1;31(1):439-41.
  4. Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ. Basic local alignment search tool. J Mol Biol. 1990 Oct 5;215(3):403-10.
  5. Smith TF, Waterman MS. Identification of common molecular subsequences. J Mol Biol. 1981 Mar 25;147(1):195-7.
  6. Haft DH, Selengut JD, White O. The TIGRFAMs database of protein families. Nucleic Acids Res. 2003 Jan 1;31(1):371-3.
  7. Bateman A, Coin L, Durbin R, Finn RD, Hollich V, Griffiths-Jones S, Khanna A, Marshall M, Moxon S, Sonnhammer EL, Studholme DJ, Yeats C, Eddy SR. The Pfam protein families database. Nucleic Acids Res. 2004 Jan 1;32 Database issue:D138-41.
  8. Eddy S. Profile hidden Markov models. Bioinformatics. 1998;14(9):755-63. Review.
  9. Falquet L, Pagni M, Bucher P, Hulo N, Sigrist CJ, Hofmann K, Bairoch A. The PROSITE database, its status in 2002. Nucleic Acids Res. 2002 Jan 1;30(1):235-8.
  10. Bendtsen JD, Nielsen H, von Heijne G, Brunak S. Improved prediction of signal peptides: SignalP 3.0. J Mol Biol. 2004;340:783-95.
  11. Krogh A, Larsson B, von Heijne G, Sonnhammer EL. Predicting transmembrane protein topology with a hidden Markov model: application to complete genomes. J Mol Biol. 2001 Jan 19;305(3):567-80.
  12. Tatusov RL, Fedorova ND, Jackson JD, Jacobs AR, Kiryutin B, Koonin EV, Krylov DM, Mazumder R, Mekhedov SL, Nikolskaya AN, Rao BS, Smirnov S, Sverdlov AV, Vasudevan S, Wolf YI, Yin JJ, Natale DA. The COG database: an updated version includes eukaryotes. BMC Bioinformatics. 2003 Sep 11;4:41. Epub 2003 Sep 11.
  13. The Enzyme Commission. Enzyme Nomenclature 1992 [Academic Press, San Diego, California, ISBN 0-12-227164-5 (hardback), 0-12-227165-3 (paperback)] with Supplement 1 (1993), Supplement 2 (1994), Supplement 3 (1995), Supplement 4 (1997) and Supplement 5 (in Eur. J. Biochem. 1994, 223, 1-5; Eur. J. Biochem. 1995, 232, 1-6; Eur. J. Biochem. 1996, 237, 1-5; Eur. J. Biochem. 1997, 250; 1-6, and Eur. J. Biochem. 1999, 264, 610-650; respectively)
  14. Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, Davis AP, Dolinski K, Dwight SS, Eppig JT, Harris MA, Hill DP, Issel-Tarver L, Kasarskis A, Lewis S, Matese JC, Richardson JE, Ringwald M, Rubin GM, Sherlock G. Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat Genet. 2000 May;25(1):25-9.