For initial identification of protein-coding sequences in prokaryotic genomes, we use the Glimmer3 algorithm (1), which identifies genes using interpolated Markov models. To find tRNAs, we use the tRNAscan tool (2). Ribosomal RNA genes and other structural RNAs are identified directly from BLAST (4) search results or from matches to Rfam, a database of non-coding RNA families (3).
Pairwise protein searches. The putative genes identified by Glimmer are translated and run through a series of searches. One of the two main searches is a pairwise protein search performed with a program written at TIGR called BLAST_extend_repraze, or BER. BLAST (4) searches are first performed against a non-redundant protein database. A modified Smith-Waterman alignment (5) is then built between an extended version of the query and each significant BLAST match. An important feature of the BER tool is that, because it is given the underlying DNA sequence of the gene in question, it can look in other reading frames and past stop codons for regions of similarity between two proteins. Therefore, if a sequencing error or a natural mutation has split one gene into two (through a frameshift or an in-frame stop codon), the BER tool creates an alignment that spans both fragments. Many users have reported that BER is particularly effective at detecting regions with frameshifts, which must then be evaluated to determine whether they reflect sequencing errors or authentic mutations. The 40 best-scoring alignments for each protein are retained. The BER tool is open source and available at ber.sourceforge.net.
HMM searches. The second major search performed for each genome is an HMM database search. The HMM database used for these searches contains the TIGRFAM (6) and Pfam (7) datasets, which together total well over 12,000 HMMs. All proteins from the Annotation Engine genomes are searched against the HMM database using HMMER (8). All HMM matches with an expect (E) value of less than 1 are retained in the annotation database for later evaluation.
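The retention rule can be sketched in a few lines of Python. The record layout (HmmHit), the function name retain_hits, and the identifiers in the example are hypothetical and chosen only for illustration; only the threshold (E-value less than 1) comes from the text above.

```python
from dataclasses import dataclass


@dataclass
class HmmHit:
    """Hypothetical record for one HMM match; field names are assumptions."""
    protein_id: str
    hmm_accession: str
    e_value: float


def retain_hits(hits, max_e_value=1.0):
    """Keep only matches whose E-value is strictly below max_e_value."""
    return [hit for hit in hits if hit.e_value < max_e_value]


# Made-up example data, not real search results.
hits = [
    HmmHit("prot_1", "HMM_A", 1e-30),
    HmmHit("prot_1", "HMM_B", 0.5),
    HmmHit("prot_2", "HMM_C", 1.0),
    HmmHit("prot_2", "HMM_D", 12.0),
]
retained = retain_hits(hits)
```

Such a permissive cutoff keeps weak matches available for later evaluation instead of discarding them at search time.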
Other search tools. All proteins are searched against the PROSITE database (9), a collection of amino acid sequence signatures that characterize protein families, domains, or functional sites such as binding sites or catalytic sites. This collection is maintained by the Swiss Institute of Bioinformatics and is available for download from their web site. In addition, all proteins are searched using two freely available tools developed by the Center for Biological Sequence Analysis: SignalP (10) and TMHMM (11). SignalP predicts the presence of a putative signal sequence and is therefore helpful in predicting secreted proteins. TMHMM predicts membrane-spanning regions and is helpful in identifying potential membrane proteins. Potential lipoproteins are identified with a specific PROSITE motif. All proteins are also searched against the NCBI clusters of orthologous genes (COG) database (12).
Once all of the searches are complete, an automated process assigns preliminary annotation to each protein in the genome. This tool weighs evidence according to a ranked list of evidence types contained in the BER and HMM search output. Each protein is assigned a descriptive common name taken from either a BER match protein name or an HMM name. Proteins predicted to be enzymes are assigned Enzyme Commission (EC) numbers (13); the EC nomenclature system categorizes all known enzymatic reactions with unique ID numbers. Gene symbols (e.g., 'recA') are assigned as appropriate. Gene Ontology (GO) (14) terms are assigned from HMM or BER results where the match has GO term assignments. All evidence is available for viewing by the user.
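The idea of choosing a name from a ranked list of evidence types can be sketched as follows. The evidence type labels, their order, the fallback name, and the function name assign_common_name are all assumptions made for illustration; the actual ranking used by the automated process is not specified here.

```python
# Hypothetical ranking, best evidence first. These labels and their order
# are illustrative assumptions, not the pipeline's actual evidence types.
EVIDENCE_RANK = [
    "hmm_equivalog",
    "ber_characterized_match",
    "hmm_subfamily",
    "ber_other_match",
]


def assign_common_name(evidence, default="hypothetical protein"):
    """Return the name attached to the best-ranked piece of evidence.

    evidence is a list of (evidence_type, name) pairs. Types missing from
    EVIDENCE_RANK are ignored; if nothing usable remains, default is returned
    (the default value is itself an assumption).
    """
    rank = {etype: position for position, etype in enumerate(EVIDENCE_RANK)}
    usable = [item for item in evidence if item[0] in rank]
    if not usable:
        return default
    best = min(usable, key=lambda item: rank[item[0]])
    return best[1]


name = assign_common_name([
    ("ber_other_match", "conserved protein"),
    ("hmm_equivalog", "recombinase A"),
])
```

Keeping the ranking in a single list makes the precedence between BER-derived and HMM-derived names explicit and easy to adjust.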
Once all of the searches and automatic annotation are complete, the information in the IGS database is loaded into a MySQL database and placed on a private FTP site along with all associated files (such as BER search output files) needed to run Manatee. At this point, the Annotation Engine user can begin manual annotation using Manatee, if they so choose.