Title: Automated Annotation of Microbial Genomes, Opportunities and Pitfalls
1. Automated Annotation of Microbial Genomes, Opportunities and Pitfalls
- Margie Romine
- Pacific Northwest National Laboratory
- Richland, Washington
2. Shewanella oneidensis MR-1
- Breathes Mn, Fe, and other metals, thereby changing their solubility
- Also reduces radionuclides and hence impacts their mobility at contaminated sites
- Genome sequenced by The Institute for Genomic Research in 2002 (funded by DOE-OBER)
- Can we now better determine how this organism interacts with metals and radionuclides?
3. Shewanella spp. Inhabit Many Niches
2 more were sequenced by DOE's Joint Genome Institute and 14 more are under way!
- Energy rich: fermentation is occurring and energy is continuously being deposited via sedimentation
- Rapidly changing redox conditions and dominant electron acceptors
- Microbial partners are present to remove the acetate produced via anaerobic respiration
4. Bacterial Genome Sequencing Explodes
- 341 completed genomes, 976 ongoing
- Partial genome sequences are now released by JGI in just days!
- How do we use sequence information to understand how all these organisms function in the environment?
- Annotation is the key, but it is now largely automated and hence of lower quality
5. What is Annotation?
AGCTTAACTGGGATACGACGACCAGTAGACAGGTRTACGATGAGATATATAT
Locate genes
Translate to proteins
MASDLKKIYTRPRPDSAWQECVAALFDGHSKDKLACNDDL
Gather Evidence of function
Assign putative functions
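The first two steps on this slide (locate genes, translate to proteins) can be illustrated with a toy sketch. It is not how production gene finders work: it assumes forward-strand reading frames only, ATG as the only start codon, and the standard codon assignments. The function names `translate` and `find_orfs` and the short test sequence are hypothetical and chosen only for illustration.

```python
from itertools import product

# Standard codon assignments, listed in TCAG order ('*' marks a stop codon).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO)
}
STOP_CODONS = {codon for codon, aa in CODON_TABLE.items() if aa == "*"}


def translate(dna):
    """Translate DNA codon by codon; unknown codons (e.g. with R or N) become X."""
    dna = dna.upper()
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def find_orfs(dna, min_codons=3):
    """Return (start, end, protein) for ATG...stop ORFs in the 3 forward frames.

    Assumption: forward strand only and ATG starts only, to keep the sketch short.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first start codon seen in this frame
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, translate(dna[start:i])))
                start = None  # look for the next ORF in this frame
    return orfs


# Hypothetical toy sequence: one short ORF in frame 2.
print(find_orfs("CCATGGCTTCTGATTAACC"))
```

Real annotation pipelines also scan the reverse strand, allow alternative start codons, and score candidates statistically, which is part of why miscalled start sites occur.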
6. Annotation Drives Post-genomic Research
Function predictions
Methodologies
Data
Interpretation
Gene predictions
DNA microarrays
mRNA expression
Metabolic modeling
ChIP-chip
DNA binding sites
Protein predictions
Proteomics
Protein expression
Hypothesis
Targeted gene knock-outs
7. Annotation with Gnare/Puma2
- Developed at Argonne National Laboratory by Natalia Maltsev, Mark D'Souza, Elizabeth Glass, Dina Sulakhe, Mustafa Syed, Pavan Anumula
- http://compbio.mcs.anl.gov/puma2/cgi-bin/index.cgi
- Gnare: private genome sequences
- Puma2: public genome sequences
8. Types of Functional Descriptors
- Hypothetical protein
- Conserved hypothetical protein
- Conserved domain protein
- Function associated protein
- Class specific enzyme
- Specific function predicted
- Function validated
9. Checking Functions Where No Domain Hit Occurs
type IV secretion outer membrane protein, PilW?
10. MKNCQKG
11. Clues in InterPro Domain Descriptor
This is a family of hypothetical proteins. A
number of the sequence records state they are
transmembrane proteins or putative permeases. It
is not clear what source suggested that these
proteins might be permeases and this information
should be treated with caution.
autoinducer-2 transport protein, TqsA
2.A.86 The Autoinducer-2 Exporter (AI-2E) Family
The AI-2E family (UPF0118) is a large family of prokaryotic proteins derived from a variety of bacteria and archaea. Those examined are about 350 residues in length, and the couple that have been examined exhibit 7 putative transmembrane α-helical spanners (TMSs). E. coli, B. subtilis and several other prokaryotes have multiple paralogues encoded within their genomes. Herzberg et al. (2006) have presented strong evidence for a role of an AI-2E family homologue, YdgG (renamed TqsA), as an exporter of the E. coli autoinducer-2 (AI-2) (Camilli and Bassler, 2006; Chen et al., 2002). AI-2 is a proposed signalling molecule for interspecies communication in bacteria. It is a furanosyl borate diester (Chen et al., 2002).
13. Clusters with N-acetylglucosamine catabolic enzymes
Hypothesis experimentally validated!
15. Relevant abstracts mentioning your query species (Shewanella oneidensis)
sulfite dehydrogenase catalytic molybdopterin
subunit, SorA
16. Mistake in InterPro Database Found!
17. More Automation in Evidence Collecting Needed
18. Protein Location Linked to Function
19. Multiple Routes of Secretion
(Figure labels: LepB, LepB, LspA, F E, G, PilD, GG, C39)
20. Bioinformatics Tools for Localization Prediction
- Incorrect start sites have a strong impact on predictions!
- Different tools have unique specialties
- No one tool provides good predictions for all proteins
LepB
IM TM
Psort LipoP Predsi Phobius SignalP TatP
Sosui TmHMM Phobius Psort HMMTOP
LspA
β barrel
SubLoc Cello Psort Secretome
LipoP Lipo Psort
ProfTMB Bomp BBTM
21. Example: c-type cytochromes
- Contain CXXCH motif for binding heme
- So do some other proteins that are not c-type cytochromes?
- All are secreted across the inner membrane and then assembled
- 60 proteins in MR-1 have CXXCH
- Only 43 have a leader peptide and are predicted to be c-type cytochromes
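The two-step filter described on this slide (motif present, then leader peptide present) could be sketched as follows. This is only an illustration under stated assumptions: the leader-peptide calls are taken as precomputed booleans from whatever external signal-peptide predictor is used, and the function names, protein IDs, and toy sequences are hypothetical.

```python
import re

# CXXCH heme-binding motif: C, any two residues, C, H.
# A lookahead is used so that overlapping motifs are all reported.
HEME_MOTIF = re.compile(r"(?=(C..CH))")


def heme_motif_positions(protein):
    """Return 0-based start positions of every CXXCH motif in a protein."""
    return [m.start() for m in HEME_MOTIF.finditer(protein.upper())]


def candidate_c_type_cytochromes(proteins, has_leader_peptide):
    """Keep proteins that have at least one CXXCH motif AND a leader peptide.

    proteins: dict mapping protein ID -> amino acid sequence
    has_leader_peptide: dict mapping protein ID -> bool
        (assumed to come from an external signal-peptide predictor)
    """
    candidates = {}
    for protein_id, sequence in proteins.items():
        positions = heme_motif_positions(sequence)
        if positions and has_leader_peptide.get(protein_id, False):
            candidates[protein_id] = positions
    return candidates


# Hypothetical toy data: only p1 has both the motif and a leader peptide.
toy_proteins = {"p1": "MKKLLCAACHGG", "p2": "MCGGCHAA", "p3": "MAAAA"}
toy_leaders = {"p1": True, "p2": False, "p3": True}
print(candidate_c_type_cytochromes(toy_proteins, toy_leaders))
```

Because the leader-peptide call depends on the predicted N-terminus, a miscalled start site would drop a true cytochrome from this list, which is why the motif alone is not sufficient evidence.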
22. Future Needs in Annotation Automation
- Current methods of automated annotation will lead to propagation of annotation errors and burying of useful evidence
- But manual annotation cannot keep up with the rate at which sequences are produced
- Additional automations are needed!
- Protein localization
- Specialty database mining (TCDB, MEROPS, etc.)
- Experimental data mining: appropriate databases don't exist