Title: Genome Annotation
1Genome Annotation
- Bioinformatics 301
- David Wishart
- david.wishart_at_ualberta.ca
- Notes at http://redpoll.pharmacy.ualberta.ca
2DNA Sequencing
Isolate Chromosome
Shear DNA into Fragments
Clone into Seq. Vectors
Sequence
3Sequence Assembly
Assembled Sequence
Sequence Chromatogram
Send to Computer
4Genome Sequence
>P12345 Yeast chromosome1
GATTACAGATTACAGATTACAGAT
TACAGATTACAG ATTACAGATTACAGATTACAGATTACAGATTACAGA
TTACAGATTACAGATTACAGATTACAGATTACAGAT TACAGATTAGAGA
TTACAGATTACAGATTACAGATT ACAGATTACAGATTACAGATTACAGA
TTACAGATTA CAGATTACAGATTACAGATTACAGATTACAGATTAC AG
ATTACAGATTACAGATTACAGATTACAGATTACA GATTACAGATTACAG
ATTACAGATTACAGATTACAG ATTACAGATTACAGATTACAGATTACAG
ATTACAGA TTACAGATTACAGATTACAGATTACAGATTACAGAT
5Predict Genes
6The Result
>P12346 Sequence 1
ATGTACAGATTACAGATTACAGATTACAGAT
TACAG ATTACAGATTACAGATTACAGATTACAGATTACAGA TTACAGA
TTACAGATTACAGATTACAGAT
>P12347 Sequence 2
ATGAGATTAGAGATTACAGATTACAGATTACAGATT ACAGATTACAG
ATTACAGATTACAGATTACAGATTA CAGATTACAGATTACAGATTACAG
ATTACAGATT
>P12348 Sequence 3
ATGTTACAGATTACAGATT
ACAGATTACAGATTACA GATTACAGATTACAGATTACAGATTACA...
7Is This Annotated?
>P12346 Sequence 1
ATGTACAGATTACAGATTACAGATTACAGAT
TACAG ATTACAGATTACAGATTACAGATTACAGATTACAGA TTACAGA
TTACAGATTACAGATTACAGAT
>P12347 Sequence 2
ATGAGATTAGAGATTACAGATTACAGATTACAGATT ACAGATTACAG
ATTACAGATTACAGATTACAGATTA CAGATTACAGATTACAGATTACAG
ATTACAGATT
>P12348 Sequence 3
ATGTTACAGATTACAGATT
ACAGATTACAGATTACA GATTACAGATTACAGATTACAGATTACA...
8How About This?
>P12346 Sequence 1
MEKGQASRTDHNMCLKPGAAERTPESTSPAS
DAAGG IPQNLKGFYQALNNWLKDSQLKPPPSSGTREWAALK LPNTHIA
LD
>P12347 Sequence 2
MKPQRTLNASELVISLIVESINTHISH
OUSEPLEAS EWILLITALLCEASE
>P12348 Sequence 3
MQWERTGHFDALKPQWERTYHEREISANTHERS...
9Gene Annotation
- Annotation: to identify and describe all the
physico-chemical, functional and structural
properties of a gene, including its DNA sequence,
protein sequence, sequence corrections, name(s),
position, function(s), abundance, location, mass,
pI, absorptivity, solubility, active sites,
binding sites, reactions, substrates, homologues,
2° structure, 3D structure, domains, pathways,
interacting partners
10Gene Annotation
Protein Annotation
11Protein/Gene vs. Proteome/Genome Annotation
- Gene/Protein annotation is concerned with one or
a small number (<50) of genes or proteins from one
or several types of organisms
- Genome/Proteome annotation is concerned with
entire proteomes (>2000 proteins) from a specific
organism (or for all organisms); hence the need for speed
12Different Levels of Annotation
- Sparse: typical of archival databanks like
GenBank, usually just includes name, depositor,
accession number, dates, ID
- Moderate: typical of many curated protein
sequence databanks (PIR, TrEMBL)
- Detailed: not typical (occasionally found in
organism-specific databases)
13Different Levels of Database Annotation
- GenBank (large number of sequences, minimal annotation)
- PIR (large number of sequences, slightly better annotation)
- SwissProt (small number of sequences, even better annotation)
- Organism-specific DB (very small number of sequences, best annotation)
14GenBank Annotation
15PIR Annotation
16Swiss-Prot Annotation
17The CCDB
http://redpoll.pharmacy.ualberta.ca/CCDB
18CCDB Annotation
19CCDB Annotation
20CCDB Contents
- Functional info (predicted or known)
- Sequence information (sites, modifications, pI, MW, cleavage)
- Location information (in chromosome and cell)
- Interacting partners (known and predicted)
- Structure (2°, 3°, 4°, predicted)
- Enzymatic rate and binding constants
- Abundance, copy number, concentration
- Links to other sites and viewing tools
- Integrated version of all major DBs
- 70 fields for each entry
21Searching Capabilities
- Text search, BLAST search, SQL search
- Show all membrane proteins that are essential
and have more than 6 membrane spanning regions
- Chemical Structure search
- Find all metabolites similar to this prospective
drug structure
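The SQL search example above can be sketched as a query. This is only an illustration: the table name `proteins`, its columns, and the four rows are hypothetical and are not the actual CCDB schema.

```python
import sqlite3

# Hypothetical, in-memory stand-in for an annotation database.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE proteins ("
    "name TEXT, location TEXT, essential INTEGER, tm_regions INTEGER)"
)
conn.executemany(
    "INSERT INTO proteins VALUES (?, ?, ?, ?)",
    [
        ("protA", "membrane", 1, 10),   # essential, 10 spanning regions
        ("protB", "membrane", 0, 12),   # not essential
        ("protC", "cytoplasm", 1, 0),   # not a membrane protein
        ("protD", "membrane", 1, 4),    # too few spanning regions
    ],
)

# "Show all membrane proteins that are essential and have
#  more than 6 membrane spanning regions"
rows = conn.execute(
    "SELECT name FROM proteins "
    "WHERE location = 'membrane' AND essential = 1 AND tm_regions > 6 "
    "ORDER BY name"
).fetchall()
hits = [name for (name,) in rows]
print(hits)
```

A query like this only works if every entry carries the relevant fields, which is why the depth of annotation matters.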
22Ultimate Goal...
- To achieve the same level of protein/proteome
annotation as found in CCDB for all
genes/proteins -- automatically
How?
23Annotation Methods
- Annotation by homology (BLAST)
- requires a large, well annotated database of
protein sequences
- Annotation by sequence composition
- simple statistical/mathematical methods
- Annotation by sequence features, profiles or
motifs
- requires sophisticated sequence analysis tools
24Annotation by Homology
- Statistically significant sequence matches
identified by BLAST searches against GenBank
(nr), SWISS-PROT, PIR, ProDom, BLOCKS, KEGG, WIT,
BRENDA, BIND
- Properties or annotation inferred by name,
keywords, features, comments
Databases Are Key
25Sequence Databases
- GenBank
- www.ncbi.nlm.nih.gov/
- EMBL/trEMBL
- www.ebi.ac.uk/trembl/
- DDBJ
- www.nig.ac.jp/
- PIR
- http://pir.georgetown.edu/
- SwissProt
- www.expasy.ch/sprot/
26Structure Databases
- RCSB-PDB
- http://www.rcsb.org/pdb/
- MSD
- http://www.ebi.ac.uk/msd/index.html
- CATH
- www.biochem.ucl.ac.uk/bsm/cath/
- SCOP
- www.scop.mrc-lmb.cam.ac.uk/scop/
27Expression Databases
- Swiss 2D Page
- http://ca.expasy.org/ch2d/
- SMD
- http://genome-www5.stanford.edu/MicroArray/SMD/
- ArrayExpress
- http://www.ebi.ac.uk/arrayexpress/
- Gene Expr. Omnibus
- http://www.ncbi.nlm.nih.gov/geo/
28Metabolism Databases
- KEGG
- http://www.genome.ad.jp/kegg/metabolism.html
- Roche/Boehringer
- http://www.expasy.org/cgi-bin/search-biochem-index
- EcoCyc
- www.ecocyc.org/
- WIT
- http://wit.mcs.anl.gov/WIT2/
29Interaction Databases
- BIND
- http://www.bind.ca/
- DIP
- http://dip.doe-mbi.ucla.edu/
- PIM
- http://www.hybrigenics.fr/
- PathCalling
- http://portal.curagen.com/extpc/com.curagen.portal.servlet.Yeast
30Bibliographic Databases
- PubMed Medline
- http://www.ncbi.nlm.nih.gov/PubMed/
- Science Citation Index
- http://isi4.isiknowledge.com/portal.cgi
- Your Local eLibrary
- www.XXXX.ca
- Current Contents
- http://www.isinet.com/isi/
31Annotation by Homology: An Example
- 76 residue protein from Methanobacterium
thermoautotrophicum (newly sequenced)
- What does it do?
- MMKIQIYGTGCANCQMLEKNAREAVKELGIDAEFEKIKEMDQILEAGLTA
LPGLAVDGELKIMGRVASKEEIKKILS
32PSI-BLAST
Select Database
33PSI-BLAST
34PSI-BLAST
35PSI-BLAST
36Conclusions
- Protein is a thioredoxin or glutaredoxin
(function, family)
- Protein has thioredoxin fold (2° and 3D
structure)
- Active site is from residues 11-14 (active site
location)
- Protein is soluble, cytoplasmic (cellular
location)
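The active-site conclusion can be cross-checked directly on the query sequence. Thioredoxins and glutaredoxins typically carry a CxxC active-site motif, so a minimal sketch is to scan for that pattern. The regular expression below is a simplification chosen for illustration, not the pattern used by PSI-BLAST or any motif database.

```python
import re

# Query sequence from the example slide.
seq = ("MMKIQIYGTGCANCQMLEKNAREAVKELGIDAEFEKIKEMDQILEAGLTA"
       "LPGLAVDGELKIMGRVASKEEIKKILS")

# CxxC: two cysteines separated by any two residues.
match = re.search(r"C..C", seq)

# Convert Python's 0-based half-open span to 1-based residue numbers.
start, end = match.start() + 1, match.end()
print(f"CxxC motif {match.group()} at residues {start}-{end}")
# prints: CxxC motif CANC at residues 11-14
```

A pattern match alone is weak evidence; here it simply agrees with the homology-based result.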
37Annotation Methods
- Annotation by homology (BLAST)
- requires a large, well annotated database of
protein sequences
- Annotation by sequence composition
- simple statistical/mathematical methods
- Annotation by sequence features, profiles or
motifs
- requires sophisticated sequence analysis tools
38Annotation by Composition
- Molecular Weight
- Isoelectric Point
- UV Absorptivity
- Hydrophobicity
39Where To Go
40Molecular Weight
41Molecular Weight
- Useful for SDS PAGE and 2D gel analysis
- Useful for deciding on SEC matrix
- Useful for deciding on MW cutoff (MWCO) for dialysis
- Essential in synthetic peptide analysis
- Essential in peptide sequencing (classical or
mass-spectrometry based)
- Essential in proteomics and high throughput
protein characterization
42Molecular Weight
- Crude MW calculation: $MW = 110 \times \text{Numres}$
- Exact MW calculation: $MW = \sum_i AA_i \times MW_i$
- Remember to add 1 water (18.01 amu) after adding
all residues
- Note isotopic weights
- Corrections for CHO, PO4, Acetyl, CONH2
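Both calculations are easy to sketch in code. The table below uses standard average residue masses (amino acid minus one water); the function names are illustrative, and no corrections for glycosylation, phosphorylation, acetylation or amidation are applied.

```python
# Average residue masses in Da (free amino acid minus one water).
RESIDUE_MASS = {
    "G": 57.0519, "A": 71.0788, "S": 87.0782, "P": 97.1167,
    "V": 99.1326, "T": 101.1051, "C": 103.1388, "L": 113.1594,
    "I": 113.1594, "N": 114.1038, "D": 115.0886, "Q": 128.1307,
    "K": 128.1741, "E": 129.1155, "M": 131.1926, "H": 137.1411,
    "F": 147.1766, "R": 156.1875, "Y": 163.1760, "W": 186.2132,
}
WATER = 18.015  # one water is added back for the free N- and C-termini


def crude_mw(seq):
    """Crude estimate: 110 Da per residue."""
    return 110.0 * len(seq)


def exact_mw(seq):
    """Sum of average residue masses plus one water."""
    return sum(RESIDUE_MASS[aa] for aa in seq) + WATER


seq = ("MMKIQIYGTGCANCQMLEKNAREAVKELGIDAEFEKIKEMDQILEAGLTA"
       "LPGLAVDGELKIMGRVASKEEIKKILS")
print(f"crude: {crude_mw(seq):.0f} Da, exact: {exact_mw(seq):.1f} Da")
```

For mass spectrometry, monoisotopic residue masses would replace the average masses used here; that is the point of the note on isotopic weights.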
43Amino Acid versus Residue
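[Figure: a free amino acid (H2N-CHR-COOH) beside the corresponding residue (-NH-CHR-CO-) as it appears within a peptide chain. Labels: Amino Acid, Residue]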
44Protein Identification via MW
- MOWSE
- http://srs.hgmp.mrc.ac.uk/cgi-bin/mowse
- CombSearch
- http://ca.expasy.org/tools/CombSearch/
- Mascot
- http://www.matrixscience.com/search_form_select.html
- AACompSim/AACompIdent
- http://ca.expasy.org/tools/
45Molecular Weight and Proteomics
2-D Gel QTOF Mass Spectrometry
46Isoelectric Point
- The pH at which a protein has a net charge = 0
- $Q = \sum_i N_i / (1 + 10^{pH - pK_i})$
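A minimal sketch of a pI estimate follows. The formula above gives the charge contributed by basic groups; acidic groups contribute the mirror-image term with a negative sign, $-N_i / (1 + 10^{pK_i - pH})$. The pK values below are one illustrative set and are an assumption, since published scales differ. The pI is found by bisection on the net charge.

```python
# Illustrative pK values (an assumption; several published scales exist).
PK_POS = {"Nterm": 8.6, "K": 10.8, "R": 12.5, "H": 6.5}
PK_NEG = {"Cterm": 3.6, "D": 3.9, "E": 4.1, "C": 8.5, "Y": 10.1}


def net_charge(seq, ph):
    """Net charge Q of an unmodified linear peptide at a given pH."""
    pos = {"Nterm": 1, **{aa: seq.count(aa) for aa in "KRH"}}
    neg = {"Cterm": 1, **{aa: seq.count(aa) for aa in "DECY"}}
    q = sum(n / (1 + 10 ** (ph - PK_POS[g])) for g, n in pos.items())
    q -= sum(n / (1 + 10 ** (PK_NEG[g] - ph)) for g, n in neg.items())
    return q


def isoelectric_point(seq, lo=0.0, hi=14.0, tol=1e-4):
    """Bisection: net charge falls monotonically as pH rises."""
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if net_charge(seq, mid) > 0:
            lo = mid   # still positive, so the pI lies at higher pH
        else:
            hi = mid
    return (lo + hi) / 2


seq = ("MMKIQIYGTGCANCQMLEKNAREAVKELGIDAEFEKIKEMDQILEAGLTA"
       "LPGLAVDGELKIMGRVASKEEIKKILS")
print(f"estimated pI: {isoelectric_point(seq):.2f}")
```

The estimate ignores the local environment of each ionizable group, so it is best treated as a rough guide for gel or chromatography work.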
47UV Absorptivity
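- $OD_{280} = (5690 \times W + 1280 \times Y) / MW \times \text{Conc.}$
- $\text{Conc.} = OD_{280} \times MW / (5690 \times W + 1280 \times Y)$
[Figure residue: side-chain structures labelled OH and N, presumably tyrosine and tryptophan]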
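The two absorptivity formulas can be sketched as a pair of functions, with W and Y taken as the number of tryptophan and tyrosine residues. The units are an assumption, since the slide does not state them: a 1 cm path length, MW in g/mol and concentration in mg/mL. The function names are illustrative.

```python
def od280(seq, mw, conc):
    """Predicted absorbance at 280 nm from the Trp and Tyr content."""
    w, y = seq.count("W"), seq.count("Y")
    return (5690 * w + 1280 * y) / mw * conc


def conc_from_od280(seq, mw, od):
    """Invert the relation to estimate concentration from a reading."""
    eps = 5690 * seq.count("W") + 1280 * seq.count("Y")
    if eps == 0:
        raise ValueError("no Trp or Tyr: A280 cannot be used")
    return od * mw / eps


# Toy peptide with one Trp and one Tyr; the MW of 1000 is made up.
print(od280("AWGY", 1000.0, 1.0))
```

Proteins without tryptophan or tyrosine give no usable signal at 280 nm, which the second function reports as an error instead of dividing by zero.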
48Hydrophobicity
- Indicates Solubility
- Indicates Stability
- Indicates Location (membrane or cytoplasm)
- Indicates Globularity or tendency to form
spherical structure
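One common way to put a number on hydrophobicity is the average hydropathy of the sequence. The sketch below uses the Kyte-Doolittle scale, which is a choice made here for illustration; the slides do not name a scale.

```python
# Kyte-Doolittle hydropathy values (positive = hydrophobic).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def average_hydropathy(seq):
    """Mean Kyte-Doolittle value over all residues."""
    return sum(KYTE_DOOLITTLE[aa] for aa in seq) / len(seq)


seq = ("MMKIQIYGTGCANCQMLEKNAREAVKELGIDAEFEKIKEMDQILEAGLTA"
       "LPGLAVDGELKIMGRVASKEEIKKILS")
print(f"average hydropathy: {average_hydropathy(seq):.2f}")
```

Positive averages suggest a more hydrophobic, possibly membrane-associated protein, and negative averages a more soluble one. Averaging over a sliding window instead of the whole sequence is the usual way to look for candidate membrane-spanning segments.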
49Annotation Methods
- Annotation by homology (BLAST)
- requires a large, well annotated database of
protein sequences
- Annotation by sequence composition
- simple statistical/mathematical methods
- Annotation by sequence features, profiles or
motifs
- requires sophisticated sequence analysis tools
50Where To Go
51Sequence Feature Databases
- PROSITE - http://www.expasy.ch/
- BLOCKS - http://www.blocks.fhcrc.org/
- DOMO - http://www.infobiogen.fr/gracy/domo
- PFAM - http://pfam.wustl.edu
- PRINTS - http://www.biochem.ucl.ac.uk/bsm/dbrowser/PRINTS
- SEQSITE - PepTool
52What Can Be Predicted?
- O-Glycosylation Sites
- Phosphorylation Sites
- Protease Cut Sites
- Nuclear Targeting Sites
- Mitochondrial Targ Sites
- Chloroplast Targ Sites
- Signal Sequences
- Signal Sequence Cleav.
- Peroxisome Targ Sites
- ER Targeting Sites
- Transmembrane Sites
- Tyrosine Sulfation Sites
- GPInositol Anchor Sites
- PEST sites
- Coiled-Coil Sites
- T-Cell/MHC Epitopes
- Protein Lifetime
- A whole lot more.
53Cutting Edge Sequence Feature Servers
- Membrane Helix Prediction
- http://www.cbs.dtu.dk/services/TMHMM-2.0/
- T-Cell Epitope Prediction
- http://syfpeithi.bmi-heidelberg.com/scripts/MHCServer.dll/home.htm
- O-Glycosylation Prediction
- http://www.cbs.dtu.dk/services/NetOGlyc/
- Phosphorylation Prediction
- http://www.cbs.dtu.dk/services/NetPhos/
- Protein Localization Prediction
- http://psort.nibb.ac.jp/
54Secondary (2°) Structure Prediction
- PredictProtein-PHD (72%)
- http://cubic.bioc.columbia.edu/predictprotein
- Jpred (73-75%)
- http://jura.ebi.ac.uk:8888/
- PREDATOR (75%)
- http://www.embl-heidelberg.de/cgi/predator_serv.pl
- PSIpred (77%)
- http://bioinf.cs.ucl.ac.uk/psipred/
55Putting It All Together
Seq Motifs
Composition
Homology
56Putting It All Together
- PEDANT
- http://pedant.gsf.de/publ.html
- GeneQuiz
- http://jura.ebi.ac.uk:8765/ext-genequiz/
- Magpie
- http://magpie.ucalgary.ca/
- Proteome Analyst
- http://www.cs.ualberta.ca/bioinfo/PA/
58Programs Used By Pedant
- HMMER
- PSORT
- PREDATOR
- COILS
- FGENESH
- pI
- PROSEARCH
- TargetP
- SAPS
- NCBI-BLAST
- SEG
- InterProScan
- SignalP
- TMHMM
- tRNAscan-SE
- GENSCAN
59Databases Used By Pedant
- EMBL
- PIR-PSD
- SWISS-PROT
- Functional Cat
- PROSITE
- TrEMBL
- Blocks
- PDB
- SCOP
- COGs
- Pfam
- STRIDE
60MIPS Functional Categories
- Metabolism
- Energy
- Cell cycle and DNA processing
- Transcription
- Protein synthesis
- Protein fate (folding, modification, destination)
- Cellular transport and transport mechanisms
- Cellular communication/signal transduction
- Cell rescue, defense and virulence
- Organ differentiation
- Subcellular localisation
- Cell type localisation
61MIPS Functional Categories
- Regulation of and interaction with
cellular environment
- Cell fate
- Systematic regulation of interaction with
environment
- Development
- Transposable elements, viral and plasmid
proteins
- Control of cellular organization
- Cell type differentiation
- Tissue differentiation
- Organ differentiation
- Subcellular localisation
- Cell type localisation
62MIPS Functional Categories
- Tissue localisation
- Organ localisation
- Ubiquitous expression
- Protein activity regulation
- Protein with binding function or cofactor
requirement
- Storage protein
- Transport facilitation
- Out of use categories
- Classification not yet clear-cut
- Unclassified proteins
63http://magpie.ucalgary.ca/
67http://www.cs.ualberta.ca/bioinfo/PA/
68Proteome Analyst
- Uses PSI-BLAST, PSIPRED and motif analysis tools
- Extracts keyword information from homologues and
uses Naïve Bayes classifiers to infer function
- Combines sequence motif and sequence profile
information to complete functional classification
- Supports custom classifier/ontology
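The idea of inferring function from homologue keywords with a Naïve Bayes classifier can be sketched in a few lines. This is a toy multinomial model with add-one smoothing; the keywords, class labels and training examples are made up for illustration and are not Proteome Analyst's actual features or code.

```python
import math
from collections import Counter, defaultdict


def train(examples):
    """examples: list of (keyword_list, label) pairs."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for words, label in examples:
        class_counts[label] += 1
        word_counts[label].update(words)
        vocab.update(words)
    return class_counts, word_counts, vocab


def classify(model, words):
    """Return the label with the highest log posterior."""
    class_counts, word_counts, vocab = model
    total = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n in class_counts.items():
        # Add-one (Laplace) smoothing over the vocabulary.
        denom = sum(word_counts[label].values()) + len(vocab)
        score = math.log(n / total)
        for w in words:
            if w in vocab:  # unseen keywords are ignored
                score += math.log((word_counts[label][w] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical keywords pulled from annotated homologues.
model = train([
    (["thioredoxin", "disulfide", "redox"], "redox"),
    (["glutaredoxin", "redox", "electron"], "redox"),
    (["permease", "membrane", "transporter"], "transport"),
    (["ABC", "transporter", "membrane"], "transport"),
])
prediction = classify(model, ["redox", "thioredoxin"])
print(prediction)
```

In practice the keywords would come from the annotation of database hits, and the classes from a functional ontology.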
69Final Output
70Graphical Output
71BacMap
- Picking up where we left off with the CCDB
(Google "bacmap")
- Idea is to generate a visual atlas of all (not
just Escherichia coli) bacterial chromosomes and
plasmids but with links to extensive genome
annotation
- Attempt to re-use annotation and graphing tools
originally developed for the CCDB
72BacMap
http://wishart.biology.ualberta.ca/BacMap/
73BacMap
74Text Search Tools
75Sequence Search Tools
76Bacterial Biography Card
77Genome Statistics
78Proteome Statistics
79BacMap
- Each genome has a short description of the
organism and sequence data
- Supports zoomable, hyperlinked, clickable map
views of the genome
- Supports text search of gene names, protein names
and synonyms
- Supports BLAST search and supplies genome-wide
stats
- Currently going through major update
Stothard P, et al. BacMap: an interactive picture
atlas of annotated bacterial genomes. Nucleic
Acids Res. 2005 Jan 1;33(Database issue):D317-20.
80What if Your Organism or Genome isn't in BacMap?
http://wishart.biology.ualberta.ca/basys/
81BASys
- Bacterial Annotation System
- A publicly available web server that performs
automated annotation of bacterial genomes given
only the gene sequence of a chromosome or plasmid
- Takes about 24 hrs for an average genome (4
megabases)
- Output includes images and annotation text (about
70 fields for each gene)
82Typical BASys Result
83Conclusion
- Genome annotation is the same as proteome
annotation; it is required after any gene sequencing
and gene ID effort
- Can be done either manually or automatically
- Need for high throughput, automated pipelines
to keep up with the volume of genome sequence
data
- Area of active research and development with
about ½ of all bioinformaticians working on some
aspect of this process