Genome Annotation - PowerPoint PPT Presentation

Slides: 84
Provided by: Comp632

Transcript and Presenter's Notes

Title: Genome Annotation


1
Genome Annotation
  • Bioinformatics 301
  • David Wishart
  • david.wishart_at_ualberta.ca
  • Notes at http://redpoll.pharmacy.ualberta.ca

2
DNA Sequencing
Isolate Chromosome
Shear DNA into Fragments
Clone into Seq. Vectors
Sequence
3
Sequence Assembly
Assembled Sequence
Sequence Chromatogram
Send to Computer
4
Genome Sequence
>P12345 Yeast chromosome1
GATTACAGATTACAGATTACAGAT
TACAGATTACAG ATTACAGATTACAGATTACAGATTACAGATTACAGA
TTACAGATTACAGATTACAGATTACAGATTACAGAT TACAGATTAGAGA
TTACAGATTACAGATTACAGATT ACAGATTACAGATTACAGATTACAGA
TTACAGATTA CAGATTACAGATTACAGATTACAGATTACAGATTAC AG
ATTACAGATTACAGATTACAGATTACAGATTACA GATTACAGATTACAG
ATTACAGATTACAGATTACAG ATTACAGATTACAGATTACAGATTACAG
ATTACAGA TTACAGATTACAGATTACAGATTACAGATTACAGAT
5
Predict Genes
6
The Result
>P12346 Sequence 1
ATGTACAGATTACAGATTACAGATTACAGAT
TACAG ATTACAGATTACAGATTACAGATTACAGATTACAGA TTACAGA
TTACAGATTACAGATTACAGAT
>P12347 Sequence 2
ATGAGATTAGAGATTACAGATTACAGATTACAGATT ACAGATTACAG
ATTACAGATTACAGATTACAGATTA CAGATTACAGATTACAGATTACAG
ATTACAGATT
>P12348 Sequence 3
ATGTTACAGATTACAGATT
ACAGATTACAGATTACA GATTACAGATTACAGATTACAGATTACA...
7
Is This Annotated?
>P12346 Sequence 1
ATGTACAGATTACAGATTACAGATTACAGAT
TACAG ATTACAGATTACAGATTACAGATTACAGATTACAGA TTACAGA
TTACAGATTACAGATTACAGAT
>P12347 Sequence 2
ATGAGATTAGAGATTACAGATTACAGATTACAGATT ACAGATTACAG
ATTACAGATTACAGATTACAGATTA CAGATTACAGATTACAGATTACAG
ATTACAGATT
>P12348 Sequence 3
ATGTTACAGATTACAGATT
ACAGATTACAGATTACA GATTACAGATTACAGATTACAGATTACA...
8
How About This?
>P12346 Sequence 1
MEKGQASRTDHNMCLKPGAAERTPESTSPAS
DAAGG IPQNLKGFYQALNNWLKDSQLKPPPSSGTREWAALK LPNTHIA
LD
>P12347 Sequence 2
MKPQRTLNASELVISLIVESINTHISH
OUSEPLEAS EWILLITALLCEASE
>P12348 Sequence 3
MQWERTGHFDALKPQWERTYHEREISANTHERS...
9
Gene Annotation
  • Annotation to identify and describe all the
    physico-chemical, functional and structural
    properties of a gene including its DNA sequence,
    protein sequence, sequence corrections, name(s),
    position, function(s), abundance, location, mass,
    pI, absorptivity, solubility, active sites,
    binding sites, reactions, substrates, homologues,
    2° structure, 3D structure, domains, pathways,
    interacting partners

10
Gene Annotation
Protein Annotation
11
Protein/Gene vs. Proteome/Genome Annotation
  • Gene/Protein annotation is concerned with one or
    a small number (<50) of genes or proteins from one
    or several types of organisms
  • Genome/Proteome annotation is concerned with
    entire proteomes (>2000 proteins) from a specific
    organism (or for all organisms) - need for speed

12
Different Levels of Annotation
  • Sparse: typical of archival databanks like
    GenBank, usually just includes name, depositor,
    accession number, dates, ID
  • Moderate: typical of many curated protein
    sequence databanks (PIR, TrEMBL)
  • Detailed: not typical (occasionally found in
    organism-specific databases)

13
Different Levels of Database Annotation
  • GenBank (large number of sequences, minimal
    annotation)
  • PIR (large number of sequences, slightly better
    annotation)
  • SwissProt (small number of sequences, even better
    annotation)
  • Organism-specific DB (very small number of sequences,
    best annotation)

14
GenBank Annotation
15
PIR Annotation
16
Swiss-Prot Annotation
17
The CCDB
http://redpoll.pharmacy.ualberta.ca/CCDB
18
CCDB Annotation
19
CCDB Annotation
20
CCDB Contents
  • Functional info (predicted or known)
  • Sequence information (sites, modifications, pI,
    MW, cleavage)
  • Location information (in chromosome and cell)
  • Interacting partners (known and predicted)
  • Structure (2°, 3°, 4°, predicted)
  • Enzymatic rate and binding constants
  • Abundance, copy number, concentration
  • Links to other sites and viewing tools
  • Integrated version of all major DBs
  • 70 fields for each entry

21
Searching Capabilities
  • Text search, BLAST search, SQL search
  • Show all membrane proteins that are essential
    and have more than 6 membrane spanning regions
  • Chemical Structure search
  • Find all metabolites similar to this prospective
    drug structure
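
The membrane-protein query above could be expressed in SQL roughly as follows. This is only a sketch: the table name `proteins`, its columns (`name`, `location`, `essential`, `tm_regions`) and the demo rows are hypothetical stand-ins, not the actual CCDB schema or data.

```python
import sqlite3


def essential_multipass_membrane_proteins(conn, min_tm_regions=6):
    """Names of essential membrane proteins with MORE than
    `min_tm_regions` membrane-spanning regions (hypothetical schema)."""
    rows = conn.execute(
        """
        SELECT name FROM proteins
        WHERE location = 'membrane'
          AND essential = 1
          AND tm_regions > ?
        ORDER BY name
        """,
        (min_tm_regions,),
    ).fetchall()
    return [name for (name,) in rows]


def build_demo_db():
    """In-memory database with made-up placeholder rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE proteins "
        "(name TEXT, location TEXT, essential INTEGER, tm_regions INTEGER)"
    )
    conn.executemany(
        "INSERT INTO proteins VALUES (?, ?, ?, ?)",
        [
            ("demoA", "membrane", 1, 12),   # essential, 12 spans: selected
            ("demoB", "membrane", 0, 8),    # not essential
            ("demoC", "cytoplasm", 1, 0),   # not a membrane protein
            ("demoD", "membrane", 1, 6),    # exactly 6 spans: not "more than 6"
        ],
    )
    return conn


if __name__ == "__main__":
    print(essential_multipass_membrane_proteins(build_demo_db()))
```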

22
Ultimate Goal...
  • To achieve the same level of protein/proteome
    annotation as found in CCDB for all
    genes/proteins -- automatically

How?
23
Annotation Methods
  • Annotation by homology (BLAST)
  • requires a large, well annotated database of
    protein sequences
  • Annotation by sequence composition
  • simple statistical/mathematical methods
  • Annotation by sequence features, profiles or
    motifs
  • requires sophisticated sequence analysis tools

24
Annotation by Homology
  • Statistically significant sequence matches
    identified by BLAST searches against GenBank
    (nr), SWISS-PROT, PIR, ProDom, BLOCKS, KEGG, WIT,
    Brenda, BIND
  • Properties or annotation inferred by name,
    keywords, features, comments
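
The two bullets above can be sketched in code: keep only statistically significant hits, then copy the annotation of the best one. This sketch assumes hits arrive in the 12-column tabular layout produced by NCBI BLAST+ (`-outfmt 6`); the E-value cutoff of 1e-5 and the function names are illustrative assumptions, not values from the slides.

```python
def parse_tabular_hits(text):
    """Parse 12-column tab-separated BLAST hits:
    qseqid sseqid pident length mismatch gapopen
    qstart qend sstart send evalue bitscore."""
    hits = []
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines
        cols = line.split("\t")
        hits.append({
            "query": cols[0],
            "subject": cols[1],
            "identity": float(cols[2]),
            "evalue": float(cols[10]),
            "bitscore": float(cols[11]),
        })
    return hits


def annotate_by_homology(hits, subject_annotations, max_evalue=1e-5):
    """Return the annotation of the most significant annotated hit,
    or None if no hit passes the (assumed) E-value cutoff."""
    significant = [
        h for h in hits
        if h["evalue"] <= max_evalue and h["subject"] in subject_annotations
    ]
    if not significant:
        return None
    best = min(significant, key=lambda h: h["evalue"])
    return subject_annotations[best["subject"]]
```

In practice the inferred name, keywords and features should be treated as predictions, since annotation errors in the database propagate through this kind of transfer.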

Databases Are Key
25
Sequence Databases
  • GenBank
  • www.ncbi.nlm.nih.gov/
  • EMBL/trEMBL
  • www.ebi.ac.uk/trembl/
  • DDBJ
  • www.nig.ac.jp/
  • PIR
  • http://pir.georgetown.edu/
  • SwissProt
  • www.expasy.ch/sprot/

26
Structure Databases
  • RCSB-PDB
  • http://www.rcsb.org/pdb/
  • MSD
  • http://www.ebi.ac.uk/msd/index.html
  • CATH
  • www.biochem.ucl.ac.uk/bsm/cath/
  • SCOP
  • www.scop.mrc-lmb.cam.ac.uk/scop/

27
Expression Databases
  • Swiss 2D Page
  • http://ca.expasy.org/ch2d/
  • SMD
  • http://genome-www5.stanford.edu/MicroArray/SMD/
  • ArrayExpress
  • http://www.ebi.ac.uk/arrayexpress/
  • Gene Expr. Omnibus
  • http://www.ncbi.nlm.nih.gov/geo/

28
Metabolism Databases
  • KEGG
  • http://www.genome.ad.jp/kegg/metabolism.html
  • Roche/Boehringer
  • http://www.expasy.org/cgi-bin/search-biochem-index
  • EcoCyc
  • www.ecocyc.org/
  • WIT
  • http://wit.mcs.anl.gov/WIT2/

29
Interaction Databases
  • BIND
  • http://www.bind.ca/
  • DIP
  • http://dip.doe-mbi.ucla.edu/
  • PIM
  • http://www.hybrigenics.fr/
  • PathCalling
  • http://portal.curagen.com/extpc/com.curagen.portal.servlet.Yeast

30
Bibliographic Databases
  • PubMed Medline
  • http://www.ncbi.nlm.nih.gov/PubMed/
  • Science Citation Index
  • http://isi4.isiknowledge.com/portal.cgi
  • Your Local eLibrary
  • www.XXXX.ca
  • Current Contents
  • http://www.isinet.com/isi/

31
Annotation by Homology: An Example
  • 76 residue protein from Methanobacterium
    thermoautotrophicum (newly sequenced)
  • What does it do?
  • MMKIQIYGTGCANCQMLEKNAREAVKELGIDAEFEKIKEMDQILEAGLTA
    LPGLAVDGELKIMGRVASKEEIKKILS

32
PSI-BLAST
Select Database
33
PSI-BLAST
34
PSI-BLAST
35
PSI-BLAST
36
Conclusions
  • Protein is a thioredoxin or glutaredoxin
    (function, family)
  • Protein has thioredoxin fold (2° and 3D
    structure)
  • Active site is from residues 11-14 (active site
    location)
  • Protein is soluble, cytoplasmic (cellular
    location)
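
The active-site conclusion can be cross-checked directly on the query sequence. Thioredoxin- and glutaredoxin-like proteins typically carry a redox-active C-x-x-C motif, so a simple pattern search should land on residues 11-14. This is a sketch for illustration only; the slides reached the conclusion through PSI-BLAST, not through a regular expression, and the function name is hypothetical.

```python
import re

# Query sequence from the example slide.
SEQ = (
    "MMKIQIYGTGCANCQMLEKNAREAVKELGIDAEFEKIKEMDQILEAGLTA"
    "LPGLAVDGELKIMGRVASKEEIKKILS"
)


def find_cxxc(sequence):
    """Return 1-based (start, end) positions of every C-x-x-C motif.
    A lookahead is used so overlapping motifs are reported too."""
    return [
        (m.start() + 1, m.start() + 4)
        for m in re.finditer(r"(?=C..C)", sequence)
    ]


if __name__ == "__main__":
    for start, end in find_cxxc(SEQ):
        print(start, end, SEQ[start - 1:end])
```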

37
Annotation Methods
  • Annotation by homology (BLAST)
  • requires a large, well annotated database of
    protein sequences
  • Annotation by sequence composition
  • simple statistical/mathematical methods
  • Annotation by sequence features, profiles or
    motifs
  • requires sophisticated sequence analysis tools

38
Annotation by Composition
  • Molecular Weight
  • Isoelectric Point
  • UV Absorptivity
  • Hydrophobicity

39
Where To Go
40
Molecular Weight
41
Molecular Weight
  • Useful for SDS-PAGE and 2D gel analysis
  • Useful for deciding on SEC matrix
  • Useful for deciding on MW cutoff (MWCO) for dialysis
  • Essential in synthetic peptide analysis
  • Essential in peptide sequencing (classical or
    mass-spectrometry based)
  • Essential in proteomics and high throughput
    protein characterization

42
Molecular Weight
  • Crude MW calculation: $MW = 110 \times N_{res}$
    ($N_{res}$ = number of residues)
  • Exact MW calculation: $MW = \sum_i AA_i \times MW_i$
  • Remember to add 1 water (18.01 amu) after adding
    all residues
  • Note isotopic weights
  • Corrections for CHO, PO4, Acetyl, CONH2
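
Both calculations fit in a few lines of code. The sketch below uses standard average residue masses (amino acid minus one water) and adds a single water at the end, as the slide advises. It assumes an unmodified chain of the 20 standard amino acids: no glycosylation, phosphorylation, acetylation or amidation corrections are applied. Function names are illustrative.

```python
# Average residue masses in Da (free amino acid minus one water).
RESIDUE_MASS = {
    "G": 57.0519, "A": 71.0788, "S": 87.0782, "P": 97.1167,
    "V": 99.1326, "T": 101.1051, "C": 103.1388, "L": 113.1594,
    "I": 113.1594, "N": 114.1038, "D": 115.0886, "Q": 128.1307,
    "K": 128.1741, "E": 129.1155, "M": 131.1926, "H": 137.1411,
    "F": 147.1766, "R": 156.1875, "Y": 163.1760, "W": 186.2132,
}
WATER = 18.015  # average mass of H2O in Da


def crude_mw(sequence):
    """Rule of thumb: about 110 Da per residue."""
    return 110.0 * len(sequence)


def exact_mw(sequence):
    """Sum of average residue masses plus one water for the free
    N- and C-termini. Raises KeyError on non-standard residues."""
    return sum(RESIDUE_MASS[aa] for aa in sequence) + WATER


if __name__ == "__main__":
    seq = "MMKIQIYGTGCANCQMLEKNAREAVKELGIDAEFEKIKEMDQILEAGLTA"
    print(round(crude_mw(seq)), round(exact_mw(seq), 1))
```

For mass spectrometry work, monoisotopic residue masses would replace the average masses used here.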

43
Amino Acid versus Residue
[Diagram: a free amino acid (H2N-CHR-COOH) beside a residue (-NH-CHR-CO-)]
44
Protein Identification via MW
  • MOWSE
  • http://srs.hgmp.mrc.ac.uk/cgi-bin/mowse
  • CombSearch
  • http://ca.expasy.org/tools/CombSearch/
  • Mascot
  • http://www.matrixscience.com/search_form_select.html
  • AACompSim/AACompIdent
  • http://ca.expasy.org/tools/

45
Molecular Weight & Proteomics
2-D Gel QTOF Mass Spectrometry
46
Isoelectric Point
  • The pH at which a protein has a net charge = 0
  • $Q = \sum_i N_i / (1 + 10^{pH - pK_i})$
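
A pI estimate follows from the charge sum by searching for the pH at which the net charge crosses zero. The sketch below makes several assumptions beyond the slide. The sign convention is spelled out: basic groups contribute +N/(1+10^(pH-pK)) and acidic groups contribute -N/(1+10^(pK-pH)). The pK values are one illustrative set, and published scales differ. The zero is found by bisection.

```python
# Illustrative pK values (assumption; other published sets differ).
POSITIVE_PK = {"N_term": 8.6, "K": 10.8, "R": 12.5, "H": 6.5}
NEGATIVE_PK = {"C_term": 3.6, "D": 3.9, "E": 4.1, "C": 8.5, "Y": 10.1}


def net_charge(sequence, ph):
    """Net charge of an unmodified peptide at a given pH."""
    charge = 0.0
    for group, pk in POSITIVE_PK.items():
        n = 1 if group == "N_term" else sequence.count(group)
        charge += n / (1.0 + 10 ** (ph - pk))   # protonated fraction, +1 each
    for group, pk in NEGATIVE_PK.items():
        n = 1 if group == "C_term" else sequence.count(group)
        charge -= n / (1.0 + 10 ** (pk - ph))   # deprotonated fraction, -1 each
    return charge


def isoelectric_point(sequence, tol=1e-4):
    """Bisection on pH 0-14; net charge falls monotonically with pH."""
    lo, hi = 0.0, 14.0
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if net_charge(sequence, mid) > 0:
            lo = mid   # still positive: pI lies at higher pH
        else:
            hi = mid
    return (lo + hi) / 2.0
```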

47
UV Absorptivity
  • $OD_{280} = (5690 \times \#W + 1280 \times \#Y) / MW \times Conc.$
  • $Conc. = OD_{280} \times MW / (5690 \times \#W + 1280 \times \#Y)$
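
The two relations translate directly into code. The coefficients 5690 (per Trp) and 1280 (per Tyr) come from the slide. The sketch assumes they are molar extinction coefficients in M^-1 cm^-1, a 1 cm path length, and a concentration in mg/mL; function names are illustrative.

```python
def molar_extinction_280(sequence):
    """Estimated molar extinction at 280 nm from Trp and Tyr counts."""
    return 5690 * sequence.count("W") + 1280 * sequence.count("Y")


def concentration_mg_per_ml(od280, sequence, mw):
    """Conc. = OD280 x MW / (5690 x #W + 1280 x #Y), assuming a 1 cm path."""
    epsilon = molar_extinction_280(sequence)
    if epsilon == 0:
        raise ValueError("no Trp or Tyr: A280 cannot be used for this protein")
    return od280 * mw / epsilon
```

A protein without Trp or Tyr has essentially no estimated absorbance at 280 nm under this formula, which is why the function refuses to divide by zero instead of returning a number.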

48
Hydrophobicity
  • Indicates Solubility
  • Indicates Stability
  • Indicates Location (membrane or cytoplasm)
  • Indicates Globularity or tendency to form
    spherical structure
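
One common way to put a number on hydrophobicity is the average hydropathy of a sequence. The slide does not name a scale, so the Kyte-Doolittle scale used below is an assumption, as is the 19-residue window often used when looking for membrane-spanning stretches.

```python
# Kyte-Doolittle hydropathy values (positive = hydrophobic).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def gravy(sequence):
    """Grand average of hydropathy: mean scale value over all residues."""
    return sum(KYTE_DOOLITTLE[aa] for aa in sequence) / len(sequence)


def windowed_hydropathy(sequence, window=19):
    """Mean hydropathy of each full window along the sequence.
    Returns an empty list if the sequence is shorter than the window."""
    return [
        gravy(sequence[i:i + window])
        for i in range(len(sequence) - window + 1)
    ]
```

A strongly negative average suggests a soluble protein, while long windows with high positive averages hint at membrane-spanning segments.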

49
Annotation Methods
  • Annotation by homology (BLAST)
  • requires a large, well annotated database of
    protein sequences
  • Annotation by sequence composition
  • simple statistical/mathematical methods
  • Annotation by sequence features, profiles or
    motifs
  • requires sophisticated sequence analysis tools

50
Where To Go
51
Sequence Feature Databases
  • PROSITE - http://www.expasy.ch/
  • BLOCKS - http://www.blocks.fhcrc.org/
  • DOMO - http://www.infobiogen.fr/gracy/domo
  • PFAM - http://pfam.wustl.edu
  • PRINTS - http://www.biochem.ucl.ac.uk/bsm/dbrowser/PRINTS
  • SEQSITE - PepTool
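
Pattern databases such as PROSITE describe motifs in a compact syntax that maps closely onto regular expressions. The sketch below converts the common parts of that syntax (`x`, `[..]`, `{..}`, `(n)`, `(n,m)`, `<`, `>`) and scans a sequence. It is a simplified illustration, not the scanner used by any of the servers listed, and it ignores rarer syntax features. The example pattern `N-{P}-[ST]-{P}` is the PROSITE N-glycosylation site pattern.

```python
import re


def prosite_to_regex(pattern):
    """Translate a PROSITE-style pattern into a Python regex string."""
    regex = pattern.rstrip(".")                              # optional trailing period
    regex = regex.replace("-", "")                           # element separators
    regex = regex.replace("x", ".")                          # x = any residue
    regex = regex.replace("{", "[^").replace("}", "]")       # {P} = anything but P
    regex = regex.replace("(", "{").replace(")", "}")        # (2) / (2,4) = repeats
    regex = regex.replace("<", "^").replace(">", "$")        # terminus anchors
    return regex


def scan(pattern, sequence):
    """Return (1-based start, matched text) for every match,
    including overlapping ones."""
    regex = "(?=(" + prosite_to_regex(pattern) + "))"
    return [(m.start() + 1, m.group(1)) for m in re.finditer(regex, sequence)]


if __name__ == "__main__":
    print(prosite_to_regex("N-{P}-[ST]-{P}."))
    print(scan("C-x(2)-C", "GTGCANCQ"))
```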

52
What Can Be Predicted?
  • O-Glycosylation Sites
  • Phosphorylation Sites
  • Protease Cut Sites
  • Nuclear Targeting Sites
  • Mitochondrial Targ Sites
  • Chloroplast Targ Sites
  • Signal Sequences
  • Signal Sequence Cleav.
  • Peroxisome Targ Sites
  • ER Targeting Sites
  • Transmembrane Sites
  • Tyrosine Sulfation Sites
  • GPInositol Anchor Sites
  • PEST sites
  • Coiled-Coil Sites
  • T-Cell/MHC Epitopes
  • Protein Lifetime
  • A whole lot more.

53
Cutting Edge Sequence Feature Servers
  • Membrane Helix Prediction
  • http//www.cbs.dtu.dk/services/TMHMM-2.0/
  • T-Cell Epitope Prediction
  • http//syfpeithi.bmi-heidelberg.com/scripts/MHCSer
    ver.dll/home.htm
  • O-Glycosylation Prediction
  • http//www.cbs.dtu.dk/services/NetOGlyc/
  • Phosphorylation Prediction
  • http//www.cbs.dtu.dk/services/NetPhos/
  • Protein Localization Prediction
  • http//psort.nibb.ac.jp/

54
2° Structure Prediction
  • PredictProtein-PHD (72%)
  • http://cubic.bioc.columbia.edu/predictprotein
  • Jpred (73-75%)
  • http://jura.ebi.ac.uk:8888/
  • PREDATOR (75%)
  • http://www.embl-heidelberg.de/cgi/predator_serv.pl
  • PSIpred (77%)
  • http://bioinf.cs.ucl.ac.uk/psipred/

55
Putting It All Together
Seq Motifs
Composition
Homology
56
Putting It All Together
  • PEDANT
  • http://pedant.gsf.de/publ.html
  • GeneQuiz
  • http://jura.ebi.ac.uk:8765/ext-genequiz/
  • Magpie
  • http://magpie.ucalgary.ca/
  • Proteome Analyst
  • http://www.cs.ualberta.ca/bioinfo/PA/

58
Programs Used By Pedant
  • HMMER
  • PSORT
  • PREDATOR
  • COILS
  • FGENESH
  • pI
  • PROSEARCH
  • TargetP
  • SAPS
  • NCBI-BLAST
  • SEG
  • InterProScan
  • SignalP
  • TMHMM
  • tRNAscan-SE
  • GENSCAN

59
Databases Used By Pedant
  • EMBL
  • PIR-PSD
  • SWISS-PROT
  • Functional Cat
  • PROSITE
  • TrEMBL
  • Blocks
  • PDB
  • SCOP
  • COGs
  • Pfam
  • STRIDE

60
MIPS Functional Categories
  • Metabolism
  • Energy
  • Cell cycle and DNA processing
  • Transcription
  • Protein synthesis
  • Protein fate (folding, modification, destination)
  • Cellular transport and transport mechanisms
  • Cellular communication/signal
    transduction
  • Cell rescue, defense and virulence
  • Organ differentiation
  • Subcellular localisation
  • Cell type localisation

61
MIPS Functional Categories
  • Regulation of and interaction with
    cellular environment
  • Cell fate
  • Systemic regulation of/interaction with
    environment
  • Development
  • Transposable elements, viral and plasmid
    proteins
  • Control of cellular organization
  • Cell type differentiation
  • Tissue differentiation
  • Organ differentiation
  • Subcellular localisation
  • Cell type localisation

62
MIPS Functional Categories
  • Tissue localisation
  • Organ localisation
  • Ubiquitous expression
  • Protein activity regulation
  • Protein with binding function or cofactor
    requirement
  • Storage protein
  • Transport facilitation
  • Out of use categories
  • Classification not yet clear-cut
  • Unclassified proteins

63
http://magpie.ucalgary.ca/
67
http://www.cs.ualberta.ca/bioinfo/PA/
68
Proteome Analyst
  • Uses PSI-BLAST, PSI-PRED and motif analysis tools
  • Extracts keyword information from homologues and
    uses Naïve Bayes classifiers to infer function
  • Combines sequence motif and sequence profile
    information to complete functional classification
  • Supports custom classifier/ontology
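
The keyword-based step can be illustrated with a very small multinomial Naïve Bayes classifier. This is a toy sketch and not Proteome Analyst's implementation: the training keywords and class labels below are made up for illustration, and Laplace (add-one) smoothing is an assumed design choice.

```python
import math
from collections import Counter, defaultdict

# Made-up toy training set: (keywords from homologues, functional class).
TOY_EXAMPLES = [
    (["kinase", "phosphorylation", "atp"], "signal transduction"),
    (["receptor", "kinase", "membrane"], "signal transduction"),
    (["ribosomal", "translation", "rrna"], "protein synthesis"),
    (["elongation", "translation", "trna"], "protein synthesis"),
]


def train_naive_bayes(examples):
    """Count classes and per-class keyword occurrences."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for keywords, label in examples:
        class_counts[label] += 1
        word_counts[label].update(keywords)
        vocab.update(keywords)
    return class_counts, word_counts, vocab


def classify(model, keywords):
    """Return the class with the highest log posterior.
    Keywords never seen in training are ignored."""
    class_counts, word_counts, vocab = model
    total = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label in sorted(class_counts):
        score = math.log(class_counts[label] / total)        # log prior
        n_words = sum(word_counts[label].values())
        for word in keywords:
            if word not in vocab:
                continue
            # Laplace-smoothed log likelihood of the keyword given the class.
            score += math.log(
                (word_counts[label][word] + 1) / (n_words + len(vocab))
            )
        if score > best_score:
            best_label, best_score = label, score
    return best_label
```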

69
Final Output
70
Graphical Output
71
BacMap
  • Picking up where we left off with the CCDB
    (Google bacmap)
  • Idea is to generate a visual atlas of all (not
    just Escherichia coli) bacterial chromosomes and
    plasmids but with links to extensive genome
    annotation
  • Attempt to re-use annotation and graphing tools
    originally developed for the CCDB

72
BacMap
http://wishart.biology.ualberta.ca/BacMap/
73
BacMap
74
Text Search Tools
75
Sequence Search Tools
76
Bacterial Biography Card
77
Genome Statistics
78
Proteome Statistics
79
BacMap
  • Each genome has a short description of the
    organism and sequence data
  • Supports zoomable, hyperlinked, clickable map
    views of the genome
  • Supports text search of gene names, protein names
    and synonyms
  • Supports BLAST search and supplies genome-wide
    stats
  • Currently going through major update

Stothard P, et al. BacMap: an interactive picture
atlas of annotated bacterial genomes. Nucleic
Acids Res. 2005 Jan 1;33(Database issue):D317-20.
80
What if Your Organism or Genome isn't in BacMap?
http://wishart.biology.ualberta.ca/basys/
81
BASys
  • Bacterial Annotation System
  • A publicly available web server that performs
    automated annotation of bacterial genomes given
    only the gene sequence of a chromosome or plasmid
  • Takes about 24 hrs for an average genome (4
    megabases)
  • Output includes images and annotation text (about
    70 fields for each gene)

82
Typical BASys Result
83
Conclusion
  • Genome annotation is the same as proteome
    annotation; it is required after any gene
    sequencing and gene ID effort
  • Can be done either manually or automatically
  • Need for high throughput, automated pipelines
    to keep up with the volume of genome sequence
    data
  • Area of active research and development with
    about ½ of all bioinformaticians working on some
    aspect of this process