Title: GENOME ANNOTATION AND FUNCTIONAL GENOMICS The protein sequence perspective
1GENOME ANNOTATION AND FUNCTIONAL GENOMICS: The protein sequence perspective
2GENOME ANNOTATION
- Two main levels
- STRUCTURAL ANNOTATION: Finding genes and other biologically relevant sites, thus building up a model of the genome as objects with specific locations
- FUNCTIONAL ANNOTATION: Objects are used in database searches (and experiments); the aim is attributing biologically relevant information to the whole sequence and individual objects
3WHY PROTEIN RATHER THAN DNA?
- Larger alphabet, so more sensitive comparisons
- Protein sequences give a higher signal-to-noise ratio
- Less redundancy and no frameshifts
- Each aa has different properties like size, charge etc.
- Closer to biological function
- 3D structure of similar proteins may be known
- Evolutionary relationships more evident
- Availability of good, well annotated protein
sequence and pattern databases
4Large-scale genome analysis projects
- Rate-limiting step is annotation
- Whole genome availability provides context information
- Main goal is to bridge the gap between genotype and phenotype
5Definitions of Annotation
- Addition of as much reliable and up-to-date information as possible to describe a sequence
- Identification, structural description, characterisation of putative protein products and other features in primary genomic sequence
- Information attached to genomic coordinates with start and end point, can occur at different levels
- Interpreting raw sequence data into useful biological information
6ANNOTATION/FUNCTION CAN BE MAPPED TO DIFFERENT
LEVELS
- ORGANISM - phenotypic function (morphology, physiology, behaviour, environmental response); NB: context
- CELLULAR - metabolic pathway, signal cascades, cellular localisation; context dependent
- MOLECULAR - binding sites, catalytic activity, PTM, 3D structure
- DOMAIN
- SINGLE RESIDUE
7Annotation is the description of
- Function(s) of the protein
- Post-translational modification(s)
- Domains and sites
- Secondary structure
- Quaternary structure
- Similarities to other proteins
- Disease(s) associated with deficiency(ies) in the protein
- Sequence conflicts, variants, etc.
8Additional information for proteins
- ALTERNATIVE PRODUCTS
- CATALYTIC ACTIVITY
- COFACTOR
- DEVELOPMENTAL STAGE
- DISEASE
- DOMAIN
- ENZYME REGULATION
- FUNCTION
- INDUCTION
- PATHWAY
- PHARMACEUTICALS
- POLYMORPHISM
- PTM
- SIMILARITY
- SUBCELLULAR LOCATION
- SUBUNIT
- TISSUE SPECIFICITY
9Amino-acid sites are
- Post-translational modification of a residue
- Covalent binding of a lipidic moiety
- Disulfide bond
- Thiolester bond
- Thioether bond
- Glycosylation site
- Binding site for a metal ion
- Binding site for any chemical group (co-enzyme,
prosthetic group, etc.)
10Regions
- SIGNAL SEQUENCE
- TRANSIT PEPTIDE
- PROPEPTIDE
- CHAIN
- PEPTIDE
- DOMAIN
- ACTIVE SITE
- DNA BIND SITE
- METAL BIND SITE
- MOLECULE BIND SITE
- TRANSMEMBRANE
11Annotation sources
- publications that report new sequence data
- review articles to periodically update the annotation of families or groups of proteins
- external experts
- protein sequence analysis
12Approaches to functional annotation
- Automatic annotation (sequence homology, rules, transfer info from PDB)
- Automatic classification (pattern databases, clustering, structure)
- Automatic characterisation (functional databases)
- Context information (comparative genome analysis, metabolic pathway databases)
- Experimental results (2D gels, microarrays)
- Full manual annotation (SWISS-PROT style)
13PROTEIN SEQUENCE ANALYSIS
- Protein sequence can come from gene predictions, literature or peptide sequencing
- Analysis on different levels
  - molecular
  - cellular
  - organism
- Simplest case: match for whole sequence in database, giving determination of structure and function
- In between: partial matches across sequence to diverse or hypothetical proteins
- Difficult case: no match, have to derive information from amino acid properties, pattern searches etc.
14From sequence to function
15Predicting function from sequence similarity
- Orthologues: arose from speciation, same gene in different organisms; can have <30% sequence identity
- Paralogues: from duplication within a genome, second copy may have new or changed function
- (difficult to distinguish between ortho- and paralogues unless whole genome is available)
- Equivalog: proteins with equivalent functions
- Analog: proteins catalyzing same reaction but not structurally related
- Some enzymes may have seq similarity simply because of a common catalytic site, substrate or pathway.
16TYPES OF HOMOLOGY
(Figure: tree of a PROTEIN/DOMAIN superfamily with branches labelled A, B, B1 and B2.)
- Duplication within species: paralogues may have different functions
- Speciation: orthologues may have different functions; if same, they are equivalogs
17Sequence homology in genomes
- When you do a whole genome BLAST search there is
a general pattern of results
Maverick genes tend to diverge more frequently
than core genes
18Using homology information for automatic
annotation- automatic annotation of TrEMBL as an
example
19Requirements for automatic annotation
- Well-annotated reference database (e.g. SWISS-PROT)
- Highly reliable diagnostic protein family signature database with the means to assign proteins to groups (e.g. CDD, InterPro, iProClass)
- A RuleBase to store and manage the annotation rules, their sources and their usage
20Direct Transfer
- Search target
- Transfer annotation to target database
- Example: FASTA against sequence database and transfer of DE line of best hit
(Diagram: XDB, Target)
21Multiple Sources
- Usually more than one external database is used
- Combine the different results
(Diagram: XDB, Target)
22Conflicts
- Contradiction
- Inconsistencies
- Synonyms
- Redundancy
23Translation
- Use a translator to map XDB language to target
language
(Diagram: XDB, Target)
24Translation Examples
- ENZYME -> TrEMBL: CA L-ALANINE = D-ALANINE becomes CC -!- CATALYTIC ACTIVITY: L-ALANINE = D-ALANINE.
- PROSITE -> TrEMBL: /SITE=3,heme_iron becomes FT METAL IRON
- Pfam -> TrEMBL: FT DOMAIN zf_C3HC4 becomes FT ZN_FING C3HC4-TYPE
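The three translations above can be sketched as small functions that map one line of XDB language to one line of target language. This is only an illustrative sketch: the function names and the two lookup tables are assumptions made here, not part of any real TrEMBL tooling, and feature positions are left out of the FT lines for brevity.

```python
import re

# Hypothetical lookup tables holding only the two mappings shown on the slide.
PROSITE_SITE_TO_FT = {"heme_iron": ("METAL", "IRON")}
PFAM_DOMAIN_TO_FT = {"zf_C3HC4": ("ZN_FING", "C3HC4-TYPE")}


def translate_enzyme_ca(line):
    """Turn an ENZYME 'CA' line into a TrEMBL catalytic-activity comment."""
    reaction = line[2:].strip().rstrip(".")
    return f"CC -!- CATALYTIC ACTIVITY: {reaction.upper()}."


def translate_prosite_site(line):
    """Turn a PROSITE '/SITE=n,name' qualifier into an FT line, or None."""
    match = re.fullmatch(r"/SITE=(\d+),(\w+)", line.strip())
    if match is None or match.group(2) not in PROSITE_SITE_TO_FT:
        return None  # no translation known: leave it for a curator
    key, description = PROSITE_SITE_TO_FT[match.group(2)]
    return f"FT {key} {description}"


def translate_pfam_domain(name):
    """Turn a Pfam domain name into a more specific FT line, or None."""
    if name not in PFAM_DOMAIN_TO_FT:
        return None
    key, description = PFAM_DOMAIN_TO_FT[name]
    return f"FT {key} {description}"


if __name__ == "__main__":
    print(translate_enzyme_ca("CA   L-alanine = D-alanine."))
    print(translate_prosite_site("/SITE=3,heme_iron"))
    print(translate_pfam_domain("zf_C3HC4"))
```

Returning None for an unknown term, rather than guessing, keeps untranslatable XDB vocabulary out of the target database.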
25Demands on a system for automated data analysis
and annotation
- Correctness
- Scalability
- Updateable
- Low level of redundant information
- Completeness
- Standardized vocabulary
26What do we have?
- SWISS-PROT
- RuleBase
- TrEMBL
- PROSITE (and Pfam, PRINTS, ProDom, SMART, Blocks etc.)
- SWISS-PROT/TrEMBL/RuleBase in Oracle
27Standardized transfer of annotation from
characterized proteins in SWISS-PROT to TrEMBL
entries
- TrEMBL entry is reliably recognized by a given method as a member of a certain group of proteins
- corresponding group of proteins in SWISS-PROT shares certain annotation
- common annotation is transferred to the TrEMBL entry and flagged as annotated by similarity
28Automatic annotation information flow
- Get information necessary to assign proteins to groups, e.g. using InterPro or other biological or family information; store in RuleBase
- Group proteins in SWISS-PROT by these conditions
- Extract common annotation shared by all these proteins; store in RuleBase
- Group unannotated sequences by the conditions
- Transfer common annotation flagged with evidence tags
- Note: can add taxonomic constraints
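The flow above can be sketched end to end on toy data. Everything here is an assumption for illustration: entries are plain dictionaries, the function names are invented, and the `{EVIDENCE: ...}` suffix is a made-up tag format rather than the real SWISS-PROT/TrEMBL evidence-tag syntax.

```python
def build_rule(condition, swissprot):
    """Group SWISS-PROT entries by one condition; keep annotation shared by all."""
    group = [e for e in swissprot if condition in e["signatures"]]
    if not group:
        return None
    common = set.intersection(*(set(e["annotation"]) for e in group))
    return {"condition": condition, "annotation": sorted(common)}


def apply_rule(rule, trembl, rule_id, taxon=None):
    """Add the rule's annotation to matching TrEMBL entries, tagged with its source."""
    for entry in trembl:
        if rule["condition"] not in entry["signatures"]:
            continue
        if taxon is not None and taxon not in entry["lineage"]:
            continue  # optional taxonomic constraint
        for line in rule["annotation"]:
            tagged = f"{line} {{EVIDENCE: {rule_id}}}"  # hypothetical tag format
            if tagged not in entry["annotation"]:
                entry["annotation"].append(tagged)
    return trembl


if __name__ == "__main__":
    # Toy entries with invented identifiers.
    swissprot = [
        {"signatures": {"PS00449"}, "annotation": ["KW Transmembrane", "KW CF(0)", "GN atpB"]},
        {"signatures": {"PS00449"}, "annotation": ["KW Transmembrane", "KW CF(0)"]},
    ]
    trembl = [{"signatures": {"PS00449"}, "lineage": ["Bacteria"], "annotation": []}]
    rule = build_rule("PS00449", swissprot)
    print(apply_rule(rule, trembl, "RU000482"))
```

Requiring annotation to be shared by every entry in the group is the strictest possible choice; a real system might accept a line seen in most entries instead.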
29Extract Reference Entries
- Use XDB to extract entries from standard database
- Example: Pfam PF00509 Hemagglutinin
HEMA_IAVI7/P03435
HEMA_IANT6/P03436
HEMA_IAAIC/P03437
HEMA_IAX31/P03438
HEMA_IAME2/P03439
HEMA_IAEN7/P03440
HEMA_IABAN/P03441
HEMA_IADU3/P03442
HEMA_IADA1/P03443
HEMA_IADMA/P03444
HEMA_IADM1/P03445
HEMA_IADA2/P03446
HEMA_IASH5/P03447
(Diagram: Pfam, TrEMBL, SWISS-PROT)
30Extract Common Annotation
- 132 entries read; each line below is preceded by the number of entries that share it
131 ID HEMA_XXXXX
125 DE HEMAGGLUTININ PRECURSOR.
6 DE HEMAGGLUTININ.
131 GN HA
130 CC -!- FUNCTION: HEMAGGLUTININ IS RESPONSIBLE FOR ATTACHING THE
130 CC VIRUS TO CELL RECEPTORS AND FOR INITIATING INFECTION.
125 CC -!- SUBUNIT: HOMOTRIMER. EACH OF THE MONOMER IS FORMED BY TWO
125 CC CHAINS (HA1 AND HA2) LINKED BY A DISULFIDE BOND.
75 DR HSSP P03437 1HGD.
31 DR HSSP P03437 1DLH.
131 KW HEMAGGLUTININ GLYCOPROTEIN ENVELOPE PROTEIN
102 KW SIGNAL
1 KW COAT PROTEIN POLYPROTEIN 3D-STRUCTURE
130 FT CHAIN HA1 CHAIN.
107 FT CHAIN HA2 CHAIN.
102 FT SIGNAL
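A tally like the one above can be produced by counting, for each annotation line, how many entries of the group contain it. The sketch below assumes each entry is simply a list of annotation lines; the function names and the 0.9 cut-off are assumptions, since the slides do not say what fraction of entries must share a line before it counts as common.

```python
from collections import Counter


def annotation_counts(entries):
    """Count in how many entries each annotation line occurs."""
    counts = Counter()
    for lines in entries:
        counts.update(set(lines))  # count a line at most once per entry
    return counts


def common_annotation(entries, min_fraction=0.9):
    """Return the lines present in at least min_fraction of the entries."""
    counts = annotation_counts(entries)
    cutoff = min_fraction * len(entries)
    return sorted(line for line, n in counts.items() if n >= cutoff)


if __name__ == "__main__":
    # Toy group of four entries using lines from the hemagglutinin example.
    group = [
        ["GN HA", "KW SIGNAL"],
        ["GN HA"],
        ["GN HA", "KW SIGNAL"],
        ["GN HA"],
    ]
    for line, n in annotation_counts(group).most_common():
        print(n, line)
    print(common_annotation(group))
```

Lowering `min_fraction` trades fewer missed annotations for more over-prediction, which is the compromise the closing slide describes.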
31Store Common Annotation
- Store the used conditions and the extracted
common annotation in a separate database
(Diagram: XDB, TrEMBL, SWISS-PROT, RuleBase)
32RULES
- Rules describe
  - the content of the annotation to be transferred (ACTIONS),
  - the CONDITIONS which the target TrEMBL entry must fulfill in order to allow transfer of the annotation.
- Rules uniquely describe or delineate a set of SWISS-PROT entries.
- The common annotation in these entries is transferred to TrEMBL.
33RU000482
//
RULE RU000482
DATE 2001-01-11
USER OPSWFL
PACK PROSITE
?PSAC PS00449
?EMOT PS00449
!ECNO 3.6.1.34
!SPDE ATP synthase A chain
!CCFU KEY COMPONENT OF THE PROTON CHANNEL; IT MAY PLAY A DIRECT ROLE IN THE TRANSLOCATION OF PROTONS ACROSS THE MEMBRANE (BY SIMILARITY)
!CCSU F-TYPE ATPASES HAVE 2 COMPONENTS, CF(1) - THE CATALYTIC CORE - AND CF(0) - THE MEMBRANE PROTON CHANNEL. CF(1) HAS FIVE SUBUNITS: ALPHA(3), BETA(3), GAMMA(1), DELTA(1), EPSILON(1). CF(0) HAS THREE MAIN SUBUNITS: A, B AND C (BY SIMILARITY)
!CCLO INTEGRAL MEMBRANE PROTEIN (By Similarity)
!CCSI TO THE ATPASE A CHAIN FAMILY
!SPKW CF(0)
!SPKW Hydrogen ion transport
!SPKW Transmembrane
//
Lines starting with ? are CONDITIONS; lines starting with ! are ACTIONS.
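A rule in this layout can be split into its parts mechanically. The sketch below assumes the rule text has one field per line, with `?` marking a condition and `!` marking an action; the parser and its dictionary layout are illustrative only and are not the RuleBase implementation.

```python
def parse_rule(text):
    """Split a rule into header fields, conditions ('?') and actions ('!')."""
    rule = {"header": {}, "conditions": [], "actions": []}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line == "//":
            continue  # skip blank lines and record delimiters
        code, _, value = line.partition(" ")
        value = value.strip()
        if code.startswith("?"):
            rule["conditions"].append((code[1:], value))
        elif code.startswith("!"):
            rule["actions"].append((code[1:], value))
        else:
            rule["header"][code] = value
    return rule


if __name__ == "__main__":
    example = """//
RULE RU000482
PACK PROSITE
?PSAC PS00449
!ECNO 3.6.1.34
!SPKW Hydrogen ion transport
//"""
    parsed = parse_rule(example)
    print(parsed["header"])
    print(parsed["conditions"])
    print(parsed["actions"])
```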
34Add Annotation to Target
- Use conditions to extract entries from TrEMBL
- Add common annotation to the entries
(Diagram: XDB, TrEMBL, SWISS-PROT, RuleBase)
35Automatic annotation using multiple dbs
- Extract conditions from XDB
- Group SWISS-PROT by conditions
- Extract common annotation
- Group TrEMBL by conditions
- Add common annotation to TrEMBL
(Diagram: ENZYME, Pfam, INTERPRO, PROSITE, TrEMBL, SWISS-PROT, RuleBase)
36Using tree structure of InterPro
37RU000652 with additional condition connected by
AND
//
RULE RU000652
DATE 2001-01-11
USER OPSWFL
PACK PROSITE
?IPRO IPR002379
?PSAC PS00605
?EMOT PS00605
!SPDE ATP synthase C chain (Lipid-binding protein) (Subunit C)
!ECNO 3.6.1.34
!CCSU F-TYPE ATPASES HAVE 2 COMPONENTS, CF(1) - THE CATALYTIC CORE - AND CF(0) - THE MEMBRANE PROTON CHANNEL. CF(1) HAS FIVE SUBUNITS: ALPHA(3), BETA(3), GAMMA(1), DELTA(1), EPSILON(1). CF(0) HAS THREE MAIN SUBUNITS: A, B AND C (By Similarity)
!CCSI TO THE ATPASE C CHAIN FAMILY
!SPKW CF(0)
!SPKW Hydrogen ion transport
!SPKW Lipid-binding
!SPKW Transmembrane
//
Additional condition (parent signature): ?IPRO IPR002379
38Condition types
- Signature hits
  - PROSITE, PRINTS, Pfam, ProDom
- Taxonomy
  - Broad groups like
    - Archaea
    - Bacteriophage
    - Eukaryota
    - Prokaryota
    - Eukaryotic viruses
  - more specific such as species
- Organelle
- Conditions
- Negated conditions
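These condition types can be checked against an entry with a single function that ANDs them together and supports negation. This is a sketch under assumptions: the `(kind, value, negated)` triples and the entry dictionary are invented here for illustration and do not reflect how RuleBase stores conditions.

```python
def matches(entry, conditions):
    """Return True if the entry satisfies every condition (logical AND).

    Each condition is a (kind, value, negated) triple; a negated condition
    holds when the underlying test fails.
    """
    for kind, value, negated in conditions:
        if kind == "signature":
            hit = value in entry["signatures"]
        elif kind == "taxonomy":
            hit = value in entry["lineage"]
        elif kind == "organelle":
            hit = entry.get("organelle") == value
        else:
            raise ValueError(f"unknown condition type: {kind}")
        if hit == negated:
            return False
    return True


if __name__ == "__main__":
    # Toy entry; the signature accessions are the ones used in RU000652.
    entry = {
        "signatures": {"PS00605", "IPR002379"},
        "lineage": ["Eukaryota"],
        "organelle": "Chloroplast",
    }
    rule_conditions = [
        ("signature", "IPR002379", False),
        ("signature", "PS00605", False),
        ("taxonomy", "Archaea", True),  # negated: must NOT be archaeal
    ]
    print(matches(entry, rule_conditions))
```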
39Rule-building
- Grouping and extraction of common annotation
  - semi-automated but involves manual data-mining assisted by Perl/shell scripts.
- Algorithmic data-mining
  - fully automated.
  - fast.
  - exhaustive exploration of condition-set/annotation search-space.
  - non-biological; validity of rules being assessed by comparison with semi-manual approach.
40Advantages of this method
- Uses reliable ref database, prevents propagation of incorrect annotation
- Using common annotation of multiple entries gives lower over-prediction than from best hit of BLAST
- Can standardize annotation and nomenclature of target sequences, since reference is standardized
- Can have different levels of common annotation from different levels of family hierarchy
- Independent of multi-domain organisation
- Evidence tags allow for easy tracking and updating
41Pitfalls of automatic functional analysis
- Multifunctional proteins: genome projects often assign single function, info is lost in homology search
- Hypothetical proteins (40% of ORFs unknown), and poorly or even wrongly annotated proteins
- No coverage of position-specific annotation, e.g. active sites
- Current methods provide only a phrase describing some properties of the unknown protein
- It is important to have evidence for all annotation added
42EVIDENCE TAGS
44Predicting function from non-homology
- Look at position of genes relative to others, compare with other organisms
- Can still build up rules from annotated sequences using information you have on other features like fold, physical properties etc.
- Use physical properties and known attributes
45Protein functions from regions
- Active sites- short, highly conserved regions
- Loops- charged residues and variable sequence
- Interior of protein- conservation of charged
amino acids
46Protein functions from specific residues
- Polar (C,D,E,H,K,N,Q,R,S,T) - active sites
- Aromatic (F,H,W,Y) - protein ligand-binding sites
- Zn-coord (C,D,E,H,N,Q) - active site, zinc finger
- Ca2+-coord (D,E,N,Q) - ligand-binding site
- Mg/Mn-coord (D,E,N,S,R,T) - Mg2+ or Mn2+ catalysis, ligand binding
- Ph-bind (H,K,R,S,T) - phosphate and sulphate binding
- C - disulphide-rich, metallothionein, zinc fingers
- DE - acidic proteins (unknown)
- G - collagens
- H - histidine-rich glycoprotein
- KR - nuclear proteins, nuclear localisation
- P - collagen, filaments
- SR - RNA binding motifs
- ST - mucins
47Supplement annotation with Xrefs to other
databases
- DDBJ/EMBL/GenBank Nucleotide Sequence Database
- PDB
- Genomic databases (FlyBase, MGD, SGD)
- 2D-Gel databases (ECO2DBASE, SWISS-2DPAGE, Aarhus/Ghent, YEPD, Harefield), Gene expression data
- Specialized collections (OMIM, InterPro, PROSITE, PRINTS, Pfam, ProDom, SMART, ENZYME, GPCRDB, Transfac, HSSP)
48Approaches to functional annotation
- Automatic annotation (sequence homology, rules, transfer info from PDB)
- Automatic classification (pattern databases, clustering, structure)
- Automatic characterisation (functional databases)
- Context information (comparative genome analysis, metabolic pathway databases)
- Experimental results (2D gels, microarrays)
- Full manual annotation (SWISS-PROT style)
49AUTOMATIC CLASSIFICATION
Annotation can be done using clustering methods, e.g. CluSTr (EBI), and pattern searches (InterPro etc.), which classify proteins into different families
52AUTOMATIC CHARACTERIZATION- FUNCTIONAL ANNOTATION
SCHEMES
- First attempt: Riley classification of E. coli
- Genome sequencing projects are the driving force
- Need standardised system and vocabulary
- Functional schemes are normally hierarchies with different levels of generalisation
53Databases for Functional Information
- KEGG - Kyoto Encyclopedia of Genes and Genomes (http://www.genome.ad.jp/kegg/)
  - Links genome information (GENES database) to high order functional information stored in PATHWAY database.
  - Also has LIGAND database for chemical compounds, molecules and reactions.
- PEDANT - Protein Extraction, Description and Analysis Tool (http://pedant.gsf.de/)
  - Annotation for complete and incomplete genomes, e.g. list of ORFs, EC numbers, functional categories, list of seqs with homologs, gene clusters, domain hits, TM, structure links, search facility for sequences etc.
- WIT - What Is There (http://www.cme.msu.edu/WIT)
  - Database of metabolic pathways, can text search for ORFs, pathways, enzymes
54Databases for Functional Information (2)
- COG - Clusters of Orthologous Groups (http://www.ncbi.nlm.nih.gov/COG)
  - Phylogenetic classification of proteins encoded in complete genomes.
  - Contains 2791 COGs from 30 genomes.
  - COGs thought to contain orthologous proteins, classified into broad functional categories (transcription, replication, cell division).
  - COGNITOR assigns proteins to COGs based on best-hit, divides multi-domain proteins
  - Can compare results with complete genomes, look for missing functions
- GO - Gene Ontology (http://www.geneontology.org)
  - Standard vocabulary first used for mouse, fly and yeast
  - Three ontologies: molecular function, biological process and cellular component
55Databases for Functional Information (3)
- MIPS MYGD FunCat - Functional catalogue (yeast) http://www.mips.biochem.mpg.de/proj/yeast
- EcoCyc - Encyclopedia of E. coli Genes and Metabolism http://ecocyc.doubletwist.com/ecocyc/ecocyc.html
- Enzyme database http://www.expasy.ch/sprot/enzyme.html
- TIGR Gene identification list http://www.tigr.org/tdb/mdb/mdb.html
- All schemes have different depths, breadths and resolutions
- Schemes need to be applicable to all organisms, standardized for comparisons and permit multiple assignments
56Assignment of function
- Use a combination of databases, especially those with standardised functional information
- Search function databases with sequences to find matches and assign function, e.g. PEDANT, PIR superfamilies, COGs, GO (via InterPro)
57FUNCTIONAL CLASSIFICATION USING INTERPRO
- InterPro classification with 3-4 letter codes
- Mapping of InterPro entries to GO
- GO - Gene Ontology (SGD, FB, MGD): universal ontology for
  - molecular function
  - biological process
  - cellular component
58Classification of IPRs
CGD Cell cycle/growth/death
  -CGDc cell cycle/division
  -CGDg cell growth/development
  -CGDd cell death
CYS Cytoskeletal/structural
  -CYSc cytoskeletal
  -CYSs structural
  -CYSv virus coat/capsid protein
DPT Defense/pathogenesis/toxin
DRG DNA/RNA-binding/regulation
DRM DNA/RNA metabolism
  -DRMr DNA repair/recombination
  -DRMp DNA replication
  -DRMm DNA/RNA modification
  -DRMt transcription/translation
  -DRMb ribosomal protein
MET Metabolism
  -METs substrate metabolism
  -METe electron transfer
  -METa amino acid metabolism
  -METn nucleic acid metabolism
  -METm metal binding proteins
OTH Other functions
  -OTHm cell motility
  -OTHt transposition
  -OTHa cell adhesion
  -OTHg miscellaneous functions
  -OTHh hormones
  -OTHi immune-response proteins
  -OTHf multifunctional proteins
  -OTHo multifunctional domains
PFD Protein folding and degradation
  -PFDc chaperone
  -PFDp protease/endopeptidase
  -PFDi protease inhibitor
PRG Protein-binding/other regulation
  -PRGg GPCRs
  -PRGr other receptors
  -PRGo other regulation
STD Signal transduction
  -STDk sig transduction kinases
  -STDp sig transduction phosphatases
  -STDr sig transduction response reg
  -STDs sig transduction sensors
  -STDc cell signalling
TRS Transport and secretion
  -TRSt transport (substrates)
  -TRSi transport (ions)
  -TRSs secretion
  -TRSr carrier proteins
UNK Unknown function
61Pie charts of whole proteome analysis of 4
organisms
62Distribution of protein functions
63GENOME ANNOTATION TOOLS
- Oak Ridge Genome Annotation Channel (http://compbio.ornl.gov/channel/)
- ENSEMBL (http://ensembl.ebi.ac.uk)
- Artemis (http://www.sanger.ac.uk/Software/Artemis): sequence viewer and annotation tool
- GeneQuiz (http://www.sander.ebi.ac.uk/genequiz/): system for automated annotation of sequences, web access required
- Genome Annotation Assessment Project (GASP1) (http://www.fruitfly.org/GASP1)
64PEDANT SYSTEM
- Layer 1: bioinformatics tools
  - Tools: PSI-BLAST, IMPALA, PREDATOR, CLUSTALW, TMAP, SIGNALP, SEG, PROSEARCH, COILS, HMMER
  - Databases for searching: MIPS, PROSITE, BLOCKS, PIR, COGS
  - Parser of results
- Layer 2: database to store information (MySQL)
- Manual annotation tool
- Layer 3: user interface to display results
- Programs written in Perl5 and some in C, so portable. Processing of one sequence takes about 3 minutes
65Summary of protein sequence annotation
- Mask compositionally-biased and coiled-coil regions
- Identify transmembrane regions, signal peptides, GPI anchors
- Predict secondary structure
- Look for known domains from protein pattern databases
- Search sequence database for similar sequences
- If no or few results, search with subsequences, do iterative searches
- Functional annotation: consider function of each domain present, annotation from database homologs, function from hits with 3D structure
66SUMMARY OF ANNOTATION PIPELINE
NB look out for multi-domain proteins, put into
genome context
Supplement with manual curation and use evidence
tags
67LIMITS OF PROTEIN SEQUENCE ANALYSIS
- Predicting function from sequence requires another sequence to be mapped to a function; many hypothetical proteins in db and UPFs
- If sequence homologues are found, they may not be functional homologues; a qualitative rather than quantitative process
  - orthologues may have different functions
  - enzyme homologues may be inactive
  - equivalent functions may use different genes, not orthologues
- Analogy can infer molecular function, but not necessarily cellular function
68LIMITS OF PROTEIN SEQUENCE ANALYSIS (2)
- Databases are biased in sequence and aa composition, and search is dependent on size
- If no homology found, only a limited amount of information can be inferred
- Incorrect annotation can be propagated when similarity is over a part of the sequence not used in annotation
- No answers to tissue-specificity, binding of ligands, relationship between genotype and phenotype
69LIMITS OF PROTEIN SEQUENCE ANALYSIS (3)
- Need additional information from experiments, e.g. can predict glycosylation sites, but not kind of sugar attached
- Problem with multidomain proteins (assign orthology on basis of domains or domain composition of whole protein?); check also known domain architectures and their taxonomic limitations
70Using different approaches to functional
annotation Status for SPTR
- Automatic annotation (RuleBase): 20% of all protein sequences / 20% of all new sequences
- Automatic classification (InterPro, CluSTr, Structure): 60% of all protein sequences / 60% of all new sequences
- Automatic characterisation (GO): 40% of all protein sequences / 40% of all new sequences
- Full annotation (SWISS-PROT style): 20% of all protein sequences / 5% of all new sequences
71Using different approaches to functional
annotation Future for SPTR
- Automatic annotation (RuleBase): 50% of all protein sequences in 2004
- Automatic classification (InterPro, CluSTr, Structure): 90% of all protein sequences in 2004
- Automatic characterisation (GO): 70% of all protein sequences in 2004
- Full annotation (SWISS-PROT style): 10% of all protein sequences in 2004
72IMPORTANT TO NOTE
- DON'T COMPLETELY TRUST COMPUTER RESULTS
- CHECK LITERATURE
- CONFIRM WITH WETLAB WORK: mutational analysis gives valuable info about function
- COMPROMISE BETWEEN OVER- AND UNDER-PREDICTIONS: overpredictions can be checked by curators, easier to delete than find missing info.