GENOME ANNOTATION AND FUNCTIONAL GENOMICS: The protein sequence perspective

Slides: 73

Provided by: Beat278

Transcript and Presenter's Notes

1
GENOME ANNOTATION AND FUNCTIONAL GENOMICS: The
protein sequence perspective
2
GENOME ANNOTATION

Two main levels
STRUCTURAL ANNOTATION: finding genes and other
biologically relevant sites, thus building up a
model of the genome as objects with specific
locations
FUNCTIONAL ANNOTATION: objects are used in
database searches (and experiments); the aim is
attributing biologically relevant information to
the whole sequence and to individual objects

3
WHY PROTEIN RATHER THAN DNA?

Larger alphabet: more sensitive comparisons
Protein sequences have a better signal-to-noise ratio
Less redundancy and no frameshifts
Each aa has different properties like size,
charge etc
Closer to biological function
3D structure of similar proteins may be known
Evolutionary relationships more evident
Availability of good, well annotated protein
sequence and pattern databases

4
Large-scale genome analysis projects

Rate-limiting step is annotation
Whole genome availability provides context
information
Main goal is to bridge gap between genotype and
phenotype

5
Definitions of Annotation

Addition of as much reliable and up-to-date
information as possible to describe a sequence
Identification, structural description,
characterisation of putative protein products and
other features in primary genomic sequence
Information attached to genomic coordinates with
start and end point, can occur at different
levels
Interpreting raw sequence data into useful
biological information

6
ANNOTATION/FUNCTION CAN BE MAPPED TO DIFFERENT
LEVELS

ORGANISM: phenotypic function (morphology,
physiology, behaviour, environmental response);
context is important
CELLULAR: metabolic pathway, signal cascades,
cellular localisation; context dependent
MOLECULAR: binding sites, catalytic activity,
PTM, 3D structure
DOMAIN
SINGLE RESIDUE

7
Annotation is the description of

Function(s) of the protein
Post-translational modification(s)
Domains and sites
Secondary structure
Quaternary structure
Similarities to other proteins
Disease(s) associated with deficiency(ies) in the
protein
Sequence conflicts, variants, etc.

8
Additional information for proteins

ALTERNATIVE PRODUCTS
CATALYTIC ACTIVITY
COFACTOR
DEVELOPMENTAL STAGE
DISEASE
DOMAIN
ENZYME REGULATION
FUNCTION
INDUCTION

PATHWAY
PHARMACEUTICALS
POLYMORPHISM
PTM
SIMILARITY
SUBCELLULAR LOCATION
SUBUNIT
TISSUE SPECIFICITY

9
Amino-acid sites are

Post-translational modification of a residue
Covalent binding of a lipidic moiety
Disulfide bond
Thiolester bond
Thioether bond
Glycosylation site
Binding site for a metal ion
Binding site for any chemical group (co-enzyme,
prosthetic group, etc.)

10
Regions

SIGNAL SEQUENCE
TRANSIT PEPTIDE
PROPEPTIDE
CHAIN
PEPTIDE
DOMAIN

ACTIVE SITE
DNA BIND SITE
METAL BIND SITE
MOLECULE BIND SITE
TRANSMEMBRANE

11
Annotation sources

publications that report new sequence data
review articles to periodically update the
annotation of families or groups of proteins
external experts
protein sequence analysis

12
Approaches to functional annotation

Automatic annotation (sequence homology, rules,
transfer of information from PDB)
Automatic classification (pattern databases,
clustering, structure)
Automatic characterisation (functional databases)
Context information (comparative genome analysis,
metabolic pathway databases)
Experimental results (2D gels, microarrays)
Full manual annotation (SWISS-PROT style)

13
PROTEIN SEQUENCE ANALYSIS

Protein sequence can come from gene predictions,
literature or peptide sequencing
Analysis on different levels
molecular
cellular
organism
Simplest case: match for whole sequence in
database, giving determination of structure and function
In between: partial matches across sequence to
diverse or hypothetical proteins
Difficult case: no match, have to derive
information from amino acid properties, pattern
searches etc

14
From sequence to function
15
Predicting function from sequence similarity

Orthologues: arose from speciation, same gene in
different organisms; can have <30% sequence identity
Paralogues: arose from duplication within a genome;
second copy may have new or changed function
(difficult to distinguish between ortho- and
paralogues unless whole genome is available)
Equivalogs: proteins with equivalent functions
Analogs: proteins catalyzing the same reaction but
not structurally related
Some enzymes may have sequence similarity simply
because they share a common catalytic site,
substrate or pathway.

16
TYPES OF HOMOLOGY
[Diagram: tree of a protein/domain superfamily]
Duplication within species gives paralogues (A, B);
paralogues may have different functions
Speciation gives orthologues (B1, B2); orthologues
may have different functions, if same: equivalogs
17
Sequence homology in genomes

When you do a whole genome BLAST search there is
a general pattern of results

Maverick genes tend to diverge more frequently
than core genes
18
Using homology information for automatic
annotation- automatic annotation of TrEMBL as an
example
19
Requirements for automatic annotation

Well-annotated reference database (eg SWISS-PROT)
Highly reliable diagnostic protein family
signature database with the means to assign
proteins to groups (eg CDD, InterPro, IProClass)
A RuleBase to store and manage the annotation
rules, their sources and their usage

20
Direct Transfer

Search target
Transfer annotation to target database
Example: FASTA against sequence database and
transfer of DE line of best hit

XDB
Target
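
Direct transfer can be sketched in a few lines. This is only an illustration, assuming the search hits are already available as simple records; the Hit type, the scores and the accessions are hypothetical, and no real FASTA output is parsed.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Hit:
    """One database hit for a query (hypothetical, simplified record)."""
    accession: str
    score: float   # higher is better in this sketch
    de_line: str   # description (DE) line of the hit


def transfer_best_de(hits: List[Hit], min_score: float = 0.0) -> Optional[str]:
    """Return the DE line of the best-scoring hit, or None if nothing passes min_score."""
    candidates = [h for h in hits if h.score >= min_score]
    if not candidates:
        return None
    return max(candidates, key=lambda h: h.score).de_line


# Made-up hits for one query sequence.
hits = [
    Hit("X00001", 120.0, "HYPOTHETICAL PROTEIN."),
    Hit("X00002", 310.5, "HEMAGGLUTININ PRECURSOR."),
]
print(transfer_best_de(hits))
```

The min_score cut-off is an added assumption: it lets the target stay unannotated when even the best hit is weak.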
21
Multiple Sources

Usually more than one external database is used
Combine the different results

XDB
Target
22
Conflicts

Contradiction
Inconsistencies
Synonyms
Redundancy

23
Translation

Use a translator to map XDB language to target
language

XDB
Target
24
Translation Examples

ENZYME -> TrEMBL
CA   L-ALANINE = D-ALANINE
CC   -!- CATALYTIC ACTIVITY: L-ALANINE = D-ALANINE.
PROSITE -> TrEMBL
/SITE=3,heme_iron
FT   METAL   IRON
Pfam -> TrEMBL
FT   DOMAIN   zf_C3HC4
FT   ZN_FING  C3HC4-TYPE
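
A translator of this kind might look roughly as follows. This is a sketch, not the actual TrEMBL translator: the function names are hypothetical, the lookup tables contain only the examples from this slide, and the ENZYME CA line is assumed to have the form "CA   L-alanine = D-alanine.".

```python
# Lookup tables built only from the examples on the slide (illustrative).
SITE_TO_FT = {"heme_iron": ("METAL", "IRON")}
PFAM_TO_FT = {"zf_C3HC4": ("ZN_FING", "C3HC4-TYPE")}


def translate_enzyme_ca(ca_line: str) -> str:
    """ENZYME 'CA' line -> catalytic activity comment (CC) line."""
    reaction = ca_line.split(None, 1)[1].rstrip(".")
    return f"CC   -!- CATALYTIC ACTIVITY: {reaction.upper()}."


def translate_prosite_site(site: str) -> str:
    """PROSITE '/SITE=position,type' -> feature (FT) line."""
    _position, site_type = site.split("=", 1)[1].split(",", 1)
    key, description = SITE_TO_FT[site_type]
    return f"FT   {key}   {description}"


def translate_pfam_domain(ft_line: str) -> str:
    """Pfam-style 'FT DOMAIN name' line -> target feature (FT) line."""
    name = ft_line.split()[-1]
    key, description = PFAM_TO_FT[name]
    return f"FT   {key}   {description}"


print(translate_enzyme_ca("CA   L-alanine = D-alanine."))
print(translate_prosite_site("/SITE=3,heme_iron"))
print(translate_pfam_domain("FT   DOMAIN   zf_C3HC4"))
```

Keeping the mappings in tables rather than in code means new XDB terms can be added without touching the translation functions.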

25
Demands on a system for automated data analysis
and annotation

Correctness
Scalability
Updateable
Low level of redundant information
Completeness
Standardized vocabulary

26
What do we have?

SWISS-PROT
RuleBase
TrEMBL
PROSITE (and Pfam, PRINTS, ProDom, SMART, Blocks
etc)
SWISS-PROT/TrEMBL/RuleBase in Oracle

27
Standardized transfer of annotation from
characterized proteins in SWISS-PROT to TrEMBL
entries

TrEMBL entry is reliably recognized by a given
method as a member of a certain group of proteins
corresponding group of proteins in SWISS-PROT
shares certain annotation
common annotation is transferred to the TrEMBL
entry and flagged as annotated by similarity

28
Automatic annotation information flow

Get information necessary to assign proteins to
groups eg using InterPro or other biological or
family information- store in RuleBase
Group proteins in SWISS-PROT by these conditions
Extract common annotation shared by all these
proteins- store in RuleBase
Group unannotated sequences by the conditions
Transfer common annotation flagged with evidence
tags
Note: taxonomic constraints can be added

29
Extract Reference Entries

Use XDB to extract entries from standard database
Example: Pfam PF00509 Hemagglutinin
HEMA_IAVI7/P03435
HEMA_IANT6/P03436
HEMA_IAAIC/P03437
HEMA_IAX31/P03438
HEMA_IAME2/P03439
HEMA_IAEN7/P03440
HEMA_IABAN/P03441
HEMA_IADU3/P03442
HEMA_IADA1/P03443
HEMA_IADMA/P03444
HEMA_IADM1/P03445
HEMA_IADA2/P03446
HEMA_IASH5/P03447

Pfam
TrEMBL
SWISS-PROT
30
Extract Common Annotation

132 entries read
131 ID HEMA_XXXXX
125 DE HEMAGGLUTININ PRECURSOR.
  6 DE HEMAGGLUTININ.
131 GN HA
130 CC -!- FUNCTION: HEMAGGLUTININ IS RESPONSIBLE FOR ATTACHING THE
130 CC     VIRUS TO CELL RECEPTORS AND FOR INITIATING INFECTION.
125 CC -!- SUBUNIT: HOMOTRIMER. EACH OF THE MONOMER IS FORMED BY TWO
125 CC     CHAINS (HA1 AND HA2) LINKED BY A DISULFIDE BOND.
 75 DR HSSP P03437 1HGD.
 31 DR HSSP P03437 1DLH.
131 KW HEMAGGLUTININ; GLYCOPROTEIN; ENVELOPE PROTEIN
102 KW SIGNAL
  1 KW COAT PROTEIN; POLYPROTEIN; 3D-STRUCTURE
130 FT CHAIN HA1 CHAIN.
107 FT CHAIN HA2 CHAIN.
102 FT SIGNAL
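
The counts in this listing can be produced by tallying, for each annotation line, how many entries of the group contain it. A minimal sketch, assuming each entry is simply a list of annotation lines; the function names and the min_fraction cut-off are hypothetical:

```python
from collections import Counter
from typing import Iterable, List


def count_annotation(entries: Iterable[List[str]]) -> Counter:
    """Count in how many entries each annotation line occurs."""
    counts: Counter = Counter()
    for lines in entries:
        counts.update(set(lines))  # count a line at most once per entry
    return counts


def common_annotation(entries: List[List[str]], min_fraction: float = 1.0) -> List[str]:
    """Annotation lines present in at least min_fraction of the entries."""
    counts = count_annotation(entries)
    threshold = min_fraction * len(entries)
    return sorted(line for line, n in counts.items() if n >= threshold)


# Toy group of three entries, loosely modelled on the hemagglutinin example.
entries = [
    ["GN HA", "KW SIGNAL", "DE HEMAGGLUTININ PRECURSOR."],
    ["GN HA", "KW SIGNAL", "DE HEMAGGLUTININ PRECURSOR."],
    ["GN HA", "DE HEMAGGLUTININ."],
]
print(common_annotation(entries))        # lines shared by every entry
print(common_annotation(entries, 0.6))   # lines shared by at least 60%
```

Requiring a line in all entries (min_fraction of 1.0) is the strict reading of "common annotation"; a lower fraction trades coverage against the risk of over-prediction.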

31
Store Common Annotation

Store the used conditions and the extracted
common annotation in a separate database

XDB
TrEMBL
SWISS-PROT
RuleBase
32
RULES

Rules describe
the content of the annotation to be transferred
(ACTIONS),
the CONDITIONS which the target TrEMBL entry
must fulfill in order to allow transfer of the
annotation.
Rules uniquely describe or delineate a set of
SWISS-PROT entries.
The common annotation in these entries is
transferred to TrEMBL.

33
//
RULE RU000482
DATE 2001-01-11
USER OPSWFL
PACK PROSITE
?PSAC PS00449
?EMOT PS00449
!ECNO 3.6.1.34
!SPDE ATP synthase A chain
!CCFU KEY COMPONENT OF THE PROTON CHANNEL; IT MAY PLAY A DIRECT ROLE IN THE TRANSLOCATION OF PROTONS ACROSS THE MEMBRANE (BY SIMILARITY)
!CCSU F-TYPE ATPASES HAVE 2 COMPONENTS, CF(1) - THE CATALYTIC CORE - AND CF(0) - THE MEMBRANE PROTON CHANNEL. CF(1) HAS FIVE SUBUNITS: ALPHA(3), BETA(3), GAMMA(1), DELTA(1), EPSILON(1). CF(0) HAS THREE MAIN SUBUNITS: A, B AND C (BY SIMILARITY)
!CCLO INTEGRAL MEMBRANE PROTEIN (By Similarity)
!CCSI TO THE ATPASE A CHAIN FAMILY
!SPKW CF(0)
!SPKW Hydrogen ion transport
!SPKW Transmembrane
//
CONDITIONS: lines prefixed with ?
ACTIONS: lines prefixed with !
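
A rule in this layout can be read into conditions and actions with a small parser. This is a sketch under stated assumptions: one field per line, "?" marking a condition and "!" marking an action (as the slide callouts suggest). The Rule class and parse_rule function are hypothetical, and the real RuleBase storage format may differ.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class Rule:
    header: Dict[str, str] = field(default_factory=dict)             # RULE, DATE, USER, PACK
    conditions: List[Tuple[str, str]] = field(default_factory=list)  # '?' lines
    actions: List[Tuple[str, str]] = field(default_factory=list)     # '!' lines


def parse_rule(text: str) -> Rule:
    """Split a rule record into header fields, conditions and actions."""
    rule = Rule()
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line == "//":   # skip blanks and record delimiters
            continue
        tag, _, value = line.partition(" ")
        value = value.strip()
        if tag.startswith("?"):
            rule.conditions.append((tag[1:], value))
        elif tag.startswith("!"):
            rule.actions.append((tag[1:], value))
        else:
            rule.header[tag] = value
    return rule


# Shortened copy of rule RU000482 from the slide.
RULE_TEXT = """//
RULE RU000482
DATE 2001-01-11
USER OPSWFL
PACK PROSITE
?PSAC PS00449
?EMOT PS00449
!ECNO 3.6.1.34
!SPDE ATP synthase A chain
!SPKW Transmembrane
//"""

rule = parse_rule(RULE_TEXT)
print(rule.header["RULE"], rule.conditions, rule.actions)
```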
34
Add Annotation to Target

Use conditions to extract entries from TrEMBL
Add common annotation to the entries

XDB
TrEMBL
SWISS-PROT
RuleBase
35
Automatic annotation using multiple dbs

Extract conditions from XDB
Group SWISS-PROT by conditions
Extract common annotation
Group TrEMBL by conditions
Add common annotation to TrEMBL

ENZYME
Pfam
INTERPRO
PROSITE
TrEMBL
SWISS-PROT
RuleBase
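
The five steps above can be sketched end to end. This is an illustration only: entries are reduced to a set of signature accessions plus a list of annotation lines, the conditions are a plain set of required signatures, and the Entry class, the accessions S1/S2/T1/T2 and the bracketed evidence-tag format are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class Entry:
    accession: str
    signatures: Set[str]                     # e.g. PROSITE/Pfam/InterPro accessions
    annotation: List[str] = field(default_factory=list)


def group_by_conditions(entries: List[Entry], conditions: Set[str]) -> List[Entry]:
    """Entries that carry every signature named in the conditions."""
    return [e for e in entries if conditions <= e.signatures]


def shared_annotation(entries: List[Entry]) -> List[str]:
    """Annotation lines present in every entry of the group."""
    if not entries:
        return []
    shared = set(entries[0].annotation)
    for e in entries[1:]:
        shared &= set(e.annotation)
    return sorted(shared)


def annotate(reference: List[Entry], targets: List[Entry],
             conditions: Set[str], evidence_tag: str) -> List[Entry]:
    """Transfer the annotation shared by the reference group to matching targets."""
    common = shared_annotation(group_by_conditions(reference, conditions))
    matched = group_by_conditions(targets, conditions)
    for entry in matched:
        for line in common:
            tagged = f"{line} [{evidence_tag}]"   # flag the source of the annotation
            if tagged not in entry.annotation:
                entry.annotation.append(tagged)
    return matched


swissprot = [
    Entry("S1", {"PS00449"}, ["KW Transmembrane", "KW CF(0)", "DE Example protein 1"]),
    Entry("S2", {"PS00449"}, ["KW Transmembrane", "KW CF(0)"]),
]
trembl = [Entry("T1", {"PS00449"}), Entry("T2", {"PS00605"})]
for e in annotate(swissprot, trembl, {"PS00449"}, "RU000482"):
    print(e.accession, e.annotation)
```

Only T1 meets the condition, so only T1 receives the two shared keyword lines; T2 is left untouched.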
36
Using tree structure of InterPro
37
RU000652 with additional condition connected by
AND
//
RULE RU000652
DATE 2001-01-11
USER OPSWFL
PACK PROSITE
?IPRO IPR002379
?PSAC PS00605
?EMOT PS00605
!SPDE ATP synthase C chain (Lipid-binding protein) (Subunit C)
!ECNO 3.6.1.34
!CCSU F-TYPE ATPASES HAVE 2 COMPONENTS, CF(1) - THE CATALYTIC CORE - AND CF(0) - THE MEMBRANE PROTON CHANNEL. CF(1) HAS FIVE SUBUNITS: ALPHA(3), BETA(3), GAMMA(1), DELTA(1), EPSILON(1). CF(0) HAS THREE MAIN SUBUNITS: A, B AND C (By Similarity)
!CCSI TO THE ATPASE C CHAIN FAMILY
!SPKW CF(0)
!SPKW Hydrogen ion transport
!SPKW Lipid-binding
!SPKW Transmembrane
//
Additional condition (parent signature): ?IPRO IPR002379
38
Condition types

Signature hits
- PROSITE, PRINTS, Pfam, ProDom
Taxonomy
- Broad groups like
Archaea
Bacteriophage
Eukaryota
Prokaryota
Eukaryotic viruses
- more specific such as species
Organelle

Conditions
Negated conditions
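
Conditions of these types, including negated ones, could be evaluated along the following lines. A sketch only: the Protein and Condition classes are hypothetical, all conditions are combined with AND, and the negated taxonomy condition in the example is illustrative rather than taken from a real rule.

```python
from dataclasses import dataclass
from typing import FrozenSet, List


@dataclass(frozen=True)
class Protein:
    signatures: FrozenSet[str]   # signature hits (PROSITE, PRINTS, Pfam, ProDom, ...)
    lineage: FrozenSet[str]      # taxonomic groups the source organism belongs to
    organelle: str = ""          # empty if not organelle-encoded


@dataclass(frozen=True)
class Condition:
    kind: str                    # "signature", "taxonomy" or "organelle"
    value: str
    negated: bool = False

    def holds(self, protein: Protein) -> bool:
        if self.kind == "signature":
            result = self.value in protein.signatures
        elif self.kind == "taxonomy":
            result = self.value in protein.lineage
        elif self.kind == "organelle":
            result = self.value == protein.organelle
        else:
            raise ValueError(f"unknown condition kind: {self.kind}")
        return result != self.negated   # a negated condition inverts the outcome


def satisfies(protein: Protein, conditions: List[Condition]) -> bool:
    """True if the protein meets every condition (conditions are ANDed)."""
    return all(c.holds(protein) for c in conditions)


conditions = [
    Condition("signature", "PS00605"),
    Condition("signature", "IPR002379"),
    Condition("taxonomy", "Eukaryota", negated=True),   # illustrative constraint
]
bacterial = Protein(frozenset({"PS00605", "IPR002379"}), frozenset({"Prokaryota"}))
eukaryotic = Protein(frozenset({"PS00605", "IPR002379"}), frozenset({"Eukaryota"}))
print(satisfies(bacterial, conditions), satisfies(eukaryotic, conditions))
```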

39
Rule-building

Grouping and extraction of common annotation
- semi-automated, but involves manual data-mining
assisted by Perl/shell scripts.
Algorithmic data-mining
- fully automated.
- fast.
- exhaustive exploration of the
condition-set/annotation search space.
- non-biological; validity of rules is being
assessed by comparison with the semi-manual
approach.

40
Advantages of this method

Uses reliable ref database, prevents propagation
of incorrect annotation
Using common annotation of multiple entries,
lower over-prediction than from best hit of BLAST
Can standardize annotation and nomenclature of
target sequences, since reference is standardized
Can have different levels of common annotation
from different levels of family hierarchy
Independent of multi-domain organisation
Evidence tags allow for easy tracking and updating

41
Pitfalls of automatic functional analysis

Multifunctional proteins: genome projects often
assign a single function, info is lost in homology
search
Hypothetical proteins (40% of ORFs unknown), and
poorly or even wrongly annotated proteins
No coverage of position-specific annotation eg
active sites
Current methods provide only a phrase describing
some properties of the unknown protein
It is important to have evidence for all
annotation added

42
EVIDENCE TAGS
43
(No Transcript)
44
Predicting function from non-homology

Look at position of genes relative to others,
compare with other organisms
Can still build up rules from annotated sequences
using information you have on other features like
fold, physical properties etc.
Use physical properties and known attributes

45
Protein functions from regions

Active sites- short, highly conserved regions
Loops- charged residues and variable sequence
Interior of protein- conservation of charged
amino acids

46
Protein functions from specific residues

Polar (C,D,E,H,K,N,Q,R,S,T): active sites
Aromatic (F,H,W,Y): protein ligand-binding
sites
Zn-coord (C,D,E,H,N,Q): active site, zinc
finger
Ca2+-coord (D,E,N,Q): ligand-binding site
Mg/Mn-coord (D,E,N,S,R,T): Mg2+ or Mn2+
catalysis, ligand binding
Ph-bind (H,K,R,S,T): phosphate and sulphate
binding

C: disulphide-rich, metallothionein, zinc
fingers
DE: acidic proteins (unknown)
G: collagens
H: histidine-rich glycoprotein
KR: nuclear proteins, nuclear localisation
P: collagen, filaments
SR: RNA binding motifs
ST: mucins
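
The residue classes above lend themselves to a simple composition check. A sketch, assuming one-letter amino acid codes; the function names and the 0.2 enrichment threshold are arbitrary illustrative choices, and a high fraction is only a hint, not evidence of function.

```python
# Residue classes as listed on the slide.
RESIDUE_CLASSES = {
    "polar": "CDEHKNQRST",
    "aromatic": "FHWY",
    "zn_coord": "CDEHNQ",
    "ca_coord": "DENQ",
    "mg_mn_coord": "DENSRT",
    "ph_bind": "HKRST",
}


def class_fractions(sequence: str) -> dict:
    """Fraction of residues in the sequence that fall into each class."""
    seq = sequence.upper()
    if not seq:
        return {name: 0.0 for name in RESIDUE_CLASSES}
    return {
        name: sum(seq.count(aa) for aa in members) / len(seq)
        for name, members in RESIDUE_CLASSES.items()
    }


def enriched_residues(sequence: str, threshold: float = 0.2) -> list:
    """Residues making up at least `threshold` of the sequence (arbitrary cut-off)."""
    seq = sequence.upper()
    return sorted(aa for aa in set(seq) if seq.count(aa) / len(seq) >= threshold)


print(class_fractions("CCHH"))
print(enriched_residues("GPGPGPGA", threshold=0.3))
```

A sequence flagged as rich in G and P, for example, would point towards the collagen-like entries in the second list.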

47
Supplement annotation with Xrefs to other
databases

DDBJ/EMBL/GenBank Nucleotide Sequence Database
PDB
Genomic databases (FlyBase, MGD, SGD)
2D-Gel databases (ECO2DBASE, SWISS-2DPAGE,
Aarhus/Ghent, YEPD, Harefield), Gene expression
data
Specialized collections (OMIM, InterPro, PROSITE,
PRINTS, PFAM, ProDom, SMART, ENZYME, GPCRDB,
Transfac, HSSP)

48
Approaches to functional annotation

Automatic annotation (sequence homology, rules,
transfer of information from PDB)
Automatic classification (pattern databases,
clustering, structure)
Automatic characterisation (functional databases)
Context information (comparative genome analysis,
metabolic pathway databases)
Experimental results (2D gels, microarrays)
Full manual annotation (SWISS-PROT style)

49
AUTOMATIC CLASSIFICATION
Annotation can be done using clustering methods, eg
CluSTr (EBI), and pattern searches (InterPro
etc): classification of proteins into different
families
50
(No Transcript)
51
(No Transcript)
52
AUTOMATIC CHARACTERIZATION- FUNCTIONAL ANNOTATION
SCHEMES

First attempt: Riley classification of E. coli
Genome sequencing projects driving force
Need standardised system and vocabulary
Functional schemes normally hierarchies of
different levels of generalisation

53
Databases for Functional Information

KEGG: Kyoto Encyclopedia of Genes and Genomes
(http://www.genome.ad.jp/kegg/)
Links genome information (GENES database) to
higher-order functional information stored in the
PATHWAY database.
Also has LIGAND database for chemical compounds,
molecules and reactions.
PEDANT: Protein Extraction, Description and
Analysis Tool
(http://pedant.gsf.de/)
Annotation for complete and incomplete genomes,
eg list of ORFs, EC numbers, functional
categories, list of sequences with homologs, gene
clusters, domain hits, TM, structure links,
search facility for sequences etc
WIT: What Is There
(http://www.cme.msu.edu/WIT)
Database of metabolic pathways, can text search
for ORFs, pathways, enzymes

54
Databases for Functional Information (2)

COG: Clusters of Orthologous Groups
(http://www.ncbi.nlm.nih.gov/COG)
Phylogenetic classification of proteins encoded
in complete genomes.
Contains 2791 COGs covering 30 genomes.
COGs are thought to contain orthologous proteins,
classified into broad functional categories
(transcription, replication, cell division).
COGNITOR assigns proteins to COGs based on
best hit, divides multi-domain proteins
Can compare results with complete genomes, look
for missing functions
GO: Gene Ontology
(http://www.geneontology.org)
Standard vocabulary first used for mouse, fly and
yeast
Three ontologies: molecular function, biological
process and cellular component

55
Databases for Functional Information (3)

MIPS MYGD FunCat: Functional catalogue (yeast)
http://www.mips.biochem.mpg.de/proj/yeast
EcoCyc: Encyclopedia of E. coli Genes and
Metabolism
http://ecocyc.doubletwist.com/ecocyc/ecocyc.html
Enzyme database
http://www.expasy.ch/sprot/enzyme.html
TIGR Gene identification list
http://www.tigr.org/tdb/mdb/mdb.html
All schemes have different depths, breadths and
resolutions
Schemes need to be applicable to all organisms,
standardized for comparisons and permit multiple
assignments

56
Assignment of function

Use a combination of databases, especially those
with standardised functional information
Search function databases with sequences to find
matches and assign function, eg PEDANT, PIR
superfamilies, COGs, GO (via InterPro)

57
FUNCTIONAL CLASSIFICATION USING INTERPRO

InterPro classification with 3-4 letter codes
Mapping of InterPro entries to GO
GO: Gene Ontology (SGD, FB, MGD), universal
ontology for
molecular function
biological process
cellular component

58
Classification of IPRs
CGD Cell cycle/growth/death
-CGDc cell cycle/division
-CGDg cell growth/development
-CGDd cell death
CYS Cytoskeletal/structural
-CYSc cytoskeletal
-CYSs structural
-CYSv virus coat/capsid protein
DPT Defense/pathogenesis/toxin
DRG DNA/RNA-binding/regulation
DRM DNA/RNA metabolism
-DRMr DNA repair/recombination
-DRMp DNA replication
-DRMm DNA/RNA modification
-DRMt transcription/translation
-DRMb ribosomal protein
MET Metabolism
-METs substrate metabolism
-METe electron transfer
-METa amino acid metabolism
-METn nucleic acid metabolism
-METm metal binding proteins
OTH Other functions
-OTHm cell motility
-OTHt transposition
-OTHa cell adhesion
-OTHg miscellaneous functions
-OTHh hormones
-OTHi immune-response proteins
-OTHf multifunctional proteins
-OTHo multifunctional domains
PFD Protein folding/degradation
-PFDc chaperone
-PFDp protease/endopeptidase
-PFDi protease inhibitor
PRG Protein-binding/other regulation
-PRGg GPCRs
-PRGr other receptors
-PRGo other regulation
STD Signal transduction
-STDk sig transduction kinases
-STDp sig transduction phosphatases
-STDr sig transduction response reg
-STDs sig transduction sensors
-STDc cell signalling
TRS Transport and secretion
-TRSt transport (substrates)
-TRSi transport (ions)
-TRSs secretion
-TRSr carrier proteins
UNK Unknown function
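
Because each subclass code begins with the three letters of its parent class, per-protein codes can be rolled up into top-level counts, for instance as input for a proteome-wide pie chart. A sketch, assuming each protein has already been assigned one code; the function names and the example codes are hypothetical.

```python
from collections import Counter
from typing import Iterable

# Top-level classes as listed on the slide.
TOP_LEVEL = {
    "CGD": "Cell cycle/growth/death",
    "CYS": "Cytoskeletal/structural",
    "DPT": "Defense/pathogenesis/toxin",
    "DRG": "DNA/RNA-binding/regulation",
    "DRM": "DNA/RNA metabolism",
    "MET": "Metabolism",
    "OTH": "Other functions",
    "PFD": "Protein folding/degradation",
    "PRG": "Protein-binding/other regulation",
    "STD": "Signal transduction",
    "TRS": "Transport and secretion",
    "UNK": "Unknown function",
}


def top_level(code: str) -> str:
    """Map a 3- or 4-letter code such as 'DRMr' to its parent class 'DRM'."""
    parent = code[:3].upper()
    if parent not in TOP_LEVEL:
        raise KeyError(f"unknown functional class: {code}")
    return parent


def tally(codes: Iterable[str]) -> Counter:
    """Count assignments per top-level class."""
    return Counter(top_level(c) for c in codes)


# Made-up assignments for a handful of proteins.
counts = tally(["DRMr", "DRMp", "METa", "UNK"])
for code, n in counts.most_common():
    print(code, TOP_LEVEL[code], n)
```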
59
(No Transcript)
60
(No Transcript)
61
Pie charts of whole proteome analysis of 4
organisms
62
Distribution of protein functions
63
GENOME ANNOTATION TOOLS

Oak Ridge Genome Annotation Channel
(http://compbio.ornl.gov/channel/)
ENSEMBL (http://ensembl.ebi.ac.uk)
Artemis (http://www.sanger.ac.uk/Software/Artemis):
sequence viewer and annotation tool
GeneQuiz (http://www.sander.ebi.ac.uk/genequiz/):
system for automated annotation of sequences, web
access required
Genome Annotation Assessment Project (GASP1)
(http://www.fruitfly.org/GASP1)

64
PEDANT SYSTEM
Layer 1: bioinformatics tools and databases for
searching
Tools: PSI-BLAST, IMPALA, PREDATOR, CLUSTALW, TMAP,
SIGNALP, SEG, PROSEARCH, COILS, HMMER
Databases: MIPS, PROSITE, BLOCKS, PIR, COGS
Parser of results
Layer 2: database to store information (MySQL)
Manual annotation tool
Layer 3: user interface to display results
Programs written in Perl5 and some in C
(portable). Processing of one sequence takes about
3 minutes
65
Summary of protein sequence annotation

Mask compositionally-biased and coiled-coil
regions
Identify transmembrane regions, signal peptides,
GPI anchors
Predict secondary structure
Look for known domains from protein pattern
databases
Search sequence database for similar sequences
If no or few results, search with subsequences and
do iterative searches
Functional annotation: consider function of each
domain present, annotation from database
homologs, function from hits with 3D structure

66
SUMMARY OF ANNOTATION PIPELINE
NB look out for multi-domain proteins, put into
genome context
Supplement with manual curation and use evidence
tags
67
LIMITS OF PROTEIN SEQUENCE ANALYSIS

Predicting function from sequence requires
another sequence to be mapped to a function; many
hypothetical proteins and UPFs in the databases
If sequence homologues are found, they may not be
functional homologues (qualitative rather than
quantitative process)
- orthologues may have different functions
- enzyme homologues may be inactive
- equivalent functions may be carried out by
different, non-orthologous genes
Analogy can infer molecular function, but not
necessarily cellular function

68
LIMITS OF PROTEIN SEQUENCE ANALYSIS (2)

Databases are biased in sequence and amino acid
composition, and search is dependent on size
If no homology is found, only a limited amount of
information can be inferred
Incorrect annotation can be propagated when
similarity is over a part of the sequence not used
in annotation
No answers to tissue-specificity, binding of
ligands, relationship between genotype and
phenotype

69
LIMITS OF PROTEIN SEQUENCE ANALYSIS (3)

Need additional information from experiments, eg
can predict glycosylation sites, but not kind of
sugar attached
Problem with multidomain proteins (assign
orthology on basis of domains or domain
composition of whole protein?) -check also known
domain architectures and their taxonomic
limitations

70
Using different approaches to functional
annotation Status for SPTR

Automatic annotation (RuleBase): 20% of all
protein sequences / 20% of all new sequences
Automatic classification (InterPro, CluSTr,
Structure): 60% of all protein sequences / 60% of
all new sequences
Automatic characterisation (GO): 40% of all
protein sequences / 40% of all new sequences
Full annotation (SWISS-PROT style): 20% of all
protein sequences / 5% of all new sequences

71
Using different approaches to functional
annotation Future for SPTR

Automatic annotation (RuleBase): 50% of all
protein sequences in 2004
Automatic classification (InterPro, CluSTr,
Structure): 90% of all protein sequences in 2004
Automatic characterisation (GO): 70% of all
protein sequences in 2004
Full annotation (SWISS-PROT style): 10% of all
protein sequences in 2004

72
IMPORTANT TO NOTE

DON'T COMPLETELY TRUST COMPUTER RESULTS
CHECK LITERATURE
CONFIRM WITH WET-LAB WORK: mutational analysis
gives valuable info about function
COMPROMISE BETWEEN OVER- AND UNDER-PREDICTIONS:
over-predictions can be checked by curators, and it
is easier to delete than to find missing info.
