Protein domains, function and associated prediction

Title: PowerPoint Presentation Author: heringa Last modified by: heringa Created Date: 2/20/2003 5:34:46 PM Document presentation format: On-screen Show (4:3) – PowerPoint PPT presentation

Slides: 80
Provided by: heri4
1
Lecture 14
Protein domains, function and associated
prediction
Introduction to Bioinformatics
2
Metabolomics fluxomics
3
Experimental
  • Structural genomics
  • Functional genomics
  • Protein-protein interaction
  • Metabolic pathways
  • Expression data

4
Issues when elucidating function experimentally
  • Typically done through knock-out experiments
  • Partial information (indirect interactions) and
    subsequent filling of the missing steps
  • Negative results (elements that have been shown
    not to interact, enzymes missing in an organism)
  • Putative interactions resulting from
    computational analyses

5
Protein function categories
  • Catalysis (enzymes)
  • Binding and transport (active/passive)
  • Protein-DNA/RNA binding (e.g. histones,
    transcription factors)
  • Protein-protein interactions (e.g.
    antibody-lysozyme) (experimentally determined by
    yeast two-hybrid (Y2H) or bacterial two-hybrid
    (B2H) screening )
  • Protein-fatty acid binding (e.g. apolipoproteins)
  • Protein-small molecule binding (drug interaction,
    structure decoding)
  • Structural component (e.g. crystallin)
  • Regulation
  • Signalling
  • Transcription regulation
  • Immune system
  • Motor proteins (actin/myosin)

6
Catalytic properties of enzymes
Michaelis-Menten equation

$$V = \frac{V_{\max}\,[S]}{K_m + [S]}$$

Reaction scheme: E + S ⇌ ES → E + P
  • E: enzyme
  • S: substrate
  • ES: enzyme-substrate complex (transition state)
  • P: product
  • Km: Michaelis constant
  • kcat: catalytic rate constant (turnover number)
  • kcat/Km: specificity constant (useful for
    comparison)

[Plot: reaction rate V (moles/s) against substrate concentration [S]; the curve approaches Vmax, and V = Vmax/2 at [S] = Km]
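The rate law on this slide can be checked numerically. The following is a minimal sketch; the function name and the example values for Vmax and Km are illustrative assumptions, not taken from the lecture.

```python
def michaelis_menten_rate(s, vmax, km):
    """Reaction rate V = Vmax * [S] / (Km + [S]).

    s    : substrate concentration [S] (same units as km)
    vmax : maximal rate
    km   : Michaelis constant
    """
    if s < 0 or km <= 0:
        raise ValueError("need [S] >= 0 and Km > 0")
    return vmax * s / (km + s)


# Hypothetical enzyme: Vmax = 10 (moles/s), Km = 2 (concentration units)
vmax, km = 10.0, 2.0
print(michaelis_menten_rate(km, vmax, km))     # 5.0, i.e. Vmax/2 at [S] = Km
print(michaelis_menten_rate(200.0, vmax, km))  # close to Vmax at high [S]
```

At [S] = Km the rate is exactly half of Vmax, which is how Km is read off the saturation curve.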
7
Protein interaction domains
http://pawsonlab.mshri.on.ca/html/domains.html
8
Energy difference upon binding
  • Examples of protein interactions (of
    functional importance) include:
  • Protein-protein (pathway analysis)
  • Protein-small molecule (drug interaction,
    structure decoding)
  • Protein-peptide, protein-DNA/RNA
  • The change in Gibbs free energy of the
    protein-ligand binding interaction can be
    monitored and expressed by the following
    equation:
  • $\Delta G = \Delta H - T\,\Delta S$
  • (H = enthalpy, S = entropy and T = temperature)
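As a small worked example of the Gibbs relation, the sketch below evaluates ΔG for one set of made-up values. The numbers and units (kJ/mol, K) are assumptions chosen for illustration only.

```python
def gibbs_free_energy(delta_h, temperature, delta_s):
    """Return dG = dH - T * dS.

    delta_h     : enthalpy change, kJ/mol
    temperature : absolute temperature, K
    delta_s     : entropy change, kJ/(mol*K)
    """
    return delta_h - temperature * delta_s


# Hypothetical binding event: favourable enthalpy, unfavourable entropy
delta_g = gibbs_free_energy(delta_h=-50.0, temperature=298.0, delta_s=-0.1)
print(round(delta_g, 1))  # -20.2
```

A negative ΔG means binding is favourable overall, even though the entropy term works against it in this example.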

10
Protein function
  • Many proteins combine functions
  • Some immunoglobulin structures are thought to
    have more than 100 different functions (and
    active/binding sites)
  • Alternative splicing can generate (partially)
    alternative structures

11
Protein function Interaction
Active site / binding cleft
Shape complementarity
12
Protein function evolution
Chymotrypsin
From a simple ancestral active site for cutting
protein chains...
... to a more elaborate active site with four
different features, all helping to optimise
proteolysis (cleavage).
Gene duplication has resulted in a two-domain
protein.
13
Protein function evolution
Chymotrypsin
The active site lies between the two domains. It
consists of residues on the same two loops
(firstly between beta-strands 3 and 4, secondly
between beta strands 5 and 6) of each of the two
barrel domains. Four features of the active site
are indicated in the figure.
The Substrate Specificity Pocket
Main Chain Substrate-binding
The Oxyanion Hole (white)
Catalytic triad
Chymotrypsin cleaves peptides at the carboxyl
side of tyrosine, tryptophan, and phenylalanine
because those three amino acids have large
aromatic side chains that fit the specificity
pocket.
14
How to infer function
  • Experiment
  • Deduction from sequence
  • Multiple sequence alignment conservation
    patterns
  • Homology searching
  • Deduction from structure
  • Threading
  • Structure-structure comparison
  • Homology modelling

15
A domain is a
  • Compact, semi-independent unit (Richardson,
    1981).
  • Stable unit of a protein structure that can fold
    autonomously (Wetlaufer, 1973).
  • Recurring functional and evolutionary module
    (Bork, 1992).
  • Nature is a tinkerer and not an inventor
    (Jacob, 1977).
  • Smallest unit of function

16
Delineating domains is essential for
  • Obtaining high resolution structures (X-ray, but
    particularly NMR, which is limited by protein size)
  • Sequence analysis
  • Multiple sequence alignment methods
  • Prediction algorithms (SS, Class,
    secondary/tertiary structure)
  • Fold recognition and threading
  • Elucidating the evolution, structure and function
    of a protein family (e.g. Rosetta Stone method)
  • Structural/functional genomics
  • Cross genome comparative analysis

17
Domain connectivity
linker
18
Structural domain organisation can be nasty
Pyruvate kinase (phosphotransferase)
  • β barrel regulatory domain
  • α/β barrel catalytic substrate binding domain
  • α/β nucleotide binding domain
1 continuous and 2 discontinuous domains
19
Domain size
  • The size of individual structural domains varies
    widely
  • from 36 residues in E-selectin to 692 residues in
    lipoxygenase-1 (Jones et al., 1998)
  • the majority (90%) having less than 200 residues
    (Siddiqui and Barton, 1995)
  • with an average of about 100 residues (Islam et
    al., 1995).
  • Small domains (less than 40 residues) are often
    stabilised by metal ions or disulphide bonds.
  • Large domains (greater than 300 residues) are
    likely to consist of multiple hydrophobic cores
    (Garel, 1992).

27
Analysis of chain hydrophobicity in multidomain
proteins
28
Analysis of chain hydrophobicity in multidomain
proteins
29
Domain characteristics
  • Domains are genetically mobile units, and
    multidomain families are found in all three
    kingdoms (Archaea, Bacteria and Eukarya)
    underlining the finding that Nature is a
    tinkerer and not an inventor (Jacob, 1977).
  • The majority of genomic proteins, 75% in
    unicellular organisms and more than 80% in
    metazoa, are multidomain proteins created as a
    result of gene duplication events (Apic et al.,
    2001).
  • Domains in multidomain structures are likely to
    have once existed as independent proteins, and
    many domains in eukaryotic multidomain proteins
    can be found as independent proteins in
    prokaryotes (Davidson et al., 1993).

30
Protein function evolution: gene (domain)
duplication
Active site
Chymotrypsin
31
Pyruvate phosphate dikinase
  • 3-domain protein
  • Two domains catalyse 2-step reaction
  • A → B → C
  • Third so-called swivelling domain actively
    brings intermediate enzymatic product (B) over
    45 Å from one active site to the other

33
  • The DEATH Domain
  • Present in a variety of Eukaryotic proteins
    involved with cell death.
  • Six helices enclose a tightly packed hydrophobic
    core.
  • Some DEATH domains form homotypic and
    heterotypic dimers.

http://www.mshri.on.ca/pawson
34
Detecting Structural Domains
  • A structural domain may be detected as a compact,
    globular substructure with more interactions
    within itself than with the rest of the structure
    (Janin and Wodak, 1983).
  • Therefore, a structural domain can be determined
    by two shape characteristics: compactness and its
    extent of isolation (Tsai and Nussinov, 1997).
  • Measures of local compactness in proteins have
    been used in many of the early methods of domain
    assignment (Rossmann et al., 1974; Crippen, 1978;
    Rose, 1979; Go, 1978) and in several of the more
    recent methods (Holm and Sander, 1994; Islam et
    al., 1995; Siddiqui and Barton, 1995; Zehfus,
    1997; Taylor, 1999).

35
Detecting Structural Domains
  • However, approaches encounter problems when faced
    with discontinuous or highly associated domains
    and many definitions will require manual
    interpretation.
  • Consequently there are discrepancies between
    assignments made by domain databases (Hadley and
    Jones, 1999).

36
Detecting Domains using Sequence only
  • Even more difficult than prediction from
    structure!

37
Integrating protein multiple sequence alignment,
secondary and tertiary structure prediction in
order to predict structural domain boundaries in
sequence data
SnapDRAGON
  • Richard A. George
  • George R.A. and Heringa, J. (2002) J. Mol. Biol.,
    316, 839-851.
  •  

38
Protein structure hierarchical levels
39
Protein structure hierarchical levels
40
Protein structure hierarchical levels
41
Protein structure hierarchical levels
42
SnapDRAGON: domain boundary prediction protocol
using sequence information alone (Richard George)
  • Input: multiple sequence alignment (MSA) and
    predicted secondary structure
  • Generate 100 DRAGON 3D models for the protein
    structure associated with the MSA
  • Assign domain boundaries to each of the 3D models
    (Taylor, 1999)
  • Sum proposed boundary positions within 100 models
    along the length of the sequence, and smooth
    boundaries using a weighted window

George R.A. and Heringa J.(2002) SnapDRAGON - a
method to delineate protein structural domains
from sequence data, J. Mol. Biol. 316, 839-851.
43
SnapDRAGON
[Figure: SnapDRAGON pipeline. A multiple alignment and its predicted secondary structure (e.g. CCHHHCCEEE) are used to generate folds with DRAGON; boundary recognition (Taylor, 1999) on each fold gives summed and smoothed boundaries.]
44
SnapDRAGON: domain boundary prediction protocol
using sequence information alone (Richard George)
  • Input: multiple sequence alignment (MSA)
  • Sequence searches using PSI-BLAST (Altschul et
    al., 1997)
  • followed by sequence redundancy filtering using
    OBSTRUCT (Heringa et al., 1992)
  • and alignment by PRALINE (Heringa, 1999)
  • and predicted secondary structure
  • PREDATOR secondary structure prediction program

45
Domain prediction using DRAGON
Distance Regularisation Algorithm for Geometry
OptimisatioN
(Aszódi and Taylor, 1994)
  • Folded protein models based on the requirement
    that (conserved) hydrophobic residues cluster
    together.
  • First construct a random high dimensional Cα
    distance matrix.
  • Distance geometry is used to find the 3D
    conformation corresponding to a prescribed target
    matrix of desired distances between residues.

46
SnapDRAGON: domain boundary prediction protocol
using sequence information alone (Richard George)
  • Generate 100 DRAGON (Aszódi and Taylor, 1994)
    models for the protein structure associated with
    the MSA
  • DRAGON folds proteins based on the requirement
    that (conserved) hydrophobic residues cluster
    together
  • (Predicted) secondary structures are used to
    further estimate distances between residues (e.g.
    between the first and last residue in a
    β-strand).
  • It first constructs a random high dimensional Cα
    (and pseudo Cβ) distance matrix
  • Distance geometry is used to find the 3D
    conformation corresponding to a prescribed matrix
    of desired distances between residues (by gradual
    inertia projection and based on input MSA and
    predicted secondary structure)
  • DRAGON Distance Regularisation Algorithm for
    Geometry OptimisatioN

47
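[Figure: DRAGON input data and embedding. The multiple alignment and predicted secondary structure (e.g. CCHHHCCEEE) are the input data; the Cα distance matrix and the target matrix are labelled N × N and the embedded model 3 × N. 100 randomised initial matrices give 100 predictions.]
  • The Cα distance matrix is divided into smaller
    clusters.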
  • Separately, each cluster is embedded into a local
    centroid.
  • The final predicted structure is generated from
    full embedding of the multiple centroids and
    their corresponding local structures.

48
Lysozyme (4lzm): PDB structure vs. DRAGON model
49
Methyltransferase (1sfe): PDB structure vs. DRAGON model
50
Phosphatase (2hhm-A): PDB structure vs. DRAGON model
51
Taylor method (1999): DOMAIN-3D
  • 3. Assign domain boundaries to each of the 3D
    models (Taylor, 1999)
  • Easy and clever method
  • Uses a notion of spin glass theory (disordered
    magnetic systems) to delineate domains in a
    protein 3D structure
  • Steps
  • Take sequence with residue numbers (1..N)
  • Look at neighbourhood of each residue (first
    shell)
  • If (average neighbourhood residue number > resno)
    then resno = resno + 1
  • else resno = resno - 1
  • If (convergence) then take regions with identical
    residue number as domains and terminate

Taylor, W.R. (1999) Protein structural domain
identification. Protein Engineering 12, 203-216.
52
Taylor method (1999)
Repeat until convergence: if 41 <
(5 + 6 + 56 + 78 + 89)/5 then Res 41 → 42 (up 1),
else Res 41 → 40 (down 1)
[Figure: residue 41 with first-shell neighbours 5, 6, 56, 78 and 89]
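The label update at the heart of the Taylor method can be sketched in a few lines. This is an illustrative simplification, not Taylor's published implementation: the function names are hypothetical, the neighbour lists are supplied by hand rather than computed from 3D coordinates, and leaving a label unchanged when it already equals the neighbourhood mean is an added assumption (the slide only gives the up/down rule).

```python
def update_label(label, neighbour_labels):
    """Move a residue label one step towards the mean label
    of its first-shell neighbours."""
    mean = sum(neighbour_labels) / len(neighbour_labels)
    if mean > label:
        return label + 1
    if mean < label:
        return label - 1
    return label  # assumption: stay put when already at the mean


def sweep(labels, neighbours):
    """Apply update_label once to every residue (synchronous pass).

    labels     : dict residue -> current label
    neighbours : dict residue -> list of neighbouring residues
    """
    return {
        res: update_label(lab, [labels[n] for n in neighbours[res]])
        for res, lab in labels.items()
    }


# Worked example from the slide: residue 41, neighbours 5, 6, 56, 78, 89
print(update_label(41, [5, 6, 56, 78, 89]))  # mean is 46.8, so 41 -> 42

# Toy three-residue cluster in which every residue touches the others
labels = {1: 1, 2: 2, 3: 3}
neighbours = {1: [2, 3], 2: [1, 3], 3: [1, 2]}
print(sweep(labels, neighbours))  # {1: 2, 2: 2, 3: 2}
```

In the toy cluster all labels become identical after one pass, which is the convergence signal the slide describes: residues sharing a label are reported as one domain.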
53
Taylor method (1999)
initial situation
Iterate until convergence
continuous
discontinuous
54
SnapDRAGON: domain boundary prediction protocol
using sequence information alone (Richard George)
  • Sum proposed boundary positions within 100 models
    along the length of the sequence, and smooth
    boundaries using a weighted window (assign
    central position)
  • Window score $= \sum_{1 \le i \le l} S_i W_i$
  • where $W_i = (p - |p - i|)/p^2$ and $p = \tfrac{1}{2}(n + 1)$.
  • It follows that $\sum_l W_i = 1$

[Plot: window weight Wi against window position i]
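The triangular window weights can be computed directly from the formula on this slide. The sketch below assumes an odd window length n that equals the summation limit l, which is the case in which the weights sum to exactly 1; the function names and the handling of sequence ends (positions outside the sequence are skipped) are illustrative choices, not details from the paper.

```python
def window_weights(n):
    """Weights W_i = (p - |p - i|) / p**2 with p = (n + 1) / 2, i = 1..n."""
    if n < 1 or n % 2 == 0:
        raise ValueError("window length must be a positive odd number")
    p = (n + 1) / 2
    return [(p - abs(p - i)) / p ** 2 for i in range(1, n + 1)]


def smooth(scores, n):
    """Weighted-window score assigned to the central position.

    scores : summed boundary counts S along the sequence
    n      : odd window length
    """
    weights = window_weights(n)
    half = n // 2
    smoothed = []
    for centre in range(len(scores)):
        total = 0.0
        for offset, weight in enumerate(weights):
            pos = centre - half + offset
            if 0 <= pos < len(scores):  # assumption: skip past the ends
                total += scores[pos] * weight
        smoothed.append(total)
    return smoothed


print(window_weights(3))            # [0.25, 0.5, 0.25]
print(smooth([0, 0, 4, 0, 0], 3))   # [0.0, 1.0, 2.0, 1.0, 0.0]
```

A single sharp peak of height 4 is spread over its neighbours while the total signal is preserved, because the weights sum to 1.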
55
SnapDRAGON
  • Statistical significance
  • Convert peak scores to Z-scores using
  • z = (x - mean)/stdev
  • If z > 2 then assign domain boundary
  • Statistical significance using random models
  • Test hydrophobic collapse given distribution of
    hydrophobicity over sequence
  • Make 5 scrambled multiple alignments (MSAs) and
    predict their secondary structure
  • Make 100 models for each MSA
  • Compile mean and stdev from the boundary
    distribution over the 500 random models
  • If observed peak z > 2.0 stdev (from random
    models) then assign domain boundary
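The Z-score filter can be sketched as follows. The function name, the toy scores and the use of the sample standard deviation are assumptions made for illustration; the slides do not say which estimator was used.

```python
from statistics import mean, stdev


def significant_boundaries(peak_scores, random_scores, threshold=2.0):
    """Return positions whose peak score lies more than `threshold`
    standard deviations above the mean of the random-model scores.

    peak_scores   : dict position -> smoothed boundary score
    random_scores : scores collected from the scrambled-alignment models
    """
    mu = mean(random_scores)
    sd = stdev(random_scores)  # assumption: sample standard deviation
    if sd == 0:
        raise ValueError("random scores have zero spread")
    return sorted(
        pos for pos, x in peak_scores.items() if (x - mu) / sd > threshold
    )


# Hypothetical numbers, not taken from the SnapDRAGON paper
random_scores = [1.0, 2.0, 3.0, 4.0, 5.0]
peaks = {30: 4.0, 75: 9.0}
print(significant_boundaries(peaks, random_scores))  # [75]
```

Only the peak at position 75 stands far enough above the random background to be called a domain boundary.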

56
SnapDRAGON prediction assessment
  • Test set of 414 multiple alignments: 183 single
    and 231 multiple domain proteins.
  • Boundary predictions are compared to the region
    of the protein connecting two domains (maximally
    ±10 residues from true boundary)

57
SnapDRAGON prediction assessment
  • Baseline method I
  • Divide sequence in equal parts based on number of
    domains predicted by SnapDRAGON
  • Baseline method II
  • Similar to Wheelan et al., based on domain length
    partition density function (PDF)
  • PDF derived from 2750 non-redundant structures
    (deposited at NCBI)
  • Given sequence, calculate probability of
    one-domain, two-domain, .., protein
  • Highest probability taken and sequence split
    equally as in baseline method I
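Baseline method I amounts to cutting the sequence into equal parts. A minimal sketch, with a hypothetical function name and integer division as an assumed rounding rule:

```python
def equal_split_boundaries(sequence_length, n_domains):
    """Boundary positions that cut a sequence into n_domains equal parts."""
    if n_domains < 1:
        raise ValueError("need at least one domain")
    return [(k * sequence_length) // n_domains for k in range(1, n_domains)]


print(equal_split_boundaries(300, 3))  # [100, 200]
print(equal_split_boundaries(250, 1))  # [] (single domain, no boundary)
```

Baseline method II would reuse the same split after choosing the most probable domain count from the length-based density function.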

58
Average prediction results per protein
Coverage is the % of linkers predicted
(TP/(TP+FN)). Success is the % of correct
predictions made (TP/(TP+FP)).
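Coverage and success can be computed per protein as sketched below. This is a simplified assumption about the matching rule: a prediction counts as correct when it falls within a tolerance of a true boundary position, whereas the assessment in the lecture compares against the linker region. The function name and the example positions are hypothetical.

```python
def assess(predicted, true_boundaries, tolerance=10):
    """Return (coverage, success) as fractions.

    coverage = TP / (TP + FN): share of true boundaries that were found
    success  = TP / (TP + FP): share of predictions that were correct
    """
    found = sum(
        1 for t in true_boundaries
        if any(abs(p - t) <= tolerance for p in predicted)
    )
    correct = sum(
        1 for p in predicted
        if any(abs(p - t) <= tolerance for t in true_boundaries)
    )
    coverage = found / len(true_boundaries) if true_boundaries else 0.0
    success = correct / len(predicted) if predicted else 0.0
    return coverage, success


# Hypothetical protein with two true boundaries and three predictions
coverage, success = assess([98, 150, 305], [100, 300])
print(coverage)           # 1.0
print(round(success, 2))  # 0.67
```

Both true boundaries are recovered, but one of the three predictions is a false positive, so success is two thirds.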
59
Average prediction results per protein
60
Protein-protein interaction networks
61
Protein Function Prediction
  • How can we get the edges (connections) of the
    cellular networks?
  • We can predict functions of genes or proteins so
    we know where they would fit in a metabolic
    network
  • There are also techniques to predict whether two
    proteins interact, either functionally (e.g. they
    are involved in a two-step metabolic process) or
    directly physically (e.g. are together in a
    protein complex)

62
Protein Function Prediction
The state of the art: it's not complete. Many
genes are not annotated, and many more are
partially or erroneously annotated. Given a
genome which is partially annotated at best, how
do we fill in the blanks? In each sequenced
genome, the functions of 20-50% of the encoded
proteins remain unknown! How then do we build a
reasonably complete network when the parts list
is so incomplete?
63
Protein Function Prediction
For all these reasons, improving automated
protein function prediction is now a cornerstone
of bioinformatics and computational biology. New
methods will need to integrate signals coming
from sequence, expression, interaction and
structural data, etc.
64
Classes of function prediction methods (recap)
  • Sequence based approaches
  • protein A has function X, and protein B is a
    homolog (ortholog) of protein A. Hence B has
    function X
  • Structure-based approaches
  • protein A has structure X, and X has such-and-such
    structural features. Hence A's functional sites are
    ...
  • Motif-based approaches
  • a group of genes have function X and they all
    have motif Y; protein A has motif Y. Hence
    protein A's function might be related to X
  • Function prediction based on guilt-by-association
  • gene A has function X and gene B is often
    associated with gene A, so B might have a function
    related to X

65
Phylogenetic profile analysis
  • Function prediction of genes based on
    guilt-by-association: a non-homologous
    approach
  • The phylogenetic profile of a protein is a string
    that encodes the presence or absence of the
    protein in every sequenced genome
  • Because proteins that participate in a common
    structural complex or metabolic pathway are
    likely to co-evolve, the phylogenetic profiles of
    such proteins are often similar
  • This means that such proteins have a good chance
    of being physically or metabolically connected

66
Phylogenetic profile analysis
  • Phylogenetic profile (against N genomes)
  • For each gene X in a target genome (e.g. E.
    coli), build a phylogenetic profile as follows:
  • If gene X has a homolog in genome i, the ith bit
    of X's phylogenetic profile is 1; otherwise it
    is 0
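The construction rule above translates directly into code. In this sketch the homology information is a hand-written lookup table; in practice it would come from a sequence search, and the gene and genome names here are purely hypothetical.

```python
def phylogenetic_profile(gene, genomes, homologs):
    """Build the bit string for one gene.

    genomes  : ordered list of genome names (bit i refers to genomes[i])
    homologs : dict gene -> set of genomes containing a homolog
    """
    present = homologs.get(gene, set())
    return "".join("1" if genome in present else "0" for genome in genomes)


# Hypothetical data for illustration
genomes = ["genomeA", "genomeB", "genomeC", "genomeD"]
homologs = {"geneX": {"genomeA", "genomeC"}}

print(phylogenetic_profile("geneX", genomes, homologs))  # 1010
print(phylogenetic_profile("geneY", genomes, homologs))  # 0000
```

Keeping the genome order fixed for every gene is what makes the resulting strings comparable position by position.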

67
Phylogenetic profile analysis
  • Example phylogenetic profiles based on 60
    genomes

(rows: genes; columns: genomes)
orf1034  1110110110010111110100010100000000111100011111110110111010101
orf1036  1011110001000001010000010010000000010111101110011011010000101
orf1037  1101100110000001110010000111111001101111101011101111000010100
orf1038  1110100110010010110010011100000101110101101111111111110000101
orf1039  1111111111111111111111111111111111111111101111111111111111101
orf104   1000101000000000000000101000000000110000000000000100101000100
orf1040  1110111111111101111101111100000111111100111111110110111111101
orf1041  1111111111111111110111111111111101111111101111111111111111101
orf1042  1110100101010010010110000100001001111110111110101101100010101
orf1043  1110100110010000010100111100100001111110101111011101000010101
orf1044  1111100111110010010111010111111001111111111111101101100010101
orf1045  1111110110110011111111111111111101111111101111111111110010101
orf1046  0101100000010001011000000111110000010100000001010010100000000
orf1047  0000000000000001000010000001000100000000000000010000000000000
orf105   0110110110100010111101101010111001101100101111100010000010001
orf1054  0100100110000001100001000100000000100100100001000100100000000
By correlating the rows (open reading frames
(ORFs) or genes) you find out about joint presence
or absence of genes; this is a signal for a
functional connection.
Genes with similar phylogenetic profiles have
related functions or are functionally linked
(D. Eisenberg and colleagues, 1999).
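One simple way to correlate rows is to count the positions at which two profiles differ (Hamming distance) and to link genes whose profiles are nearly identical. This is a sketch under that assumption; other similarity measures are possible, and the short profiles and the distance cut-off below are invented for illustration.

```python
from itertools import combinations


def hamming(a, b):
    """Number of genomes in which two profiles disagree."""
    if len(a) != len(b):
        raise ValueError("profiles must cover the same genomes")
    return sum(x != y for x, y in zip(a, b))


def linked_pairs(profiles, max_distance=1):
    """Gene pairs whose profiles differ in at most max_distance genomes."""
    return [
        (g1, g2)
        for g1, g2 in combinations(sorted(profiles), 2)
        if hamming(profiles[g1], profiles[g2]) <= max_distance
    ]


# Hypothetical six-genome profiles
profiles = {"orfA": "110110", "orfB": "110111", "orfC": "001001"}
print(linked_pairs(profiles))                       # [('orfA', 'orfB')]
print(hamming(profiles["orfA"], profiles["orfC"]))  # 6
```

Here orfA and orfB are predicted to be functionally linked, while orfC is the exact inverse of orfA, the pattern that the lecture associates with functional complementarity.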
68
Phylogenetic profile analysis
  • Phylogenetic profiles contain a great amount of
    functional information
  • Phylogenetic profile analysis can be used to
    distinguish orthologous genes from paralogous
    genes
  • Example: subcellular localization. 361 yeast
    nucleus-encoded mitochondrial proteins were
    identified at 50% accuracy with 58% coverage
    through phylogenetic profile analysis
  • Functional complementarity: by examining inverse
    phylogenetic profiles, one can find functionally
    complementary genes that might have evolved
    through one of several mechanisms of convergent
    evolution.
  • Phylogenetic profiling typically has low accuracy
    (specificity) but can have high coverage.

69
Domain fusion example
  • Vertebrates have a multi-enzyme protein
    (GARs-AIRs-GARt) comprising the enzymes GAR
    synthetase (GARs), AIR synthetase (AIRs), and GAR
    transformylase (GARt)
  • In insects, the polypeptide appears as
    GARs-(AIRs)2-GARt
  • In yeast, GARs-AIRs is encoded separately from
    GARt
  • In bacteria each domain is encoded separately
    (Henikoff et al., 1997).
  • GAR: glycinamide ribonucleotide
  • AIR: aminoimidazole ribonucleotide

70
Using observed domain fusion for prediction of
protein-protein interactions: Rosetta Stone
method
  • Gene fusion is an effective method for
    prediction of protein-protein interactions
  • If proteins A and B are homologous to two domains
    of a multi-domain protein C, A and B are
    predicted to interact

[Figure: separate proteins A and B aligned to the two domains of fused protein C]
Though gene fusion has low prediction coverage,
its false-positive rate is low (high specificity)
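The Rosetta Stone rule can be sketched as a search for composite proteins that contain both families. The data structures, function name and family labels below are hypothetical; real applications derive the family assignments from homology searches.

```python
from itertools import combinations


def rosetta_stone_links(single_proteins, composite_proteins):
    """Predict interactions between separately encoded proteins.

    single_proteins    : dict protein -> domain family it is homologous to
    composite_proteins : dict protein -> set of domain families it contains
    Returns sorted (protein_a, protein_b, rosetta_stone_protein) triples.
    """
    links = set()
    for (a, fam_a), (b, fam_b) in combinations(sorted(single_proteins.items()), 2):
        if fam_a == fam_b:
            continue  # same family: homologs, not a fusion signal
        for composite, families in composite_proteins.items():
            if fam_a in families and fam_b in families:
                links.add((a, b, composite))
    return sorted(links)


# Hypothetical example following the A, B, C naming on the slide
single_proteins = {"A": "fam1", "B": "fam2", "D": "fam3"}
composite_proteins = {"C": {"fam1", "fam2"}}
print(rosetta_stone_links(single_proteins, composite_proteins))
# [('A', 'B', 'C')]
```

Protein D gets no predicted partner because no composite protein contains its family, which reflects the low coverage noted on the slide.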
71
Protein interaction database
  • There are numerous databases of protein-protein
    interactions
  • DIP is a popular protein-protein interaction
    database

The DIP database catalogs experimentally
determined interactions between proteins. It
combines information from a variety of sources to
create a single, consistent set of
protein-protein interactions.
72
Protein interaction databases
  • BIND - Biomolecular Interaction Network Database
  • DIP - Database of Interacting Proteins
  • PIM - Hybrigenics
  • PathCalling - Yeast Interaction Database
  • MINT - a Molecular Interactions Database
  • GRID - The General Repository for Interaction
    Datasets
  • InterPreTS - protein interaction prediction
    through tertiary structure
  • STRING - predicted functional associations among
    genes/proteins
  • Mammalian protein-protein interaction database
    (PPI)
  • InterDom - database of putative interacting
    protein domains
  • FusionDB - database of bacterial and archaeal
    gene fusion events
  • IntAct Project
  • The Human Protein Interaction Database (HPID)
  • ADVICE - Automated Detection and Validation of
    Interaction by Co-evolution
  • InterWeaver - protein interaction reports with
    online evidence
  • PathBLAST - alignment of protein interaction
    networks
  • ClusPro - a fully automated algorithm for
    protein-protein docking
  • HPRD - Human Protein Reference Database

73
Protein interaction database
74
Network of protein interactions and predicted
functional links involving silencing information
regulator (SIR) proteins. Filled circles
represent proteins of known function; open
circles represent proteins of unknown function,
represented only by their Saccharomyces genome
sequence numbers
(http://genome-www.stanford.edu/Saccharomyces).
Solid lines show experimentally determined
interactions, as summarized in the Database of
Interacting Proteins (ref. 19;
http://dip.doe-mbi.ucla.edu). Dashed lines show
functional links predicted by the Rosetta Stone
method (ref. 12). Dotted lines show functional
links predicted by phylogenetic profiles (ref.
16). Some predicted links are omitted for clarity.
75
Network of predicted functional linkages
involving the yeast prion protein Sup35 (ref.
20). The dashed line shows the only
experimentally determined interaction. The other
functional links were calculated from genome and
expression data (ref. 11) by a combination of
methods, including phylogenetic profiles, Rosetta
Stone linkages and mRNA expression. Linkages
predicted by more than one method, and hence
particularly reliable, are shown by heavy lines.
Adapted from ref. 11.
76
STRING - predicted functional associations among
genes/proteins
  • STRING is a database of predicted functional
    associations among genes/proteins.
  • Genes of similar function tend to be maintained
    in close neighborhood, tend to be present or
    absent together, i.e. to have the same
    phylogenetic occurrence, and can sometimes be
    found fused into a single gene encoding a
    combined polypeptide.
  • STRING integrates this information from as many
    genomes as possible to predict functional links
    between proteins.

Berend Snel (UU), Martijn Huynen (RUN) and the
group of Peer Bork (EMBL, Heidelberg)
77
STRING - predicted functional associations among
genes/proteins
  • STRING is a database of known and predicted
    protein-protein interactions. The interactions
    include direct (physical) and indirect
    (functional) associations; they are derived from
    four sources:
  • Genomic Context (Synteny)
  • High-throughput Experiments 
  • (Conserved) Co-expression 
  • Previous Knowledge
  • STRING quantitatively integrates interaction
    data from these sources for a large number of
    organisms, and transfers information between
    these organisms where applicable. The database
    currently contains 736429 proteins in 179 species

78
STRING - predicted functional associations among
genes/proteins
Conserved Neighborhood: this view shows
runs of genes that occur repeatedly in close
neighborhood in (prokaryotic) genomes. Genes
located together in a run are linked with a black
line (maximum allowed intergenic distance is 300
bp). Note that if there are multiple runs for a
given species, these are separated by white
space. If there are other genes in the run that
are below the current score threshold, they are
drawn as small white triangles. Gene fusion
occurrences are also drawn, but only if they are
present in a run.
79
Wrapping up
  • Understand the chymotrypsin example: evolution via
    gene duplication of an optimised two-domain
    barrel enzyme with active site residues from
    either domain.
  • Understand domain issues: structural and
    functional
  • Understand the basic steps of the SnapDRAGON
    method for domain boundary prediction, but no
    need to memorize it all
  • Understand phylogenetic profiling and the Rosetta
    Stone method (guilt-by-association)
  • Understand that conservation patterns in the
    order of genes that are nearby on the genome
    (synteny) indicate functional relationships (used
    in STRING method)
  • Also co-expression (genes being expressed (or
    not) at the same time) indicates a functional
    relationship (used in STRING method)