Protein Fold recognition

1
Protein Fold recognition
  • Morten Nielsen,
  • Thomas Nordahl
  • CBS, BioCentrum,
  • DTU

2
Introduction: What is a protein fold?
  • Protein fold
  • Protein sequence identity (ID)
  • Protein sequence/structure databases
  • Alignment values
  • Scores, E-values, P-values
  • Protein classifications
  • Fold, Superfamily, Family, Protein

3
Introduction: What is a protein fold?
  • A protein fold is the scaffold that can be used
    as a template to model a query protein sequence.
  • Fold recognition is a technique for identifying
    which known protein structure provides that
    scaffold. Because the sequence similarity is low,
    the fold is difficult to recognize with simple
    sequence alignment tools (e.g. alignment with the
    BLOSUM62 matrix).

4
Outline
  • Many textbooks and experts state that sequence
    identity (ID) is the only determining factor for
    successful homology modeling
  • This is WRONG!
  • ID is a very poor measure of whether a protein
    can be modeled
  • Many sequences with only 10-15% sequence identity
    can be accurately modeled

5
Outline
  • Why homology modeling?
  • How is it done?
  • How to decide when to use homology modeling?
  • Why is ID such a terrible measure?
  • What are the best methods?

6
Why protein modeling?
  • Because it works!
  • Close to 50% of all new sequences can be homology
    modeled
  • Experimental effort to determine protein
    structure is very large and costly
  • The gap between the size of the protein sequence
    data and protein structure data is large and
    increasing

7
Homology modeling and the human genome
Human genome: 30,000 proteins
8
Swiss-Prot database
9
PDB New Fold Growth
[Figure legend: Old folds, New PDB structures, New folds]
10
PDB New Fold Growth
[Figure legend: New PDB structures]
11
PDB New Fold Growth
[Figure legend: New PDB structures]
12
Identification of fold
Rajesh Nair & Burkhard Rost, Protein Science, 2002, 11, 2836-47
13
Why ID is so bad!!
1200 models sharing 25-95% sequence identity with
the submitted sequences (www.expasy.ch/swissmod)
14
Identification of correct fold
  • ID is a poor measure
      • Many evolutionarily related proteins share low
        sequence identity
  • Alignment score is even worse
      • Many sequences will score high against
        everything (hydrophobic stretches)
  • P-value or E-value is more reliable

15
What are P and E values?
  • E-value
      • Number of hits expected in the database with a
        score higher than that of the match
      • Depends on database size
  • P-value
      • Probability that a random hit will have a score
        higher than that of the match
      • Independent of database size
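
The relationship between the two quantities can be sketched numerically. This is a minimal illustration, assuming the P-value is the chance that one random comparison beats the match, that the database holds `n_database` independent sequences, and that chance hits follow a Poisson distribution. The function names are hypothetical and not taken from any particular tool.

```python
import math

def expected_hits(p_single: float, n_database: int) -> float:
    """E-value: expected number of chance hits when a per-comparison
    P-value is applied to n_database independent comparisons."""
    return p_single * n_database

def prob_at_least_one(e_value: float) -> float:
    """Probability of seeing at least one chance hit, assuming the
    number of chance hits is Poisson distributed with mean e_value."""
    return -math.expm1(-e_value)

# Same per-comparison P-value, two database sizes (illustrative numbers).
p = 1e-6
e_small = expected_hits(p, 100_000)     # small database
e_large = expected_hits(p, 10_000_000)  # database 100 times larger
```

The per-comparison P-value is the same in both cases, while the E-value grows in proportion to the database size. This is why an E-value is only meaningful together with the database it was computed against.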

16
Protein classifications
17
Protein structure classification
18
Superfamilies
  • Proteins that are (remotely) evolutionarily
    related
  • Sequence similarity is low
  • Share function
  • Share special structural features
  • Same evolutionary ancestor
  • Relationships between members of a superfamily
    may not be readily recognizable from the sequence
    alone

Hierarchy: Fold > Superfamily > Family > Proteins
19
Template identification
  • Simple sequence-based methods
      • Align (BLAST) the query sequence against
        sequences of proteins with known structure
        (PDB database)
  • Sequence profile-based methods
      • Align a sequence profile (PSI-BLAST) against
        sequences of proteins with known structure (PDB)
      • Align a sequence profile against profiles of
        proteins with known structure (FFAS)
  • Sequence and structure-based methods
      • Align profile and predicted secondary structure
        against proteins with known structure (3D-PSSM)

20
Sequence profiles
  • In conventional alignment, a scoring matrix
    (BLOSUM62) gives the score for matching two amino
    acids
  • In reality, not all positions in a protein are
    equally likely to mutate
  • Some amino acids (e.g. in active sites) are highly
    conserved, and the penalty for a mismatch there
    must be very high
  • Other amino acids can mutate almost for free, and
    the penalty for a mismatch there should be lower
    than the BLOSUM penalty
  • Sequence profiles (just like an HMM) can capture
    these differences
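
A position-specific scoring matrix makes this idea concrete. The sketch below is a toy, assuming a uniform background distribution and a flat pseudocount. Real profile tools use sequence weighting and substitution-matrix-based pseudocounts, which are not reproduced here. `build_pssm` and the toy alignment are hypothetical names and data.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def build_pssm(aligned_seqs, pseudocount=0.1):
    """Return one dict per alignment column: amino acid -> log-odds (bits).

    Assumes a uniform background and a flat pseudocount per amino acid.
    Gap characters are ignored when counting."""
    length = len(aligned_seqs[0])
    if any(len(s) != length for s in aligned_seqs):
        raise ValueError("all aligned sequences must have the same length")
    background = 1.0 / len(AMINO_ACIDS)
    pssm = []
    for col in range(length):
        counts = Counter(s[col] for s in aligned_seqs if s[col] in AMINO_ACIDS)
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        pssm.append({
            aa: math.log2((counts[aa] + pseudocount) / total / background)
            for aa in AMINO_ACIDS
        })
    return pssm

# Toy alignment: columns 0 and 1 are fully conserved, column 2 is variable.
toy_alignment = ["GHA", "GHL", "GHV", "GHK"]
pssm = build_pssm(toy_alignment)
```

In the conserved first column only G scores positively and every substitution is penalized. In the variable third column A, L, V and K all score positively, so swapping one for another costs nothing. A single fixed matrix such as BLOSUM62 cannot express that difference between positions.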

21
Sequence profiles/blosum62 scores
a) TKAVVLTFNTSVEICLVMQGTSIV----AAESHPLHLHGFNFPSNFNLVDPMERNTAGVP
   TVNGQ--FPGPRLAGVAREGDQVLVKVVNHVAENITIHWHGVQLGTGWADGPAYVTQCPI
b) TKAVVLTFNTSVEICLVMQ-GTSIVAAESHPLHLHGFNFPSNFNLVDPMERNTAGVP
   TVNGQ--FPGPRLAGVAREGDQVLVKVVNHVAENITIHWHGVQLGTGWADGPAYVTQCPI
Which alignment is more correct, a) or b)?
BLOSUM62 scores: G-G = 6, H-H = 8
22
Blosum scoring matrix
   A  R  N  D  C  Q  E  G  H  I  L  K  M  F  P  S  T  W  Y  V
A  4 -1 -2 -2  0 -1 -1  0 -2 -1 -1 -1 -1 -2 -1  1  0 -3 -2  0
R -1  5  0 -2 -3  1  0 -2  0 -3 -2  2 -1 -3 -2 -1 -1 -3 -2 -3
N -2  0  6  1 -3  0  0  0  1 -3 -3  0 -2 -3 -2  1  0 -4 -2 -3
D -2 -2  1  6 -3  0  2 -1 -1 -3 -4 -1 -3 -3 -1  0 -1 -4 -3 -3
C  0 -3 -3 -3  9 -3 -4 -3 -3 -1 -1 -3 -1 -2 -3 -1 -1 -2 -2 -1
Q -1  1  0  0 -3  5  2 -2  0 -3 -2  1  0 -3 -1  0 -1 -2 -1 -2
E -1  0  0  2 -4  2  5 -2  0 -3 -3  1 -2 -3 -1  0 -1 -3 -2 -2
G  0 -2  0 -1 -3 -2 -2  6 -2 -4 -4 -2 -3 -3 -2  0 -2 -2 -3 -3
H -2  0  1 -1 -3  0  0 -2  8 -3 -3 -1 -2 -1 -2 -1 -2 -2  2 -3
I -1 -3 -3 -3 -1 -3 -3 -4 -3  4  2 -3  1  0 -3 -2 -1 -3 -1  3
L -1 -2 -3 -4 -1 -2 -3 -4 -3  2  4 -2  2  0 -3 -2 -1 -2 -1  1
K -1  2  0 -1 -3  1  1 -2 -1 -3 -2  5 -1 -3 -1  0 -1 -3 -2 -2
M -1 -1 -2 -3 -1  0 -2 -3 -2  1  2 -1  5  0 -2 -1 -1 -1 -1  1
F -2 -3 -3 -3 -2 -3 -3 -3 -1  0  0 -3  0  6 -4 -2 -2  1  3 -1
P -1 -2 -2 -1 -3 -1 -1 -2 -2 -3 -3 -1 -2 -4  7 -1 -1 -4 -3 -2
S  1 -1  1  0 -1  0  0  0 -1 -2 -2  0 -1 -2 -1  4  1 -3 -2 -2
T  0 -1  0 -1 -1 -1 -1 -2 -2 -1 -1 -1 -1 -2 -1  1  5 -2 -2  0
W -3 -3 -4 -4 -2 -2 -3 -2 -2 -3 -2 -3 -1  1 -4 -3 -2 11  2 -3
Y -2 -2 -2 -3 -2 -1 -2 -3  2 -1 -1 -2 -1  3 -3 -2 -2  2  7 -1
V  0 -3 -3 -3 -1 -2 -2 -3 -3  3  1 -2  1 -1 -2 -2  0 -3 -1  4
23
Sequence profiles
ADDGSLAFVPSEF--SISPGEKIVFKNNAGFPHNIVFDEDSIPSGVDASKISMSEEDLLN
TVNGAI--PGPLIAERLKEGQNVRVTNTLDEDTSIHWHGLLVPFGMDGVPGVSFPG---I
-TSMAPAFGVQEFYRTVKQGDEVTVTIT-----NIDQIED-VSHGFVVVNHGVSME---I
IE--KMKYLTPEVFYTIKAGETVYWVNGEVMPHNVAFKKGIV--GEDAFRGEMMTKD---
-TSVAPSFSQPSF-LTVKEGDEVTVIVTNLDE------IDDLTHGFTMGNHGVAME---V
ASAETMVFEPDFLVLEIGPGDRVRFVPTHK-SHNAATIDGMVPEGVEGFKSRINDE----
TVNGQ--FPGPRLAGVAREGDQVLVKVVNHVAENITIHWHGVQLGTGWADGPAYVTQCPI

TVNGQ--FPGPRLAGVAREGDQVLVKVVNHVAENITIHWHGVQLGTGWADGPAYVTQCPI
TKAVVLTFNTSVEICLVMQGTSIV----AAESHPLHLHGFNFPSNFNLVDPMERNTAGVP
Matching anything but G => large negative score
Anything can match
24
Sequence profiles
  1. Align (BLAST) the query sequence against a large
     sequence database (Swiss-Prot)
  2. Select significant alignments and make a profile
     (weight matrix) using techniques for sequence
     weighting and pseudo counts
  3. Use the weight matrix to align against the
     sequence database to find new significant hits
  4. Repeat 2 and 3 (normally 3 times!)
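
The loop above can be sketched as a toy program. This is not PSI-BLAST: it assumes equal-length, ungapped sequences, a uniform background, a flat pseudocount and a raw score threshold in place of an E-value cutoff. All function names and sequences are hypothetical.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def build_profile(seqs, pseudocount=0.5):
    """Log-odds weight matrix from equal-length, ungapped sequences."""
    background = 1.0 / len(AMINO_ACIDS)
    profile = []
    for column in zip(*seqs):
        counts = Counter(column)
        total = len(column) + pseudocount * len(AMINO_ACIDS)
        profile.append({
            aa: math.log2((counts[aa] + pseudocount) / total / background)
            for aa in AMINO_ACIDS
        })
    return profile

def profile_score(profile, seq):
    """Ungapped score of seq against the profile (same length assumed)."""
    return sum(column[aa] for column, aa in zip(profile, seq))

def iterative_search(query, database, threshold, iterations=3):
    """Search, rebuild the profile from the hits, and repeat."""
    hits = [query]
    for _ in range(iterations):
        profile = build_profile(hits)
        new_hits = [query] + [s for s in database
                              if profile_score(profile, s) >= threshold]
        if set(new_hits) == set(hits):
            break  # converged: no new sequences were recruited
        hits = new_hits
    return hits

# Toy database: a close homolog, a remote homolog, an unrelated sequence.
database = ["GHWAR", "GHFSR", "PLMTC"]
first_pass = iterative_search("GHWAK", database, threshold=4.0, iterations=1)
three_passes = iterative_search("GHWAK", database, threshold=4.0, iterations=3)
```

In this toy setup the first pass finds only the close homolog. The remote homolog is recruited in the second pass, once the profile has learned from the first hit which positions tolerate change.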

25
Example. Sequence profiles
  • Alignment of protein sequences 1PLC._ and 1GYC.A
      • E-value > 1000
  • Profile alignment
      • Align 1PLC._ against Swiss-Prot
      • Make a position-specific weight matrix from the
        alignment
      • Use this matrix to align 1PLC._ against 1GYC.A
      • E-value < 10^-22. RMSD = 3.3 Å

26
Sequence profiles
Score = 97.1 bits (241), Expect = 9e-22
Identities = 13/107 (12%), Positives = 27/107 (25%), Gaps = 17/107 (15%)

1PLC._  3 ADDGSLAFVPSEFSISPGEKI------VFKNNAGFPHNIVFDEDSIPSGVDASKIS 56
matches:  F G N G
1GYC.A 26 ------VFPSPLITGKKGDRFQLNVVDTLTNHTMLKSTSIHWHGFFQAGTNWADGP 79

1PLC._ 57 MSEEDLLNAKGETFEVAL---SNKGEYSFYCSP--HQGAGMVGKVTV 98
matches:  A G F G G G V
1GYC.A 80 AFVNQCPIASGHSFLYDFHVPDQAGTFWYHSHLSTQYCDGLRGPFVV 126

RMSD = 3.3 Å. Structure: red. Template: blue.
27
Sequence logo / Sequence profile
[Figure panels: 0 iterations (BLOSUM62), 1 iteration, 2 iterations, 3 iterations]
28
Profile-profile alignment
Query
Template
Compare amino acid preference for the two
proteins and pair similar positions (HHpred)
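
Comparing amino acid preferences at two positions can be sketched as a column-column score. The version below uses the log of the summed odds ratios, which is one common choice for profile-profile comparison. Whether any tool named on these slides uses exactly this form is not stated here. The uniform background, the `floor` parameter and the toy profiles are assumptions.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# Assumed uniform background frequencies; real tools use observed ones.
BACKGROUND = {aa: 1.0 / len(AMINO_ACIDS) for aa in AMINO_ACIDS}

def column_score(p, q, background=BACKGROUND, floor=1e-6):
    """log2 of sum_a p[a] * q[a] / background[a] for two profile columns.

    p and q map amino acids to frequencies; missing entries count as 0.
    The floor avoids log(0) when the columns share no amino acid."""
    odds = sum(p.get(aa, 0.0) * q.get(aa, 0.0) / bg
               for aa, bg in background.items())
    return math.log2(max(odds, floor))

def score_matrix(query_profile, template_profile):
    """Score every query column against every template column."""
    return [[column_score(p, q) for q in template_profile]
            for p in query_profile]

# Toy profiles with two columns each (hypothetical frequencies).
query = [{"G": 0.9, "A": 0.1}, {"L": 0.5, "V": 0.5}]
template = [{"I": 0.6, "L": 0.4}, {"G": 0.8, "S": 0.2}]
matrix = score_matrix(query, template)
```

Positions with similar preferences score positively and dissimilar ones score negatively. A dynamic-programming alignment over such a matrix would then pair the positions. That step is left out of this sketch.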
29
Including structure
  • Sequences within a protein superfamily share only
    remote sequence homology, but they share high
    structural homology
  • Structure is known for the template
  • Predict structural properties for the query
      • Secondary structure
      • Surface exposure
  • Position-specific gap penalties derived from
    secondary structure and surface exposure

30
Using structure
  • Sequence & structure profile-profile based
    alignments
  • Template profiles
      • Multiple structure alignments
      • Sequence-based profiles
  • Query profile
      • Sequence-based profile
      • Predicted secondary structure
  • Position-specific gap penalties derived from
    secondary structure

31
CASP: Which are the best methods?
  • Critical Assessment of Structure Prediction
  • Held every second year
  • Sequences from about-to-be-solved structures are
    given to groups, who submit their predictions
    before the structure is published
  • Modelers make predictions
  • Meeting in December where the correct answers are
    revealed

32
CASP6 results
33
The top 4 homology modeling groups in CASP6
  • All winners use consensus predictions
  • The wisdom of the crowd
  • Same approach as in CASP5!
  • Nothing has happened in 2 years!

34
The wisdom of the crowd!
  • Why the many are smarter than the few
  • A general method useful to improve prediction
    accuracy
  • No single method or expert will always be the
    best

35
The wisdom of the crowd!
  • The highest scoring hit will often be wrong
  • Not one single prediction method is consistently
    best
  • Many prediction methods will have the correct
    fold among the top 10-20 hits
  • If many different prediction methods all have the
    same fold among their top hits, this fold is
    probably correct
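
The consensus idea can be sketched as a simple vote. This is only an illustration: it counts how many methods list each fold among their top hits, and it is not the algorithm of any particular meta server. The server names and fold identifiers are invented.

```python
from collections import Counter

def consensus_fold(predictions, top_n=10):
    """Return (fold, number_of_methods) for the fold that appears in the
    top_n hits of the largest number of methods.

    predictions maps a method name to its ranked list of fold identifiers.
    Each method votes at most once per fold."""
    if not predictions:
        raise ValueError("no predictions given")
    votes = Counter()
    for ranked_hits in predictions.values():
        votes.update(set(ranked_hits[:top_n]))
    fold, count = votes.most_common(1)[0]
    return fold, count

# Hypothetical ranked hit lists from three prediction methods.
predictions = {
    "server_a": ["fold_1", "fold_7", "fold_3"],
    "server_b": ["fold_4", "fold_7", "fold_5"],
    "server_c": ["fold_7", "fold_9"],
}
best_fold, support = consensus_fold(predictions, top_n=3)
```

Here two of the three methods rank a different fold first, yet all three have `fold_7` among their top hits, so the vote recovers it.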

36
How to do it? Where is the crowd?
  • Meta prediction server
  • Web interface to a list of public protein
    structure prediction servers
  • Submit query sequence to all selected servers in
    one go
  • http://bioinfo.pl/meta/

37
Meta Server
38
From fold to structure
  • Flying to the moon has not made man conquer space
  • Finding the right fold does not allow you to make
    accurate protein models
  • Can allow prediction of protein function
  • Alignment is still a very hard problem
  • Most protein interactions are determined by the
    loops, and they are the least conserved parts of
    a protein structure

39
Ab initio protein modeling
Modeling of new protein folds
  • Only when everything else fails
  • Challenge
  • Close to impossible to model Nature's folding
    potential

40
A way to a solution
  • Glue the structure together piecewise from fragments.
  • Guide the process with an empirical/statistical potential

Fragments with correct local structure
Nature's potential
Empirical potential
41
Example (Rosetta web server)
www.bioinfo.rpi.edu/bystrc/hmmstr/server.php
Rosetta prediction
Structure
42
Take home message
  • Identifying the correct fold is only a small step
    towards successful homology modeling
  • Do not trust ID or alignment score to identify
    the fold. Use P-values
  • Use sequence profiles and local protein structure
    to align sequences
  • Do not trust one single prediction method; use
    consensus methods (3D Jury)
  • Only if everything else fails, use ab initio methods