Protein Fold recognition - PowerPoint PPT Presentation

Provided by: cbs6

Transcript and Presenter's Notes

1
Protein Fold recognition

Morten Nielsen,
Thomas Nordahl
CBS, BioCentrum,
DTU

2
Introduction: What is a protein fold?

Protein fold
Protein sequence id
Protein sequence/structure databases
Alignment values
Scores, E-values, P-values
Protein classifications
Fold, superfamily, family, protein

3
Introduction: What is a protein fold?

A protein fold is the scaffold that can be used
as a template to model a query protein sequence.
Fold recognition is a technique used to identify
that scaffold from a protein of known structure.
It is needed when sequence similarity is low, so
that the fold is difficult to recognize with
simple sequence alignment tools (e.g. pairwise
alignment with the BLOSUM62 matrix).

4
Outline

Many textbooks and experts state that sequence
identity (ID) is the only determining factor for
successful homology modeling
This is WRONG!
ID is a very poor measure for deciding whether a
protein can be modeled
Many sequences with only 10-15% sequence identity
to a template can be accurately modeled

5
Outline

Why homology modeling?
How is it done?
How to decide when to use homology modeling
Why is sequence identity (ID) such a terrible measure?
What are the best methods?

6
Why protein modeling?

Because it works!
Close to 50% of all new sequences can be homology
modeled
Experimental effort to determine protein
structure is very large and costly
The gap between the size of the protein sequence
data and protein structure data is large and
increasing

7
Homology modeling and the human genome
Human genome: about 30,000 proteins
8
Swiss-Prot database
9
PDB New Fold Growth
Old folds
New PDB structures
New folds
10
PDB New Fold Growth
New PDB structures
11
PDB New Fold Growth
New PDB structures
12
Identification of fold
Rajesh Nair and Burkhard Rost, Protein Science,
2002, 11, 2836-47
13
Why ID is so bad!
1200 models sharing 25-95% sequence identity with
the submitted sequences (www.expasy.ch/swissmod)
14
Identification of correct fold

ID is a poor measure
Many evolutionarily related proteins share low
sequence identity
Alignment score is even worse
Many sequences will score high against
everything (hydrophobic stretches)
P-value or E-value is more reliable

15
What are P and E values?

E-value
Expected number of hits in the database with a
score higher than that of the match
Depends on database size
P-value
Probability that a random hit will have a score
higher than that of the match
Independent of database size
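
As a rough illustration of how the two quantities relate, the sketch below assumes that chance hits are independent, so that a per-comparison P-value p and a database of N sequences give E = N * p expected chance hits, and that the probability of seeing at least one such hit is approximately 1 - exp(-E) (Poisson approximation). The function names and the numbers are illustrative assumptions, not taken from the slides.

```python
import math

def e_value(p_value: float, db_size: int) -> float:
    """Expected number of chance hits scoring at least as high as the match."""
    return p_value * db_size

def prob_at_least_one_hit(e: float) -> float:
    """Poisson approximation: probability of one or more chance hits."""
    return -math.expm1(-e)  # equals 1 - exp(-e), stable for small e

p = 1e-6  # hypothetical per-comparison P-value
for n in (10_000, 1_000_000):  # hypothetical database sizes
    e = e_value(p, n)
    print(n, e, prob_at_least_one_hit(e))
```

The same P-value corresponds to a larger E-value in a larger database, which is the database-size dependence noted above.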

16
Protein classifications
17
Protein structure classification
18
Superfamilies

Proteins that are (remotely) evolutionarily
related
Sequence similarity is low
Share function
Share special structural features
Same evolutionary ancestor
Relationships between members of a superfamily
may not be readily recognizable from the sequence
alone

Fold
Superfamily
Family
Proteins
19
Template identification

Simple sequence based methods
Align (BLAST) sequence against sequence of
proteins with known structure (PDB database)
Sequence profile based methods
Align sequence profile (PSI-BLAST) against
sequence of proteins with known structure (PDB)
Align sequence profile against profile of
proteins with known structure (FFAS)
Sequence and structure based methods
Align profile and predicted secondary structure
against proteins with known structure (3D-PSSM)

20
Sequence profiles

In conventional alignment, a scoring matrix
(BLOSUM62) gives the score for matching two amino
acids
In reality, not all positions in a protein are
equally likely to mutate
Some amino acids (active sites) are highly
conserved, and the penalty for a mismatch must be
very high
Other amino acids can mutate almost for free, and
the penalty for a mismatch is lower than the
BLOSUM score suggests
Sequence profiles (just like an HMM) can capture
these differences
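
A minimal sketch of this idea: per-column log-odds scores computed from the residues observed at one alignment position. It assumes a uniform background frequency of 1/20 per amino acid and a simple additive pseudocount, both simplifications chosen for illustration; the two columns are made-up data.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def column_scores(column: str, pseudocount: float = 1.0) -> dict:
    """Log-odds score (in bits) for each amino acid at one alignment column.

    Assumes a uniform background of 1/20 per amino acid (a simplification).
    """
    residues = [aa for aa in column if aa in AMINO_ACIDS]  # skip gap characters
    counts = Counter(residues)
    total = len(residues) + pseudocount * len(AMINO_ACIDS)
    background = 1.0 / len(AMINO_ACIDS)
    return {
        aa: math.log2(((counts[aa] + pseudocount) / total) / background)
        for aa in AMINO_ACIDS
    }

conserved = column_scores("G" * 100)        # hypothetical column: always glycine
variable = column_scores(AMINO_ACIDS * 5)   # hypothetical column: all residues equally common

print(round(conserved["G"], 2), round(conserved["A"], 2))
print(round(variable["G"], 2), round(variable["A"], 2))
```

At the conserved column a glycine scores high and any other residue scores strongly negative, while at the variable column every residue scores about zero, so a substitution there costs almost nothing.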

21
Sequence profiles/BLOSUM62 scores
a)
TKAVVLTFNTSVEICLVMQGTSIV----AAESHPLHLHGFNFPSNFNLVDPMERNTAGVP
TVNGQ--FPGPRLAGVAREGDQVLVKVVNHVAENITIHWHGVQLGTGWADGPAYVTQCPI
b)
TKAVVLTFNTSVEICLVMQ-GTSIVAAESHPLHLHGFNFPSNFNLVDPMERNTAGVP
TVNGQ--FPGPRLAGVAREGDQVLVKVVNHVAENITIHWHGVQLGTGWADGPAYVTQCPI
Which alignment is more correct, a) or b)?
BLOSUM62 scores: G-G 6, H-H 8
22
BLOSUM scoring matrix
   A  R  N  D  C  Q  E  G  H  I  L  K  M  F  P  S  T  W  Y  V
A  4 -1 -2 -2  0 -1 -1  0 -2 -1 -1 -1 -1 -2 -1  1  0 -3 -2  0
R -1  5  0 -2 -3  1  0 -2  0 -3 -2  2 -1 -3 -2 -1 -1 -3 -2 -3
N -2  0  6  1 -3  0  0  0  1 -3 -3  0 -2 -3 -2  1  0 -4 -2 -3
D -2 -2  1  6 -3  0  2 -1 -1 -3 -4 -1 -3 -3 -1  0 -1 -4 -3 -3
C  0 -3 -3 -3  9 -3 -4 -3 -3 -1 -1 -3 -1 -2 -3 -1 -1 -2 -2 -1
Q -1  1  0  0 -3  5  2 -2  0 -3 -2  1  0 -3 -1  0 -1 -2 -1 -2
E -1  0  0  2 -4  2  5 -2  0 -3 -3  1 -2 -3 -1  0 -1 -3 -2 -2
G  0 -2  0 -1 -3 -2 -2  6 -2 -4 -4 -2 -3 -3 -2  0 -2 -2 -3 -3
H -2  0  1 -1 -3  0  0 -2  8 -3 -3 -1 -2 -1 -2 -1 -2 -2  2 -3
I -1 -3 -3 -3 -1 -3 -3 -4 -3  4  2 -3  1  0 -3 -2 -1 -3 -1  3
L -1 -2 -3 -4 -1 -2 -3 -4 -3  2  4 -2  2  0 -3 -2 -1 -2 -1  1
K -1  2  0 -1 -3  1  1 -2 -1 -3 -2  5 -1 -3 -1  0 -1 -3 -2 -2
M -1 -1 -2 -3 -1  0 -2 -3 -2  1  2 -1  5  0 -2 -1 -1 -1 -1  1
F -2 -3 -3 -3 -2 -3 -3 -3 -1  0  0 -3  0  6 -4 -2 -2  1  3 -1
P -1 -2 -2 -1 -3 -1 -1 -2 -2 -3 -3 -1 -2 -4  7 -1 -1 -4 -3 -2
S  1 -1  1  0 -1  0  0  0 -1 -2 -2  0 -1 -2 -1  4  1 -3 -2 -2
T  0 -1  0 -1 -1 -1 -1 -2 -2 -1 -1 -1 -1 -2 -1  1  5 -2 -2  0
W -3 -3 -4 -4 -2 -2 -3 -2 -2 -3 -2 -3 -1  1 -4 -3 -2 11  2 -3
Y -2 -2 -2 -3 -2 -1 -2 -3  2 -1 -1 -2 -1  3 -3 -2 -2  2  7 -1
V  0 -3 -3 -3 -1 -2 -2 -3 -3  3  1 -2  1 -1 -2 -2  0 -3 -1  4
23
Sequence profiles
ADDGSLAFVPSEF--SISPGEKIVFKNNAGFPHNIVFDEDSIPSGVDASKISMSEEDLLN
TVNGAI--PGPLIAERLKEGQNVRVTNTLDEDTSIHWHGLLVPFGMDGVPGVSFPG---I
-TSMAPAFGVQEFYRTVKQGDEVTVTIT-----NIDQIED-VSHGFVVVNHGVSME---I
IE--KMKYLTPEVFYTIKAGETVYWVNGEVMPHNVAFKKGIV--GEDAFRGEMMTKD---
-TSVAPSFSQPSF-LTVKEGDEVTVIVTNLDE------IDDLTHGFTMGNHGVAME---V
ASAETMVFEPDFLVLEIGPGDRVRFVPTHK-SHNAATIDGMVPEGVEGFKSRINDE----
TVNGQ--FPGPRLAGVAREGDQVLVKVVNHVAENITIHWHGVQLGTGWADGPAYVTQCPI

TVNGQ--FPGPRLAGVAREGDQVLVKVVNHVAENITIHWHGVQLGTGWADGPAYVTQCPI
TKAVVLTFNTSVEICLVMQGTSIV----AAESHPLHLHGFNFPSNFNLVDPMERNTAGVP

Matching anything but G: large negative score
Anything can match
24
Sequence profiles

1. Align (BLAST) the query sequence against a large
sequence database (Swiss-Prot)
2. Select significant alignments and build a profile
(weight matrix), using sequence weighting and
pseudocounts
3. Use the weight matrix to search the sequence
database for new significant hits
4. Repeat steps 2 and 3 (normally 3 times!)
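
The loop above can be sketched on toy data. This is a deliberately simplified illustration, not PSI-BLAST itself: it assumes pre-aligned, gap-free sequences of equal length, a uniform background, no sequence weighting, and a raw score threshold in place of an E-value cutoff. All sequences, names, and the threshold are hypothetical.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def build_pssm(aligned, pseudocount=1.0):
    """One log-odds column (dict) per position; uniform background assumed."""
    pssm = []
    for column in zip(*aligned):
        counts = Counter(column)
        total = len(column) + pseudocount * len(AMINO_ACIDS)
        pssm.append({
            aa: math.log2((counts[aa] + pseudocount) / total * len(AMINO_ACIDS))
            for aa in AMINO_ACIDS
        })
    return pssm

def score(pssm, seq):
    """Ungapped score of a sequence with the same length as the profile."""
    return sum(col[aa] for col, aa in zip(pssm, seq))

def iterative_search(query, database, threshold, iterations=3):
    """Toy search/rebuild loop; returns the query plus the accepted hits."""
    hits = [query]
    for _ in range(iterations):
        pssm = build_pssm(hits)
        new_hits = [query] + [s for s in database if score(pssm, s) >= threshold]
        if set(new_hits) == set(hits):
            break  # converged: the profile recruits no new sequences
        hits = new_hits
    return hits

# Hypothetical toy data: equal-length, gap-free sequences.
database = ["GAIHW", "GLIHF", "PKDEE"]
found = iterative_search("GAVHW", database, threshold=3.0)
print(found)
```

In this toy run the more distant sequence GLIHF scores below the threshold against the query alone and is only picked up once the profile has been rebuilt with the first hit, which is the point of iterating.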

25
Example. Sequence profiles

Alignment of protein sequences 1PLC._ and 1GYC.A
E-value > 1000
Profile alignment
Align 1PLC._ against Swiss-Prot
Make position-specific weight matrix from the
alignment
Use this matrix to align 1PLC._ against 1GYC.A
E-value < 10^-22, RMSD = 3.3 Å

26
Sequence profiles

Score = 97.1 bits (241), Expect = 9e-22
Identities = 13/107 (12%), Positives = 27/107 (25%), Gaps = 17/107 (15%)

1PLC._   3 ADDGSLAFVPSEFSISPGEKI------VFKNNAGFPHNIVFDEDSIPSGVDASKIS  56
                  F         G            N                 G
1GYC.A  26 ------VFPSPLITGKKGDRFQLNVVDTLTNHTMLKSTSIHWHGFFQAGTNWADGP  79

1PLC._  57 MSEEDLLNAKGETFEVAL---SNKGEYSFYCSP--HQGAGMVGKVTV  98
                   A G  F          G              G  G   V
1GYC.A  80 AFVNQCPIASGHSFLYDFHVPDQAGTFWYHSHLSTQYCDGLRGPFVV 126

RMSD = 3.3 Å. Structure: red. Template: blue.
27
Sequence logo / Sequence profile
0 iterations (BLOSUM62)
1 iteration
2 iterations
3 iterations
28
Profile-profile alignment
Query
Template
Compare amino acid preference for the two
proteins and pair similar positions (HHpred)
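
One simple way to compare amino acid preferences at two positions is the dot product of their frequency vectors, sketched below. This is an illustrative stand-in only and is not the scoring function used by HHpred or any other specific tool; the columns are made-up data.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def frequencies(column: str) -> list:
    """Amino acid frequencies of one alignment column (gaps ignored)."""
    residues = [aa for aa in column if aa in AMINO_ACIDS]
    counts = Counter(residues)
    return [counts[aa] / len(residues) for aa in AMINO_ACIDS]

def column_similarity(col_a: str, col_b: str) -> float:
    """Dot product of the two frequency vectors (0 = no shared preference)."""
    return sum(a * b for a, b in zip(frequencies(col_a), frequencies(col_b)))

query_col = "GGGGA"            # hypothetical query profile column
template_similar = "GGGAG"     # hypothetical template column, same preference
template_different = "LLIVL"   # hypothetical template column, hydrophobic

sim_same = column_similarity(query_col, template_similar)
sim_diff = column_similarity(query_col, template_different)
print(sim_same, sim_diff)
```

A profile-profile aligner would use a score of this kind in place of a single substitution-matrix lookup when deciding which query position to pair with which template position.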
29
Including structure

Sequences within a protein superfamily share only
remote sequence homology, but they share high
structural homology
Structure is known for template
Predict structural properties for query
Secondary structure
Surface exposure
Position specific gap penalties derived from
secondary structure and surface exposure
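
A minimal sketch of position-specific gap penalties, assuming that gaps should cost more inside helices and strands than in coil, and more at buried than at exposed positions. The multipliers and the base penalty are hypothetical values chosen for illustration; the slides do not give the actual scheme.

```python
# Hypothetical multipliers: gaps inside helices (H) and strands (E) are
# penalized twice as much as gaps in coil (C).
SS_FACTOR = {"H": 2.0, "E": 2.0, "C": 1.0}

def gap_open_penalties(secondary_structure: str, exposure: list, base: float = 10.0) -> list:
    """Per-position gap-open penalty from secondary structure and exposure.

    secondary_structure: string over H (helix), E (strand), C (coil).
    exposure: relative surface exposure per position, 0 (buried) to 1 (exposed).
    """
    if len(secondary_structure) != len(exposure):
        raise ValueError("inputs must have the same length")
    return [
        base * SS_FACTOR[ss] * (1.5 - 0.5 * acc)  # buried: x1.5, exposed: x1.0
        for ss, acc in zip(secondary_structure, exposure)
    ]

penalties = gap_open_penalties("HHCCE", [0.0, 1.0, 1.0, 0.0, 0.5])
print(penalties)
```

With such a vector, an alignment algorithm looks up the gap cost per position instead of using one global value, so insertions and deletions are steered towards exposed loops.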

30
Using structure

Sequence and structure profile-profile based
alignments
Template profiles
Multiple structure alignments
Sequence based profiles
Query profile
Sequence based profile
Predicted secondary structure
Position specific gap penalties derived from
secondary structure

31
CASP: Which are the best methods?

Critical Assessment of Structure Prediction
Held every second year
Sequences from about-to-be-solved structures are
given to groups, who submit their predictions
before the structure is published
Modelers make predictions
Meeting in December where the correct answers are
revealed

32
CASP6 results
33
The top 4 homology modeling groups in CASP6

All winners use consensus predictions
The wisdom of the crowd
Same approach as in CASP5!
Nothing has happened in 2 years!

34
The wisdom of the crowd!

Why the many are smarter than the few
A general method useful to improve prediction
accuracy
No single method or expert will always be the
best

35
The wisdom of the crowd!

The highest scoring hit will often be wrong
Not one single prediction method is consistently
best
Many prediction methods will have the correct
fold among the top 10-20 hits
If many different prediction methods all have
same fold among the top hits, this fold is
probably correct
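
The voting idea can be sketched as follows: each method contributes one vote to every fold among its top hits, and folds are ranked by vote count. This is a simplified illustration and not the 3D-Jury algorithm; the server names and fold labels are hypothetical.

```python
from collections import Counter

def consensus_fold(predictions: dict, top_n: int = 10) -> list:
    """Rank folds by how many methods list them among their top_n hits."""
    votes = Counter()
    for ranked_folds in predictions.values():
        votes.update(set(ranked_folds[:top_n]))  # one vote per method per fold
    return votes.most_common()

# Hypothetical ranked hit lists from three prediction methods.
predictions = {
    "server_a": ["fold_x", "fold_y", "fold_z"],
    "server_b": ["fold_w", "fold_y", "fold_x"],
    "server_c": ["fold_v", "fold_u", "fold_y"],
}
ranking = consensus_fold(predictions, top_n=3)
print(ranking[0])
```

In this made-up example no method ranks fold_y first, yet it is the only fold that all three methods place among their top hits, so it wins the vote.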

36
How to do it? Where is the crowd?

Meta prediction server
Web interface to a list of public protein
structure prediction servers
Submit query sequence to all selected servers in
one go
http://bioinfo.pl/meta/

37
Meta Server
38
From fold to structure

Flying to the moon has not made man conquer space
Finding the right fold does not allow you to make
accurate protein models
Can allow prediction of protein function
Alignment is still a very hard problem
Most protein interactions are determined by the
loops, and they are the least conserved parts of
a protein structure

39
Ab initio protein modeling
Modeling of new protein folds

Only when everything else fails
Challenge
Close to impossible to model Nature's folding
potential

40
A way to solution

Glue the structure together piecewise from fragments.
Guide the process by an empirical/statistical potential

Fragments with correct local structure
Nature's potential
Empirical potential
41
Example (Rosetta web server)
www.bioinfo.rpi.edu/bystrc/hmmstr/server.php
Rosetta prediction
Structure
42
Take home message

Identifying the correct fold is only a small step
towards successful homology modeling
Do not trust ID or alignment score to identify
the fold. Use P-values
Use sequence profiles and local protein structure
to align sequences
Do not trust one single prediction method; use
consensus methods (3D Jury)
Only if everything else fails, use ab initio methods
