Title: Protein Fold recognition
1Protein Fold recognition
- Morten Nielsen,
- CBS, BioCentrum,
- DTU
2Outline
- Why model protein structure
- Classification of protein structures
- Fold, Superfamily, Family
- Protein homology modeling
- Template (fold) recognition
- Alignment
- Side chain modeling
- Loop modeling
- Reliability measures
- Sequence identity (ID) bad, P-value good
- Historical overview
- Blast (simple alignment)
- Psi Blast (profiles)
- Profile-profile alignment
- Structural features
- Recombinant or democratic homology modeling
- Best methods
3Why protein modeling?
- Experimental effort to determine protein
structure is very large and costly
- The gap between the size of the protein sequence
data and protein structure data is large and
increasing
- Close to 50% of all new sequences can be homology
modeled
4Swiss-Prot database
5PDB New Fold Growth
[Figure: new PDB structures, divided into old folds and new folds]
- The number of unique folds in nature is fairly
small (possibly a few thousands)
- 90% of new structures submitted to PDB in the
past three years have similar structural folds in
PDB
6Protein classification
- Number of protein sequences grows exponentially
- Number of solved structures grows exponentially
- Number of new folds identified is very small (and
close to constant)
- Protein classification can
- Generate overview of structure types
- Detect similarities (evolutionary relationships)
between protein sequences
7Protein structure classification
8Classification schemes
- SCOP
- Manual classification (A. Murzin)
- CATH
- Semi manual classification (C. Orengo)
- FSSP
- Automatic classification (L. Holm)
9Levels in SCOP
Class                               Folds  Superfamilies  Families
All alpha proteins                    202            342       550
All beta proteins                     141            280       529
Alpha and beta proteins (a/b)         130            213       593
Alpha and beta proteins (a+b)         260            386       650
Multi-domain proteins                  40             40        55
Membrane and cell surface proteins     42             82        91
Small proteins                         72            104       162
Total                                 887           1447      2630
Source: http://scop.berkeley.edu/count.html#scop-1.67
10Major classes in SCOP
- Classes
- All alpha proteins
- All beta proteins
- Alpha and beta proteins (a/b)
- Alpha and beta proteins (a+b)
- Multi-domain proteins
- Membrane and cell surface proteins
- Small proteins
11All alpha: Hemoglobin (1bab)
12All beta: Immunoglobulin (8fab)
13Alpha/beta (a/b): Triosephosphate isomerase (1hti)
14Alpha+beta (a+b): Lysozyme (1jsf)
15Families
- Proteins whose evolutionary relationship is
readily recognizable from the sequence (25%
sequence identity)
- Families are further subdivided into Proteins
- Proteins are divided into Species
- The same protein may be found in several species
16Superfamilies
- Proteins which are (remotely) evolutionarily
related
- Sequence similarity low
- Share function
- Share special structural features
- Relationships between members of a superfamily
may not be readily recognizable from the sequence
alone
17Folds
- Proteins which have 50% of their secondary
structure elements arranged in the same order
in the protein chain and in three dimensions are
classified as having the same fold
- No evolutionary relation between proteins
- Confusingly also called fold classes
18Links
- PDB (protein structure database)
- www.rcsb.org/pdb/
- SCOP (protein classification database)
- scop.berkeley.edu
- CATH (protein classification database)
- www.biochem.ucl.ac.uk/bsm/cath
- FSSP (protein classification database)
- www.ebi.ac.uk/dali/fssp/fssp.html
19Superfamilies
[Figure: classification hierarchy, from top to bottom: Fold, Superfamily, Family, Proteins]
20Model accuracy. Swiss-model. 1200 models sharing
25-95% sequence identity with the submitted
sequences (www.expasy.ch/swissmod)
21Identification of fold
- If sequence similarity is high proteins share
structure (Safe zone)
- If sequence similarity is low proteins may share
structure (Twilight zone)
- Most proteins do not have a partner with high
sequence similarity
Rajesh Nair and Burkhard Rost, Protein Science,
2002, 11, 2836-47
22Identification of correct fold
- Sequence identity (ID) is a poor measure
- Many evolutionarily related proteins share low
sequence identity
- Alignment score even worse
- Many sequences will score high against everything
(hydrophobic stretches)
- P-value or E-value more reliable
23P and E values
Example: Score = 150. 10 hits with higher score (E = 10). 10000
hits in database, so P = 10/10000 = 0.001
- E-value
- Number of expected hits in database with score
higher than match
- Depends on database size
- P-value
- Probability that a random hit will have score
higher than match
- Database size independent
[Figure: score distribution, P(Score) plotted against Score]
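The arithmetic of the example on this slide can be reproduced with a minimal sketch. It assumes the simple empirical definitions given in the bullets (E counts database hits scoring above the match, P is that count divided by the number of hits); real search tools estimate these values from a fitted score distribution instead. The function name `empirical_e_and_p` and the toy score list are illustrative only.

```python
from typing import Sequence, Tuple


def empirical_e_and_p(match_score: float,
                      database_scores: Sequence[float]) -> Tuple[float, float]:
    """Return (E, P) for a match, given scores of unrelated database hits."""
    n_hits = len(database_scores)
    if n_hits == 0:
        raise ValueError("database_scores must not be empty")
    # E: number of database hits that score higher than the match.
    e_value = float(sum(1 for s in database_scores if s > match_score))
    # P: fraction of database hits that score higher than the match.
    p_value = e_value / n_hits
    return e_value, p_value


# Toy data mirroring the slide: 10 of 10000 hits score above 150.
scores = [200.0] * 10 + [100.0] * 9990
e, p = empirical_e_and_p(150.0, scores)
print(e, p)  # 10.0 0.001
```

Doubling the database while keeping the same score distribution doubles E but leaves P unchanged, which is the size dependence noted in the bullets.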
24Protein Homology modeling
- Identify fold (template) for modeling
- Find the structure in the PDB database that
resembles the unknown structure the most
- Can be used to predict function
- Align protein sequence to template
- Simple alignment methods
- Sequence profiles
- Threading methods
- Pseudo force fields
- Model side chains and loops
25Template identification
- Simple sequence based methods
- Align (BLAST) sequence against sequence of
proteins with known structure (PDB database)
- Sequence profile based methods
- Align sequence profile (Psi-BLAST) against
sequence of proteins with known structure (PDB)
- Align sequence profile against profile of
proteins with known structure (FFAS)
- Sequence and structure based methods
- Align profile and predicted secondary structure
against proteins with known structure (3D-PSSM)
26Template identification
- Threading methods
- Align sequence against structural environment of
proteins with known structure
- Use biological information
- Functional annotation in databases
- Active sites
27Sequence profiles
- In conventional alignment, a scoring matrix
(BLOSUM62) gives the score for matching two amino
acids
- In reality not all positions in a protein are
equally likely to mutate
- Some amino acids (active sites) are highly
conserved, and the penalty for a mismatch must be
very high
- Other amino acids mutate almost for free, and
the penalty for a mismatch is lower than the one
given by BLOSUM
- Sequence profiles can capture this
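As a rough illustration of how a profile captures position-specific conservation, the sketch below builds a log-odds weight matrix from a tiny made-up alignment. It is a simplification under stated assumptions: a uniform background frequency, a flat pseudocount, and no sequence weighting. The names `build_pssm`, `score` and `toy_alignment` are hypothetical and not taken from any tool mentioned in the lecture.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(alignment, pseudocount=0.5):
    """Return one dict (residue -> log-odds score) per alignment column."""
    background = 1.0 / len(AMINO_ACIDS)  # assumption: uniform background
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")
    pssm = []
    for col in range(length):
        # Count observed residues in this column; gap characters are skipped.
        counts = Counter(seq[col] for seq in alignment if seq[col] in AMINO_ACIDS)
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        pssm.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        })
    return pssm


def score(pssm, sequence):
    """Score an ungapped sequence of the same length as the profile."""
    return sum(column[aa] for column, aa in zip(pssm, sequence))


# Made-up alignment: column 1 is fully conserved (G), column 2 is variable.
toy_alignment = ["GAK", "GSR", "GTK", "GLE"]
pssm = build_pssm(toy_alignment)
print(round(pssm[0]["G"], 2), round(pssm[0]["A"], 2), round(pssm[1]["A"], 2))
```

In the conserved column only G scores positively, while in the variable column several residues do, so replacing G costs far more than swapping residues at the variable position.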
28Sequence profiles
ADDGSLAFVPSEF--SISPGEKIVFKNNAGFPHNIVFDEDSIPSGVDASK
ISMSEEDLLN TVNGAI--PGPLIAERLKEGQNVRVTNTLDEDTSIHW
HGLLVPFGMDGVPGVSFPG---I -TSMAPAFGVQEFYRTVKQGDEVTV
TIT-----NIDQIED-VSHGFVVVNHGVSME---I
IE--KMKYLTPEVFYTIKAGETVYWVNGEVMPHNVAFKKGIV--GEDAFR
GEMMTKD--- -TSVAPSFSQPSF-LTVKEGDEVTVIVTNLDE------
IDDLTHGFTMGNHGVAME---V ASAETMVFEPDFLVLEIGPGDRVRFV
PTHK-SHNAATIDGMVPEGVEGFKSRINDE----
TVNGQ--FPGPRLAGVAREGDQVLVKVVNHVAENITIHWHGVQLGTGWAD
GPAYVTQCPI
TKAVVLTFNTSVEICLVMQGTSIV----AAESHPLHLHGFNFPSNFNLVD
PMERNTAGVP
Matching anything but G: large negative score
Anything can match
29Sequence profiles
- Align (BLAST) sequence against large sequence
database (Swiss-Prot)
- Select significant alignments and make profile
(weight matrix) using techniques for sequence
weighting and pseudo counts (see lecture on
HMMs)
- Use weight matrix to align against sequence
database to find new significant hits
- Repeat the previous two steps until a stop criterion is met
30PDB-BLAST
- Procedure
- Build sequence profile by iterative PSI-BLAST
search against a sequence database
- Use profile to search database of proteins with
known structure (PDB)
31Transitive BLAST
- Procedure
- Find homologues to query (your) sequence
- Find homologues to these homologues
- Etc.
- Can be implemented with e.g. BLAST or PSI-BLAST
- Also known as Intermediate Sequence Search (ISS)
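The transitive procedure amounts to a breadth-first walk over "is a significant hit of" relations. The sketch below assumes a caller-supplied function `find_homologues` (hypothetical; in practice it would wrap a BLAST or PSI-BLAST run and an E-value cutoff) and a made-up table of hits. It illustrates the idea only and is not the implementation of any particular ISS tool.

```python
from collections import deque
from typing import Callable, Dict, Iterable


def transitive_search(query: str,
                      find_homologues: Callable[[str], Iterable[str]],
                      max_rounds: int = 3) -> Dict[str, int]:
    """Return {sequence id: search round in which it was first found}."""
    found = {query: 0}
    frontier = deque([query])
    while frontier:
        current = frontier.popleft()
        if found[current] >= max_rounds:
            continue  # do not expand beyond the allowed number of rounds
        for hit in find_homologues(current):
            if hit not in found:
                found[hit] = found[current] + 1
                frontier.append(hit)
    return found


# Made-up hit lists: the template is only reachable through intermediates.
toy_hits = {
    "query": ["A"],
    "A": ["query", "B"],
    "B": ["A", "template"],
    "template": ["B"],
}
result = transitive_search("query", lambda seq_id: toy_hits.get(seq_id, []))
print(result)  # {'query': 0, 'A': 1, 'B': 2, 'template': 3}
```

Limiting the number of rounds matters because each extra step increases the risk of drifting to unrelated sequences.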
32Example: Sequence profiles
- Alignment of protein sequences 1PLC._ and 1GYC.A
- E-value 1000
- Profile alignment
- Align 1PLC._ against Swiss-Prot
- Make position specific weight matrix from
alignment
- Use this matrix to align 1PLC._ against 1GYC.A
- E-value
33Sequence profiles
Score = 97.1 bits (241), Expect = 9e-22
Identities = 13/107 (12%), Positives = 27/107 (25%), Gaps = 17/107 (15%)
1PLC._  3 ADDGSLAFVPSEFSISPGEKI------VFKNNAGFPHNIVFDEDSIPSGVDASKIS 56
                 F         G            N                 G
1GYC.A 26 ------VFPSPLITGKKGDRFQLNVVDTLTNHTMLKSTSIHWHGFFQAGTNWADGP 79
1PLC._ 57 MSEEDLLNAKGETFEVAL---SNKGEYSFYCSP--HQGAGMVGKVTV 98
                  A G  F          G              G  G   V
1GYC.A 80 AFVNQCPIASGHSFLYDFHVPDQAGTFWYHSHLSTQYCDGLRGPFVV 126
Rmsd = 3.3 Å. Model: red. Template: blue.
34Including structure
- Sequences within a protein superfamily share only
remote sequence homology, but they share high
structural homology
- Structure is known for template
- Predict structural properties for query
- Secondary structure
- Surface exposure
35Using structure
- Sequence and structure profile-profile based
alignments
- Template profiles
- Multiple structure alignments
- Sequence based profiles
- Query profile
- Sequence based profile
- Predicted secondary structure
- Position specific gap penalties derived from
secondary structure
36Structure biased alignment (3D-PSSM)
http://www.sbg.bio.ic.ac.uk/3dpssm/
37Threading
- Alignment score from structural fitness (pair potential)
- How well does K fit the environment at P6? If P8 is acidic then fine, if P8 is basic then poor
[Figure: sequence .. A T N L Y K E T L .. threaded onto template positions 1-10, with deletions and an insertion marked]
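The question asked on this slide (how well does K fit at P6, given what sits at P8) can be sketched with a toy pair potential. Everything numeric here is made up for illustration: the energies of -1, +1 and 0, the residue groupings, and the single contact between positions 6 and 8. The sketch also assumes an ungapped threading, so insertions and deletions are ignored. The names `pair_energy` and `threading_score` are hypothetical.

```python
ACIDIC = set("DE")
BASIC = set("KRH")


def pair_energy(res_a: str, res_b: str) -> float:
    """Toy contact energy (negative = favourable); values are illustrative only."""
    a_acid, a_base = res_a in ACIDIC, res_a in BASIC
    b_acid, b_base = res_b in ACIDIC, res_b in BASIC
    if (a_acid and b_base) or (a_base and b_acid):
        return -1.0  # opposite charges in contact
    if (a_acid and b_acid) or (a_base and b_base):
        return 1.0   # like charges in contact
    return 0.0


def threading_score(sequence: str, contacts) -> float:
    """Sum pair energies over template contacts.

    sequence[i - 1] is the residue placed at template position i;
    contacts is a list of (i, j) template position pairs that touch in 3D.
    """
    return sum(pair_energy(sequence[i - 1], sequence[j - 1]) for i, j in contacts)


template_contacts = [(6, 8)]  # assumed: positions P6 and P8 are in contact
print(threading_score("ATNLYKTE", template_contacts))  # K at P6, acidic E at P8
print(threading_score("ATNLYKTR", template_contacts))  # K at P6, basic R at P8
```

A real threading method sums such terms over all contacts of the template and searches over alignments, including gaps, for the lowest total energy.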
38Threading
- Threading does not work
- The average protein does not exist
- Threading can be used in combination with
sequence profiles and local structural features to
improve alignment
39CASP
- CASP
- Critical Assessment of protein Structure Prediction
- Every second year
- Sequences from about-to-be-solved-structures are
given to groups who submit their predictions
before the structure is published
- Modelers make prediction
- Meeting in December where correct answers are
revealed
40CASP5 overview
41Successful fold recognition groups at CASP5
- 3D-Jury (Leszek Rychlewski)
- 3D-CAM (Krzysztof Ginalski)
- Template recombination (Paul Bates)
- HMAP (Barry Honig)
- PROSPECT (Ying Xu)
- ATOME (Gilles Labesse)
42Democratic homology modeling
- Let the silent majority rule
- The highest score hit will often be wrong
- Many prediction methods will have the correct
fold among the top 10-20 hits
- If many different prediction methods all have
the same fold among the top hits, this fold is
probably correct
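The voting idea can be sketched as a simple count of how many methods list each fold among their top hits. This is a bare-bones illustration, not the scoring used by any of the consensus servers; the method names, fold labels and the function `consensus_fold` are hypothetical.

```python
from collections import Counter


def consensus_fold(predictions, top_n=10):
    """predictions: {method name: ranked list of fold ids, best first}.

    Returns (fold, votes) pairs sorted by number of supporting methods.
    """
    votes = Counter()
    for ranked_folds in predictions.values():
        # Each method votes once for every distinct fold among its top hits.
        votes.update(set(ranked_folds[:top_n]))
    return votes.most_common()


# Made-up example: every method has a different top hit,
# but all three agree on fold_A somewhere in their list.
toy = {
    "method_1": ["fold_X", "fold_A", "fold_B"],
    "method_2": ["fold_Y", "fold_A", "fold_C"],
    "method_3": ["fold_Z", "fold_B", "fold_A"],
}
ranking = consensus_fold(toy, top_n=3)
print(ranking[0])  # ('fold_A', 3)
```

None of the three top-ranked hits agree with each other, yet the fold supported by the majority is found.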
433D-Jury (Rychlewski)
- Inspired by ab initio modeling methods
- Average of frequently obtained low energy
structures is often closer to the native
structure than the lowest energy structure
- Find most abundant high scoring model in a list
of predictions from several predictors
- Use output from a set of servers
- Superimpose all pairs of structures
- Similarity score $S_{ij}$ = number of Cα pairs within 3.5 Å
(set $S_{ij} = 0$ if below 40)
- 3D-Jury score $S_i = \sum_j S_{ij} / (N + 1)$
- Similar methods developed by A Elofsson (Pcons)
and D Fischer (3D shotgun)
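The scoring step can be sketched as follows, assuming the pairwise similarities (number of Cα pairs within 3.5 Å after superposition) have already been computed. Two points are assumptions of this sketch rather than statements from the slide: N is taken to be the number of other models a model is compared with, and similarities below 40 are set to zero. The function name `jury_scores` and the toy matrix are illustrative.

```python
def jury_scores(similarity, threshold=40):
    """Compute a 3D-Jury style score for each model.

    similarity[i][j] is the number of C-alpha pairs within 3.5 Å
    after superimposing models i and j (precomputed elsewhere).
    """
    n_models = len(similarity)
    scores = []
    for i in range(n_models):
        total = 0
        n_compared = 0
        for j in range(n_models):
            if i == j:
                continue
            n_compared += 1
            s_ij = similarity[i][j]
            if s_ij >= threshold:  # similarities below the threshold count as 0
                total += s_ij
        scores.append(total / (n_compared + 1))
    return scores


# Made-up similarities: models 0-2 resemble each other, model 3 is an outlier.
toy_similarity = [
    [0, 90, 80, 10],
    [90, 0, 85, 20],
    [80, 85, 0, 30],
    [10, 20, 30, 0],
]
scores = jury_scores(toy_similarity)
best = max(range(len(scores)), key=scores.__getitem__)
print(scores, best)  # [42.5, 43.75, 41.25, 0.0] 1
```

The model that is most similar to the other models wins, regardless of the score its own server assigned to it.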
44LiveBench
- The LiveBench project is a continuous
benchmarking program. Every week sequences of
newly released PDB proteins are being submitted
to participating fold recognition servers. The
results are collected and continuously evaluated
using automated model assessment programs. A
summary of the results is produced after several
months of data collection. The servers must delay
the updating of their structural template
libraries by one week to participate
45Meta prediction server
- Web interface to a list of public protein
structure prediction servers
- Submit query sequence to all selected servers in
one go
- http://bioinfo.pl/meta/
46Meta Server
47Meta Server
483D Jury
49188 targets in total. Threshold for 5 false positives: 50 for 3D Jury
50Links to fold recognition servers
- Databases of links
- http://bioinfo.pl/meta/servers.html
- http://mmtsb.scripps.edu/cgi-bin/renderrelres?protmodel
- Meta server
- http://bioinfo.pl/meta/
- 3D-PSSM (good graphical output)
- http://www.sbg.bio.ic.ac.uk/servers/3dpssm/
- GenTHREADER
- http://bioinf.cs.ucl.ac.uk/psipred/
- FUGUE2
- http://www-cryst.bioc.cam.ac.uk/fugue/prfsearch.html
- SAM
- http://www.cse.ucsc.edu/research/compbio/HMM-apps/T99-query.html
- FOLD
- http://fold.doe-mbi.ucla.edu/
- FFAS/PDBBLAST
- http://bioinformatics.burnham-inst.org/
51From fold to structure
- Flying to the moon has not made man conquer
space
- Finding the right fold does not allow you to make
accurate protein models
- Can allow prediction of protein function
- Alignment is still a very hard problem
- Most protein interactions are determined by the
loops, and they are the least conserved parts of
a protein structure
52Ab initio protein modeling: Modeling of new fold proteins
- Only when everything else fails
- Challenge
- Close to impossible to model Nature's folding
potential
- Example
53Challenge. Folding potential
- New folds are in general constructed from a set
of subunits, where each subunit is a part of a
known fold.
- The subunits are small compared to the overall
fold of the protein. No objective function exists
to guide the global packing of the subunits.
Objective function
sij 120aa
dij 6Å
54A way to solution
- Glue the structure together piecewise from fragments.
- Guide the process by an empirical potential (potential
of mean force)
Fragments with correct local structure
Nature's potential
Empirical potential
55Examples (Rosetta web server) www.bioinfo.rpi.edu/bystrc/hmmstr/server.php
Rosetta prediction
Homology modeling
56Take home message
- Identifying the correct fold is only a small step
towards successful homology modeling
- Do not trust ID or alignment score to identify
the fold. Use p-values
- Use sequence profiles and local protein structure
to align sequences
- Do not trust one single prediction method, use
consensus methods (3D Jury)
- Only if everything else fails, use ab initio methods