Title: Protein Structural Prediction
1Protein Structural Prediction
2Structure Determines Function
The Protein Folding Problem
- What determines structure?
- Energy
- Kinematics
- How can we determine structure?
- Experimental methods
- Computational predictions
3Protein Structure Prediction
- ab initio
- Use just first-principles energy, geometry, and kinematics
- Homology
- Find the best match to a database of sequences with known 3D structure
- Threading
- Meta-servers and other methods
4Threading
MTYKLILN . NGVDGEWTYTE
Main difference between homology-based prediction and threading: threading uses the structure to compute the energy function during alignment
- Threading is the golden mean between homology-based prediction and molecular modeling (?)
5Threading Overview
- Build a structural template database
- Define a sequence-structure energy function
- Apply a threading algorithm to query sequence
- Perform local refinement of secondary structure
- Report best resulting structural model
6Threading Search Space
Protein Sequence X
Protein Structure Y
MTYKLILNGKTKGETTTEAVDAATAEKVFQYANDNGVDGEWTYTE
7Threading Template Database
- FSSP, SCOP, CATH
- Remove pairs of proteins with highly similar structures
- Efficiency
- Statistical skew in favor of large families
8Threading Energy Function
MTYKLILNGKTKGETTTEAVDAATAEKVFQYANDNGVDGEWTYTE
- $E_s$: how well a residue fits a structural environment
- $E_p$: how preferable it is to put two particular residues nearby
- $E_m$: how often a residue mutates to the template residue
- $E_g$: alignment gap penalty
- $E_{ss}$: compatibility with local secondary structure prediction
Total energy: $E = w_m E_m + w_s E_s + w_p E_p + w_g E_g + w_{ss} E_{ss}$
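The weighted sum on this slide can be sketched in a few lines. This is only an illustration: the class name, the field names, and the weight values are hypothetical, and the slide does not say how the weights are chosen.

```python
from dataclasses import dataclass


@dataclass
class EnergyTerms:
    """The five per-alignment energy terms named on the slide."""
    mutation: float    # E_m
    singleton: float   # E_s (environment fitness)
    pairwise: float    # E_p
    gap: float         # E_g
    sec_struct: float  # E_ss


def total_energy(t: EnergyTerms, w: dict) -> float:
    """Weighted sum w_m*E_m + w_s*E_s + w_p*E_p + w_g*E_g + w_ss*E_ss."""
    return (w["m"] * t.mutation + w["s"] * t.singleton + w["p"] * t.pairwise
            + w["g"] * t.gap + w["ss"] * t.sec_struct)


# Hypothetical weights, chosen only so the example runs.
weights = {"m": 1.0, "s": 1.0, "p": 1.0, "g": 1.0, "ss": 1.0}
print(total_energy(EnergyTerms(1.0, 2.0, 3.0, 4.0, 5.0), weights))
```

Keeping the terms separate until the final sum makes it easy to re-weight them without recomputing the alignment.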
9Threading Formulation
- Contact graph captures amino acid interactions
- Cores represent important local structure units
- No gaps within each core
[Figure: contact graph linking residues (x, y, z, u, v) in cores C1 to C4, and the placement of each core along the query sequence]
10Threading Formulation
CMG (v, ?)
11Threading Formulation
12Threading Search Space
Protein Sequence X
Protein Structure Y
How Hard is Threading?
MTYKLILNGKTKGETTTEAVDAATAEKVFQYANDNGVDGEWTYTE
CORES
13How Hard is Threading?
- At least as hard as MAX-CUT
- MAX-CUT: Given a graph G = (V, E), find a cut (S, T) of V with the maximum number of edges between S and T.
- The Bad News: APX-complete even when each node has at most B edges (where B > 2)
14Reduction of MAX-CUT to Threading
Sequence: 0 1 0 1 0 1 0 1 0 1 0 1 0 1
Cores: v1 v2 v3 v4 v5 v6 v7
Sequence consists of |V| 01-pairs
- |V| cores, each core i has length 1 and corresponds to vi
- Let Ep(0,1) = 1: every edge labeled 0-1 or 1-0 gets a score of 1
- Then, size of cut = threading score
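A small brute-force check makes the correspondence concrete. The sketch below assumes that core i may only be placed on one of the two positions of the i-th 01-pair (which keeps the cores in order); that restriction is my reading of the slide, and the example graph is made up.

```python
from itertools import product


def max_cut(n, edges):
    """Brute-force MAX-CUT: try every assignment of vertices to side 0 or 1."""
    best = 0
    for side in product((0, 1), repeat=n):
        best = max(best, sum(side[u] != side[v] for u, v in edges))
    return best


def best_threading_score(n, edges):
    """Best threading of n length-1 cores onto the sequence 0101...01."""
    sequence = "01" * n  # |V| 01-pairs
    best = 0
    # Assumption: core i lands on position 2*i or 2*i + 1 (inside its own pair).
    for offsets in product((0, 1), repeat=n):
        labels = [sequence[2 * i + off] for i, off in enumerate(offsets)]
        # Ep(0,1) = 1: an edge scores 1 when its two cores sit on different symbols.
        score = sum(1 for u, v in edges if labels[u] != labels[v])
        best = max(best, score)
    return best


# Hypothetical graph: a 4-cycle with one chord.
edges = [(0, 1), (1, 2), (2, 3), (3, 0), (0, 2)]
print(max_cut(4, edges), best_threading_score(4, edges))
```

Choosing the 0 or the 1 of a pair plays the role of putting the vertex in S or T, so the two searches range over the same set of labelings.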
15Threading with Branch & Bound
- Set of solutions can be partitioned into subsets (branch)
- Upper limit on a subset's solution can be computed fast (bound)
- Branch & Bound
- Select subset with best possible bound
- Subdivide it, and compute a bound for each subset
16Threading with Branch & Bound
- Key to this algorithm is the tradeoff on the lower bound
- efficient
- tight
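To illustrate the select/subdivide/bound loop, here is a minimal best-first sketch on a simplified placement problem: cores are placed at strictly increasing positions, each placement has a cost, and consecutive cores pay a non-negative gap cost. The cost tables and function names are hypothetical, and the bound used (sum of the cheapest single-core costs for the unplaced cores) is deliberately cheap rather than tight.

```python
import heapq


def branch_and_bound(single, pair):
    """Minimize total cost of placing cores in order.

    single[i][k]: cost of core i at position k.
    pair(i, k, l) >= 0: cost of core i at k followed by core i + 1 at l.
    """
    n, m = len(single), len(single[0])
    # rest[i]: optimistic cost of cores i..n-1 (ignores pair costs and ordering).
    rest = [0.0] * (n + 1)
    for i in range(n - 1, -1, -1):
        rest[i] = rest[i + 1] + min(single[i])
    heap = [(rest[0], 0.0, ())]  # (lower bound, cost so far, positions chosen)
    while heap:
        bound, cost, placed = heapq.heappop(heap)  # subset with the best bound
        i = len(placed)
        if i == n:
            return cost, placed  # no open subset can beat a complete solution
        start = placed[-1] + 1 if placed else 0
        for k in range(start, m):  # branch: every position for the next core
            c = cost + single[i][k]
            if placed:
                c += pair(i - 1, placed[-1], k)
            heapq.heappush(heap, (c + rest[i + 1], c, placed + (k,)))
    return None


# Hypothetical costs: 3 cores, 5 sequence positions, linear gap penalty.
single = [[3, 1, 4, 1, 5], [9, 2, 6, 5, 3], [5, 8, 9, 7, 0]]
gap = lambda i, k, l: 0.5 * (l - k - 1)
print(branch_and_bound(single, gap))
```

A tighter bound would prune more subsets but cost more per node, which is the tradeoff the slide points at.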
17Threading with Integer Programming
Linear Program:
maximize $z = 6x + 5y$ (linear function)
subject to $3x + y \le 11$, $-x + 2y \le 5$, $x, y \ge 0$ (linear constraints)
Integer Program: the same objective and constraints, plus integral constraints (nonlinear): $x, y \in \{0, 1\}$
RAPTOR: integer programming-based threading; perhaps the best protein threading system
18Threading with Integer Programming
- x(i,k) denotes that core i is aligned to sequence position k
- y(i,k,j,l) denotes that core i is aligned to position k and core j is aligned to position l
- D(i): all positions where core i can be aligned
- R(i, j, k): set of possible alignments of core j, given that core i aligns to position k
- core_i = (head_i, tail_i, length_i = tail_i - head_i + 1)
19Threading with Integer Programming
Cores are aligned in order
Each y variable is 1 if and only if its two x variables are 1; x and y represent exactly the same threading
Each core has only one alignment position
20Energy Function is Linear
- Sequence substitution score
- Fitness of aa in each position (example: hydrophobicity)
- Agreement with secondary structure prediction
- Pairwise interaction between two cores
- Gap between two successive cores
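One way the integer program might be written down, using the x, y, D and R notation of the previous slides. The coefficient names $a_{i,k}$ (all single-core terms for core i at position k) and $b_{i,k,j,l}$ (pairwise and gap terms) are hypothetical, and this is a sketch of the general shape rather than RAPTOR's exact formulation; the in-order requirement is assumed to be encoded in R(i, j, k).

```latex
\begin{aligned}
\min \quad & \sum_{i}\sum_{k \in D(i)} a_{i,k}\, x_{i,k}
  \;+\; \sum_{i<j}\sum_{k \in D(i)}\sum_{l \in R(i,j,k)} b_{i,k,j,l}\, y_{i,k,j,l} \\
\text{s.t.} \quad
  & \sum_{k \in D(i)} x_{i,k} = 1 \quad \text{for every core } i, \\
  & \sum_{l \in R(i,j,k)} y_{i,k,j,l} = x_{i,k} \quad \text{for all } i<j,\ k \in D(i), \\
  & \sum_{k \in R(j,i,l)} y_{i,k,j,l} = x_{j,l} \quad \text{for all } i<j,\ l \in D(j), \\
  & x_{i,k},\ y_{i,k,j,l} \in \{0, 1\}.
\end{aligned}
```

Because every energy term is attached to a single x or a single y variable, the objective is linear in the variables, which is what allows the LP machinery to be used.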
21LP Relaxation and (again) Branch & Bound
1. Relax the integral constraint to x(i,k), y(i,k,j,l) ≥ 0
2. Solve the LP using a standard method (RAPTOR uses IBM's OSL)
3. If the resulting solution is integral, done
4. Else, select one non-integral variable (heuristically), and generate two subproblems by setting it to 0 and to 1; use Branch & Bound
- In practice, in RAPTOR only 1% of the instances in the test database required step 4; almost all solutions are integral!
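The relax/solve/branch loop can be tried on a two-variable toy program. The sketch below assumes the toy problem is maximize 6x + 5y subject to 3x + y ≤ 11, -x + 2y ≤ 5, x, y ≥ 0 with x and y general non-negative integers, so it branches on floor and ceiling instead of on 0 and 1. The tiny `solve_lp` helper is a hypothetical stand-in for a real LP solver and only works for two variables and a bounded feasible region.

```python
from fractions import Fraction
from itertools import combinations
from math import floor


def solve_lp(cons, c):
    """Maximize c[0]*x + c[1]*y subject to a*x + b*y <= r for (a, b, r) in cons.

    Checks every intersection of two constraint lines (the candidate vertices).
    """
    best = None
    for (a1, b1, r1), (a2, b2, r2) in combinations(cons, 2):
        det = a1 * b2 - a2 * b1
        if det == 0:
            continue  # parallel lines, no vertex
        x = Fraction(r1 * b2 - r2 * b1, det)
        y = Fraction(a1 * r2 - a2 * r1, det)
        if all(a * x + b * y <= r for a, b, r in cons):
            z = c[0] * x + c[1] * y
            if best is None or z > best[0]:
                best = (z, x, y)
    return best  # None if infeasible


def lp_branch_and_bound(cons, c):
    best = None
    stack = [list(cons)]
    while stack:
        cur = stack.pop()
        sol = solve_lp(cur, c)          # solve the relaxation
        if sol is None or (best is not None and sol[0] <= best[0]):
            continue                     # infeasible, or bound cannot beat incumbent
        z, x, y = sol
        frac = [(i, v) for i, v in enumerate((x, y)) if v.denominator != 1]
        if not frac:
            best = sol                   # integral solution: new incumbent
            continue
        i, v = frac[0]                   # branch on the first fractional variable
        unit = (1, 0) if i == 0 else (0, 1)
        lo = floor(v)
        stack.append(cur + [(unit[0], unit[1], lo)])            # var <= floor(v)
        stack.append(cur + [(-unit[0], -unit[1], -(lo + 1))])   # var >= floor(v) + 1
    return best


toy = [(3, 1, 11), (-1, 2, 5), (-1, 0, 0), (0, -1, 0)]
print(solve_lp(toy, (6, 5)))             # fractional LP optimum
print(lp_branch_and_bound(toy, (6, 5)))  # integral optimum
```

On this toy problem the relaxation is fractional, so branching is needed; the slide's observation is that for RAPTOR's threading instances the relaxation is usually integral already.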
22CAFASP
- GOAL
- The goal of CAFASP is to evaluate the performance
of fully automatic structure prediction servers
available to the community. In contrast to the
normal CASP procedure, CAFASP aims to answer the
question of how well servers do without any
intervention of experts, i.e. how well ANY user
using only automated methods can predict protein
structure. CAFASP assesses the performance of
methods without the user intervention allowed in
CASP.
23Performance Evaluation in CAFASP3
Servers with names in italics are meta-servers.
MaxSub score ranges from 0 to 1. Therefore, the maximum total score is 30.
(http://www.cs.bgu.ac.il/dfischer/CAFASP3, released in December 2002.)
24One structure where RAPTOR did best
Red: true structure. Blue: correct part of prediction. Green: wrong part of prediction.
- Target size: 144
- Super-imposable size within 5 Å: 118
- RMSD: 1.9
25Some more results by other programs
26Some more results by other programs
27Some more results by other programs
28Structural Motifs
beta helix
beta barrel
beta trefoil
29Structural Motif Recognition
- Secondary Structure Prediction
- Find the α helices, β sheets, loops in a protein sequence
- Given an amino acid residue sequence, does it fold as a
- Coiled Coil?
- β helix?
- β barrel?
- Zinc finger?
- Intermediate goals towards folding
- Useful information about the function of a protein
- More amenable to sequence analysis than full fold prediction
30Structural Motif Recognition
- Collect a database of known motifs and corresponding amino acid subsequences
- Devise a method/model to match a new sequence to existing motif database
- Verify computationally on a test set (divide database into training and testing subsets)
- Verify in lab
31Structural Motif Recognition Methods
- Alignment
- Neural Nets
- Hidden Markov Models
- Threading
- Profile-based Methods
- Other Statistical Methods
32Predicting Coiled Coils
33Predicting Coiled Coils
- NewCoils: multiply the probabilities (frequencies) of the residues at each coiled-coil position
34Predicting Coiled Coils
- PairCoil: multiply pairwise probabilities of spatially neighboring positions
- Use a sliding window of length 28
- Perfect score separation between true and false examples (false: non-coiled-coil α helices)
- Berger et al. PNAS 1995
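A sketch of the single-position idea with a sliding window of 28 residues: each window is scored in all seven heptad registers by multiplying per-position residue probabilities (done here as a sum of logs). The probability table is made up for illustration and is not the published one, and the pairwise terms that PairCoil adds are not shown.

```python
import math

HEPTAD = "abcdefg"
HYDROPHOBIC = set("LIVMA")


def residue_prob(pos, aa):
    """Toy probabilities (NOT real data): hydrophobic residues favored at a and d."""
    if pos in "ad":
        return 0.15 if aa in HYDROPHOBIC else 0.02
    return 0.03 if aa in HYDROPHOBIC else 0.06


def window_score(window, register):
    """Log of the product of per-position probabilities for one heptad register."""
    return sum(math.log(residue_prob(HEPTAD[(register + i) % 7], aa))
               for i, aa in enumerate(window))


def scan(seq, width=28):
    """Best-register score for every window of the given width."""
    scores = []
    for start in range(len(seq) - width + 1):
        window = seq[start:start + width]
        scores.append(max(window_score(window, r) for r in range(7)))
    return scores


coil_like = "LKELEDK" * 4   # hydrophobic at heptad positions a and d
print(scan(coil_like)[0], scan("G" * 28)[0])
```

Working in log space avoids underflow when 28 small probabilities are multiplied together.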
35Predicting β helices
- Helix composed of three parallel β sheets
- Very few solved structures, very different from one another
- Absent in eukaryotes!
- Probably evolved subsequent to prok/euk split
36Predicting β helices
- Only available program: BetaWrap
- The rungs subproblem
- Given the location of a T2 turn of one rung, find the location of the T2 turn of the next rung
- Distribution of turn lengths
- Bonus/penalty for stacked pairs in the parallel strands
- Discard if highly charged residues in the inward-pointing positions of the β strand
- From a rung to multiple rungs
- Find multiple initial B2-T2-B3 rungs
- Use sequence template based on hydrophobicity to find many candidate rungs
- Find optimal wrap by DP heuristic score, based on 5 consistent rungs
- Completing the parse
- Find B1 strands by locally optimizing their location
37Predicting β helices
- BetaWrap gives scores that separate true from false β helices
- Bradley et al. PNAS 2001
38Predicting β trefoils
http://betawrappro.csail.mit.edu/
Similar idea: use a combination of domain-specific expert knowledge with statistics
WRAP-AND-PACK
WRAP: Search for antiparallel β strands to wrap a cap
PACK: Place the side chains in the interior of the wrapped β strands
39Predicting Secondary Structure
- Given amino acid sequence, classify positions into α helices, β strands, or loops
- In general, harder than protein motif identification
- Best methods rely on Neural Networks
- Similarly good separation can be achieved by SVMs
- PSIPRED
- Given a sequence x, generate profile using PSI-BLAST
- Pass the profile to a pre-trained NN
- Output classification: α helix / β strand / loops
40PSIPRED
Profile M
- Training & Testing
- Start with database of determined folds (< 1.87 Å)
- Remove redundancy: any pair of proteins with high similarity (found by PSI-BLAST); 187 remaining proteins
- 3-fold cross validation
- 76% classification accuracy
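The evaluation protocol can be sketched as follows, assuming "classification accuracy" means the fraction of residues whose predicted state (H, E or C) matches the observed one. The function names and the simple round-robin fold assignment are illustrative choices, not the procedure used for PSIPRED.

```python
def per_residue_accuracy(predicted: str, observed: str) -> float:
    """Fraction of positions where the predicted state equals the observed state."""
    if len(predicted) != len(observed):
        raise ValueError("prediction and observation must have the same length")
    return sum(p == o for p, o in zip(predicted, observed)) / len(observed)


def three_fold_splits(items):
    """Yield (train, test) pairs; each item is in the test set exactly once."""
    folds = [items[i::3] for i in range(3)]
    for i in range(3):
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, folds[i]


print(per_residue_accuracy("CCHHE", "CCHHC"))
proteins = list(range(187))  # stand-ins for the 187 non-redundant proteins
for train, test in three_fold_splits(proteins):
    print(len(train), len(test))
```

The redundancy removal described above matters here: without it, near-identical proteins could end up on both sides of a split and inflate the measured accuracy.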
41PSIPRED server
PSIPRED PREDICTION RESULTS
Conf: Confidence (0=low, 9=high)
Pred: Predicted secondary structure (H=helix, E=strand, C=coil)
AA: Target sequence

PSIPRED HFORMAT (PSIPRED V2.3 by David Jones)
Conf: 9888788777656877765688766579
Pred: CCCCCCCCCCCCCCCCCCCCCCCCCCCC
  AA: PEPTIDEPEPTIDEPEPTIDEPEPTIDE

PSIPRED HFORMAT (PSIPRED V2.3 by David Jones)
Conf: 998888872100111210012112359
Pred: CCCCCCCCCCCHHHHHHHHCCCCCCCC
  AA: PTYPTYPTXXXXXXXXXXXXTEETEET

PSIPRED HFORMAT (PSIPRED V2.3 by David Jones)
Conf: 91025687432236422336410232027743223653334679
Pred: CCCCCCCCCCCCCCCCCCCCCCCEEEECCCCCCCCCCCCCCCCC
  AA: THISISAPRXTEINSEQXENCETHISISAPRXTEINSEQXENCE
42TRILOGY: Sequence-Structure Patterns
- Identify short sequence-structure patterns of 3 amino acids
- Find statistically significant ones (hypergeometric distribution)
- Correct for multiple trials
- These patterns may have structural or functional importance
- Pseq: R1 x(a-b) R2 x(c-d) R3
- Pstr: 3 Cα-Cα distances, 3 Cα-Cβ vectors
- Start with short patterns of 3 amino acids
- {V, I, L, M}, {F, Y, W}, {D, E}, {K, R, H}, {N, Q}, {S, T}, {A, G, S}
- Extend to longer patterns
- Bradley et al. PNAS 99:8500-8505, 2002
43TRILOGY
44TRILOGY Extension
Glue together two 3-aa patterns that overlap in 2
amino acids
$\text{P-score} = \sum_{i=M_{pat}}^{\min(M_{seq}, M_{str})} \binom{M_{seq}}{i} \binom{T - M_{seq}}{M_{str} - i} \binom{T}{M_{str}}^{-1}$
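The P-score is a hypergeometric tail probability and can be computed exactly with integer binomials. In the sketch below the parameter meanings are my reading of the symbols, not stated on the slide: T is the total number of residue triplets considered, Mseq and Mstr the numbers matching the sequence and structure parts of the pattern, and Mpat the number matching both.

```python
from fractions import Fraction
from math import comb


def p_score(T: int, m_seq: int, m_str: int, m_pat: int) -> Fraction:
    """Probability of at least m_pat joint matches under the hypergeometric null."""
    total = comb(T, m_str)
    tail = sum(comb(m_seq, i) * comb(T - m_seq, m_str - i)
               for i in range(m_pat, min(m_seq, m_str) + 1))
    return Fraction(tail, total)


# Small made-up numbers, only to show the call.
print(p_score(10, 4, 3, 3))         # all 3 structure matches fall among the 4 sequence matches
print(float(p_score(10, 4, 3, 1)))  # at least one joint match
```

Using `Fraction` keeps the result exact; for realistic database sizes the probabilities are tiny, so a log-space or floating-point version would be the practical choice, followed by the multiple-trials correction mentioned earlier.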
45TRILOGY Longer Patterns
β-α-β unit found in three proteins with the TIM-barrel fold
NAD/FAD binding motif found in several folds
Type-II β turn between unpaired β strands
Helix-hairpin-helix DNA-binding motif
A β-hairpin connected with a crossover to a third β-strand
Three strands of an anti-parallel β-sheet
A fold with repeated aligned β-sheets
Four Cysteines forming 4 S-S disulfide bonds
46Small Libraries of Structural Fragments for
Representing Protein Structures
47Fragment Libraries For Structure Modeling
predicted structure
known structures
48Small Libraries of Protein Fragments
- Kolodny, Koehl, Guibas, Levitt, JMB 2002
- Goal
- Small alphabet of protein structural fragments that can be used to represent any structure
- Generate fragments from known proteins
- Cluster fragments to identify common structural motifs
- Test library accuracy on proteins not in the initial set
49Small Libraries of Protein Fragments
- Dataset: 200 unique protein domains with most reliable distinct structures from SCOP
- 36,397 residues
- Divide each protein domain into consecutive fragments beginning at random initial position
- Library: four sets of backbone fragments
- 4, 5, 6, and 7-residue long fragments
- Cluster the resulting small structures into k clusters using cRMS, and applying k-means clustering with simulated annealing
- Cluster with k-means
- Iteratively break & join clusters with simulated annealing to optimize total variance Σ(x − µ)²
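The plain k-means step and the total-variance objective Σ(x − µ)² can be sketched as below. This is a simplification under stated assumptions: points are fixed-length coordinate tuples compared by Euclidean distance, whereas the library construction compares fragments by cRMS, and the simulated-annealing break/join moves are not shown.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean(points):
    """Coordinate-wise mean of a non-empty list of tuples."""
    n = len(points)
    return tuple(sum(col) / n for col in zip(*points))


def k_means(points, k, iterations=50):
    centers = list(points[:k])  # naive deterministic start (an assumption)
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:  # assignment step: nearest center
            nearest = min(range(k), key=lambda j: sq_dist(p, centers[j]))
            clusters[nearest].append(p)
        # update step: move each center to the mean of its cluster
        new_centers = [mean(c) if c else centers[j] for j, c in enumerate(clusters)]
        if new_centers == centers:
            break
        centers = new_centers
    total_variance = sum(sq_dist(p, centers[j])
                         for j, c in enumerate(clusters) for p in c)
    return centers, clusters, total_variance


# Toy 2-D points standing in for fragments.
pts = [(0, 0), (0, 1), (10, 10), (10, 11)]
centers, clusters, variance = k_means(pts, 2)
print(centers, variance)
```

Plain k-means only finds a local optimum of the total variance, which is presumably why the slide adds the annealing moves on top of it.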
50Evaluating the Quality of a Library
- Test set of 145 highly reliable protein structures (Park & Levitt)
- Protein structures broken into set of overlapping fragments of length f
- Find for each protein fragment the most similar fragment in the library (cRMS)
- Local Fit: average cRMS value over all fragments in all proteins in the test set
- Global Fit: find best composition of structure out of overlapping fragments
- Complexity is O(|Library|^N)
- Greedy approach extends the C best structures so far from posn 1 to N
51Results