Title: An Optimization Approach to Protein Structure Prediction
1 An Optimization Approach to Protein Structure Prediction
- Richard Byrd, Betty Eskow, Robert Schnabel, Brett Bader, Lianjun Jiang (University of Colorado)
- Teresa Head-Gordon (Univ. of California, Berkeley)
- Silvia Crivelli (Lawrence Berkeley Laboratory)
2 Problem Definition
Predict the 3-dimensional shape, or native
state, of a protein given its sequence of
constituent amino acids.
Approach
Assuming the native state of a protein
corresponds to its minimum free energy state, use
a global optimization method to find the minimum
energy configuration of the target protein.
3 Importance of Protein Folding
- 3-dimensional structure is useful in molecular drug design.
- Laboratory experiments to determine structure are expensive:
  - X-ray crystallography
  - NMR
- Genome projects are providing sequences for many proteins whose structures will need to be determined.
4 Protein Structures
Proteins consist of a long chain of amino acids,
called the primary structure.
(Figure: a chain of amino acids, e.g. Pro, Gly, Leu, Ser.)
The constituent amino acids may encourage
hydrogen bonding and form regular structures,
called secondary structures.
(Figure: α-helix and β-sheet.)
The secondary structures fold together to form a
compact 3-dimensional, or tertiary, structure.
5 Chemistry of Proteins
(Figure labels: side chain, amino acid, backbone, H-bond.)
Hydrogen bonds strongly influence a protein's
shape. They largely occur in secondary
structures and help hold the protein together.
6 Computational Approaches to Protein Structure Prediction
- Comparative Modeling
  - Compares and aligns to a known protein sequence of amino acids
- Fold Recognition
  - Searches for the best fitting fold template from a library of known protein folds
- New Fold Methods
  - Not based on knowledge of complete protein sequences or folds
  - e.g. energy minimization
7 Global Optimization Problem
The 3-dimensional structure of the protein found
in nature is believed to minimize potential
energy: $\min_x V(x)$, where $x$ = atom coordinates.
Challenges
8 Amber Energy Function
$$
V(x) = \sum_{\text{bonds}} c_l\,(b - b_0)^2
     + \sum_{\text{bond angles}} c_a\,(\theta - \theta_0)^2
     + \sum_{\text{dihedral angles}} c_d\,\left[ 1 + \cos(n\omega - \gamma) \right]
     + \sum_{\text{charged pairs}} \frac{q_i q_j}{\varepsilon\, r_{ij}}
     + \sum_{\text{nonbonded pairs}} c_w\,\varphi(r_{ij})
$$
where $b$ is a bond length, $\theta$ a bond angle, $\omega$ a dihedral angle with phase $\gamma$, $q_i$ and $q_j$ atomic charges, $r_{ij}$ the distance between atoms $i$ and $j$, and $\varphi$ the Lennard-Jones potential.
Internal coordinates are determined using bonds,
bond angles and dihedral angles.
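The individual terms of such an energy function can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: every bond, angle and dihedral shares one force constant and one reference value (real force fields tabulate these per atom type), the phase `gamma` and the 12-6 form with parameters `epsilon` and `sigma` are the textbook choices rather than values taken from the slides, and all function names are hypothetical.

```python
import math

def bond_energy(lengths, b0, c_l):
    """Harmonic bond-stretch term: sum of c_l * (b - b0)^2."""
    return sum(c_l * (b - b0) ** 2 for b in lengths)

def angle_energy(angles, theta0, c_a):
    """Harmonic bond-angle term: sum of c_a * (theta - theta0)^2."""
    return sum(c_a * (t - theta0) ** 2 for t in angles)

def dihedral_energy(dihedrals, c_d, n, gamma=0.0):
    """Periodic torsion term: sum of c_d * (1 + cos(n * omega - gamma))."""
    return sum(c_d * (1.0 + math.cos(n * w - gamma)) for w in dihedrals)

def lennard_jones(r, epsilon=1.0, sigma=1.0):
    """Textbook 12-6 Lennard-Jones potential (assumed form)."""
    s6 = (sigma / r) ** 6
    return 4.0 * epsilon * (s6 * s6 - s6)

# Toy usage with made-up numbers (not real force-field parameters).
total = (bond_energy([1.0, 1.1], b0=1.0, c_l=2.0)
         + dihedral_energy([0.0], c_d=1.5, n=3)
         + lennard_jones(1.2))
```

The bonded terms depend only on internal coordinates, while the nonbonded terms need pairwise distances, which is why the nonbonded sums tend to dominate the cost for large proteins.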
9 Additional energy terms to model protein behavior in an aqueous environment
- Formulated from simulations of pairs of hydrophobic molecules in water
- $E_{\text{SOLVATION}}$
- Advantages of this model
  - Provides stabilizing force for forming hydrophobic cores.
  - Well defined model of the hydrophobic effect of small hydrophobic groups in water.
  - Computationally tractable and differentiable
In $E_{\text{SOLVATION}}$, $i, j$ are aliphatic carbons; $M$ Gaussians with
position ($c_k$), depth ($h_k$) and width ($w_k$) describe
two minima: (1) molecules in contact and
(2) molecules separated by a distance of one water
molecule.
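One way to read the description above is as a sum of Gaussians in the carbon-carbon distance. The sketch below assumes the form h_k * exp(-((r - c_k) / w_k)^2) summed over k and over all pairs; the exact functional form and the parameter values are not given in the slides, so both the formula and the numbers in `GAUSSIANS` are illustrative placeholders, and the function names are hypothetical.

```python
import math
from itertools import combinations

# Placeholder (depth h_k, position c_k, width w_k) triples: a deeper "contact"
# well and a shallower "solvent-separated" well. Illustrative, not fitted values.
GAUSSIANS = [(-1.0, 4.0, 0.5), (-0.5, 7.0, 0.5)]

def pair_solvation(r, gaussians=GAUSSIANS):
    """Assumed pair term: sum over k of h_k * exp(-((r - c_k) / w_k)^2)."""
    return sum(h * math.exp(-((r - c) / w) ** 2) for h, c, w in gaussians)

def solvation_energy(carbons, gaussians=GAUSSIANS):
    """Sum the pair term over all pairs (i, j) of aliphatic-carbon coordinates."""
    return sum(pair_solvation(math.dist(a, b), gaussians)
               for a, b in combinations(carbons, 2))

# Two carbons at the contact distance sit in the first well.
energy = solvation_energy([(0.0, 0.0, 0.0), (4.0, 0.0, 0.0)])
```

Because each term is a smooth function of distance, its gradient is available in closed form, which matches the "tractable and differentiable" advantage listed above.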
10 Global Optimization Approaches
- Deterministic methods
  - Branch and bound, interval methods
  - Very reliable, deterministic guarantees
  - Too expensive for more than 20-50 variables
- Stochastic methods
  - Random steps or sampling
  - Probabilistic guarantees
  - Practical for < 300 variables
- Heuristic search
  - e.g. simulated annealing, tabu search, genetic algorithms
  - Effective on some very large problems
  - No practical guarantees
11 A Stochastic-Perturbation Global Optimization Approach
- Generate and maintain a pool of candidates (configurations), as in genetic algorithms.
- Solve the full-dimensional problem as a series of small-dimensional ones.
- Use protein database information to bias toward likely substructures.
12 Algorithm Phases
Given the amino acid sequence of a protein, find
the 3-dimensional structure likely to be found in
nature.
- Phase 1: Generate Initial Population (simplify problem by utilizing domain-specific knowledge)
- Phase 2: Global Optimization
13 Phase 1 Create Initial Population
- Submit amino acid sequence to server
  EFIAIYDYKAETEEDLTIKKGEKLEIIEKEGDWWKAKAIGSGEIGYIPANYIAAAE
- Use server predictions to determine the location of α-helices, β-strands, and coils
  CCCCHHHHHHEEEEEEEEEEEECCEEEEEEEEEEEHHHHHHHHCCCHHHHHHCCCC
- Use ProteinShop visualization tool to form configurations with secondary structure
  - Assign ideal values to the dihedral angles in the sequence according to the predictions.
  - Manipulate β-strands to form β-sheets.
- Perform energy minimizations
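The step of assigning ideal dihedral values from a predicted secondary-structure string can be sketched as a simple lookup. The (phi, psi) values below are common textbook values for an α-helix and an extended β-strand, not numbers taken from the slides, and the names `IDEAL_DIHEDRALS` and `initial_dihedrals` are hypothetical; coil residues are left unassigned.

```python
# Assumed textbook (phi, psi) values in degrees; the slides do not list them.
IDEAL_DIHEDRALS = {
    "H": (-57.0, -47.0),   # alpha-helix
    "E": (-120.0, 120.0),  # extended beta-strand
}

def initial_dihedrals(prediction):
    """Map each predicted state (H, E or C) to ideal (phi, psi), or None for coil."""
    return [IDEAL_DIHEDRALS.get(state) for state in prediction]

# Usage on a short, made-up prediction string.
angles = initial_dihedrals("CCHHHHEEEC")
```

Residues mapped to `None` would keep whatever starting values the modeling tool gives them, leaving the later optimization phases free to move them.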
14 Phase 2 Improve Local Minima
- Select a protein and a subset of dihedral angles
  - Uses a combination of breadth-first and depth-first searches from initial pool
  - Dihedral angles act as internal coordinates and reduce the number of variables, speeding an optimization run
- Small-scale global optimization
- Full-dimensional local optimization
- Iterate
- Cluster minima and test stopping criteria
15 Small Scale Global Optimization in Phase 2
- Minimize energy over 5-20 torsion angles
- Use a stochastic global optimization algorithm based on sampling, sample pruning and local minimization (Rinnooy Kan et al.).
- From best start points, do local minimizations using quasi-Newton
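The sample, prune, and locally minimize pattern described above can be illustrated on a toy problem. This is a minimal sketch, not the authors' algorithm: the energy is a made-up periodic function with its global minimum at 1.0 radian in every angle, the local minimizer is a crude derivative-free pattern search standing in for quasi-Newton, and the example uses 2 angles rather than the 5-20 mentioned on the slide. All names are hypothetical.

```python
import math
import random

def toy_energy(angles):
    """Made-up multimodal torsion energy; global minimum 0 at every angle = 1.0."""
    return sum((1 - math.cos(t - 1.0)) + 0.5 * (1 - math.cos(3 * (t - 1.0)))
               for t in angles)

def local_minimize(f, x, step=0.1, iters=200):
    """Crude coordinate pattern search (stand-in for a quasi-Newton method)."""
    x, fx = list(x), f(x)
    for _ in range(iters):
        improved = False
        for i in range(len(x)):
            for delta in (step, -step):
                trial = x[:]
                trial[i] += delta
                ft = f(trial)
                if ft < fx:
                    x, fx, improved = trial, ft, True
        if not improved:
            step *= 0.5          # refine the step once no move helps
            if step < 1e-6:
                break
    return x, fx

def small_scale_global(f, dim, n_samples=400, keep=0.1, seed=0):
    """Sample uniformly, keep the best fraction, minimize from each survivor."""
    rng = random.Random(seed)
    samples = [[rng.uniform(-math.pi, math.pi) for _ in range(dim)]
               for _ in range(n_samples)]
    samples.sort(key=f)                              # prune: lowest energies first
    starts = samples[:max(1, int(keep * n_samples))]
    return min((local_minimize(f, s) for s in starts), key=lambda m: m[1])

best_angles, best_energy = small_scale_global(toy_energy, dim=2)
```

Pruning before local minimization is what keeps the cost manageable: only the most promising sample points pay for a local search.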
16 Full-scale local minimizations
- Using best points from small-scale global, do local minimizations.
- Because of problem size we use limited-memory quasi-Newton.
- Best local minimizers are added to pool.
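To show why a limited-memory quasi-Newton method suits large problems, here is a small pure-Python sketch of L-BFGS using the standard two-loop recursion with a backtracking line search. It stores only the last few step and gradient-difference pairs instead of a full Hessian approximation. This is a generic textbook version, not the authors' code; the test function, the memory size and all names are assumptions.

```python
import math

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def lbfgs_direction(g, s_hist, y_hist):
    """Two-loop recursion: approximate -H^{-1} g from the stored (s, y) pairs."""
    q, alphas = list(g), []
    for s, y in zip(reversed(s_hist), reversed(y_hist)):   # newest to oldest
        a = dot(s, q) / dot(y, s)
        alphas.append(a)
        q = [qi - a * yi for qi, yi in zip(q, y)]
    if s_hist:                                             # initial Hessian scaling
        scale = dot(s_hist[-1], y_hist[-1]) / dot(y_hist[-1], y_hist[-1])
        q = [scale * qi for qi in q]
    for s, y, a in zip(s_hist, y_hist, reversed(alphas)):  # oldest to newest
        b = dot(y, q) / dot(y, s)
        q = [qi + (a - b) * si for qi, si in zip(q, s)]
    return [-qi for qi in q]

def lbfgs_minimize(f, grad, x0, memory=5, max_iter=200, tol=1e-8):
    x, g = list(x0), grad(x0)
    s_hist, y_hist = [], []
    for _ in range(max_iter):
        if math.sqrt(dot(g, g)) < tol:
            break
        d = lbfgs_direction(g, s_hist, y_hist)
        fx, slope, t = f(x), dot(g, d), 1.0
        x_new = [xi + t * di for xi, di in zip(x, d)]
        while f(x_new) > fx + 1e-4 * t * slope and t > 1e-12:  # Armijo backtracking
            t *= 0.5
            x_new = [xi + t * di for xi, di in zip(x, d)]
        g_new = grad(x_new)
        s = [a - b for a, b in zip(x_new, x)]
        y = [a - b for a, b in zip(g_new, g)]
        if dot(y, s) > 1e-12:                              # keep curvature positive
            s_hist.append(s)
            y_hist.append(y)
            if len(s_hist) > memory:
                s_hist.pop(0)
                y_hist.pop(0)
        x, g = x_new, g_new
    return x

# Toy convex quadratic with minimum at all ones (illustrative only).
f = lambda x: sum((i + 1) * (xi - 1.0) ** 2 for i, xi in enumerate(x))
grad = lambda x: [2 * (i + 1) * (xi - 1.0) for i, xi in enumerate(x)]
x_min = lbfgs_minimize(f, grad, [0.0] * 10)
```

Memory use grows with `memory` times the number of variables rather than with the square of the number of variables, which is the property that matters at full protein dimension.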
17 Biasing functions
- Used to form secondary structure during the first phase and sometimes in full-dimensional local minimizations.
- Dihedral angle biasing
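  - $E_{\phi\psi} = \sum_{\text{dihedrals}} k_\phi \left[ 1 - \cos(\phi - \phi_0) \right] + k_\psi \left[ 1 - \cos(\psi - \psi_0) \right]$
- Hydrogen bond biasing
  - For α-helices: $E_{HB} = w_i w_{i+4} / \Delta r_{i,i+4}$ (the $w$'s are weights from the server for residues $i$ and $i+4$ in the helix)
  - To form β-sheets from β-strands: $E_{HB\beta} = w_i w_j / \Delta r_{i,j}$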
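The dihedral biasing term can be sketched directly from its cosine form: it is zero when every (phi, psi) pair sits at its target and rises smoothly to 2k when an angle is half a turn away. The sketch assumes the 1 - cos form with angles in radians and uses single force constants `k_phi` and `k_psi` for all residues; the function name is hypothetical.

```python
import math

def dihedral_bias(angles, targets, k_phi=1.0, k_psi=1.0):
    """Sum of k_phi*(1 - cos(phi - phi0)) + k_psi*(1 - cos(psi - psi0)).

    angles and targets are lists of (phi, psi) pairs in radians.
    """
    total = 0.0
    for (phi, psi), (phi0, psi0) in zip(angles, targets):
        total += k_phi * (1.0 - math.cos(phi - phi0))
        total += k_psi * (1.0 - math.cos(psi - psi0))
    return total

# A residue already at an assumed helical target contributes nothing.
helix = (math.radians(-57.0), math.radians(-47.0))
penalty = dihedral_bias([helix], [helix])
```

Because the penalty is periodic and differentiable, it can be added to the energy during minimization without introducing discontinuities at ±180 degrees.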
18 Neural Network Predictions
Sequence
SKIGIDGFGRIGRLVLRAALSCGAQ
Neural nets trained on a large database of
proteins can predict secondary structure likely
to be in a target protein.
Sequence: SKIGIDGFGRIGRLVLRAALSCGAQ
Type:     BBBB B AAAAAAA BBBBB
Weight:   13552 6789992 56673
19 Forming β-sheets from the predicted β-strands
is a combinatorial problem.
- Which strands are paired?
- Which orientation? Anti-parallel or parallel
- Which residues are paired? Even or odd
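A small enumeration shows how quickly these three questions multiply. The sketch below lists the options for a single strand pairing only (which two strands, which orientation, which residue parity); a full sheet combines several such pairings, so the real search space is larger. The function name and the tuple layout are hypothetical.

```python
from itertools import combinations, product

def single_pairing_choices(n_strands):
    """All (strand_i, strand_j, orientation, parity) options for one pairing."""
    orientations = ("anti-parallel", "parallel")
    parities = ("even", "odd")
    return [(i, j, o, p)
            for i, j in combinations(range(n_strands), 2)
            for o, p in product(orientations, parities)]

# Three predicted strands give 3 pairs x 2 orientations x 2 parities.
choices = single_pairing_choices(3)
```

Even at this level the count grows quadratically with the number of strands, which is why statistical information about observed sheet topologies is useful for narrowing the options.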
20 Ruczinski, Kooperberg, Bonneau, and Baker, "Distribution of Beta Sheets in Proteins with Applications to Structure Prediction," Proteins 48, 2002
21 Parallel Organization
- Select k subsets of dihedral angles
- Maintain a queue of (configuration, subspace) for k optimization crews to work on
- Each optimization crew performs a small-scale global optimization of its assigned configuration and subspace.
- Gather intermediate results and re-insert them into the work queue. Idle optimization crews do full-dimensional local minimizations or additional small-scale global optimization.
- Massively parallel exploration of optimization space
- Automatic load balancing
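The queue-of-tasks organization can be sketched with Python's standard `concurrent.futures` pool. This only illustrates the control flow: the "crews" are threads, the energy is a toy sum of squares, and the small-scale search is a random perturbation loop, all of which are stand-ins rather than the authors' implementation (a production code would more likely use processes or message passing). All names are hypothetical.

```python
import random
from concurrent.futures import ThreadPoolExecutor

def toy_energy(config):
    """Stand-in energy: sum of squares, minimized at the origin."""
    return sum(x * x for x in config)

def small_scale_search(task, trials=200):
    """Randomly perturb only the coordinates in the task's subspace."""
    config, subspace, seed = task
    rng = random.Random(seed)
    best = list(config)
    for _ in range(trials):
        trial = list(best)
        for i in subspace:
            trial[i] = best[i] + rng.uniform(-0.5, 0.5)
        if toy_energy(trial) < toy_energy(best):
            best = trial
    return best

def run_rounds(pool, subspaces, k=4, rounds=3):
    """Each round queues every (configuration, subspace) pair for k crews."""
    seed = 0
    with ThreadPoolExecutor(max_workers=k) as crews:
        for _ in range(rounds):
            tasks = []
            for config in pool:
                for sub in subspaces:
                    tasks.append((config, sub, seed))
                    seed += 1
            results = list(crews.map(small_scale_search, tasks))
            # Gather results and keep the best ones as the next round's pool.
            pool = sorted(results, key=toy_energy)[:len(pool)]
    return pool

start = [[2.0, -1.0, 3.0, 0.5], [-3.0, 1.0, 1.0, 2.0]]
final_pool = run_rounds(start, subspaces=[(0, 1), (2, 3)])
```

Because workers pull tasks as they become free, a crew that finishes a cheap subspace early simply takes the next task, which is the load-balancing behavior the slide refers to.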
22 2UTG_A: 7.5 Å R.M.S.D. from crystal structure
1POU: 6.3 Å R.M.S.D. from NMR structure
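For reference, the R.M.S.D. between a predicted and an experimental structure is the square root of the mean squared distance between corresponding atoms. The sketch below assumes the two coordinate sets are already optimally superimposed and list the same atoms in the same order; it does not perform the superposition step, and the function name is hypothetical.

```python
import math

def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between two equally long lists of 3-D points."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate lists must be non-empty and of equal length")
    squared = sum(math.dist(p, q) ** 2 for p, q in zip(coords_a, coords_b))
    return math.sqrt(squared / len(coords_a))

# Toy example: every atom displaced by the vector (3, 4, 0).
model = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
shifted = [(3.0, 4.0, 0.0), (4.0, 4.0, 0.0)]
deviation = rmsd(model, shifted)
```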
23 CASP competition
- Community-wide experiment on the Critical Assessment of Techniques for Protein Structure Prediction
  - Protein crystallographers and NMR spectroscopists provide structures prior to their publication for blind prediction by participants.
  - Biennial competition open to all computational methods, including servers.
  - Difficulty of targets is assessed by which type of method works to predict the structure: comparative modeling (CM), fold recognition (FR), or new fold (NF).
  - We participated in CASP4 (Dec. 2000) and CASP5 (Dec. 2002).
24 Our submitted CASP4 models ranked by target difficulty and relative accuracy
25 Results on Phospholipase C beta C-terminus, turkey (containing 242 amino acids). Ribbon structure comparison between experiment (center), our submitted M1 prediction (right), and a next-generation run of the global optimization algorithm (left). The M1 prediction, our lowest energy submission, had an RMSD with experiment of 8.46 Å. The new run lowered the energy of our previous best minimizer, resulting in a new structure with an RMSD of 7.7 Å.
26 CASP4 Results Summary
- Best structure predicted on one of the hardest targets
- Our method is more effective than some knowledge-based methods on targets for which less information from known proteins is available.
- Global optimization algorithm is very effective at improving structures from a small initial population.
27 Our submitted CASP5 models ranked by target difficulty and relative accuracy
28 Our submitted CASP5 models of targets (domains) that were assessed in the CASP5 NEW FOLD category.
29 Our submissions for CASP5 Target 162
30 CASP5 Results Summary
- Ranked 15th of 165 groups in assessments of New Fold (and NF/FR) results.
- Our method uses less knowledge from known protein structures than most other (New Fold) methods participating in CASP5.
- More diverse starting populations (especially for β-sheet proteins) using the visualization tool led to better performance in some cases.
31 Future Research Directions
- Simpler energy models for early stages of the algorithm, and alternative models of solvation.
- New techniques for choosing β-strand pairings.
- Improve our techniques for maintaining existing secondary structure in our models.