Title: Protein Structure Prediction
1Protein Structure Prediction
- Xiaole Shirley Liu and Jun Liu
- STAT115
2Protein Structure Prediction
Ram Samudrala, University of Washington
3Outline
- Motivations and introduction
- Protein 2nd structure prediction
- Protein 3D structure prediction
- CASP
- Homology modeling
- Fold recognition
- ab initio prediction
- Manual vs automation
- Structural genomics
4Protein Structure
- Sequence determines structure, and structure determines function
- Most proteins can fold by themselves very quickly
- The folded structure is the lowest energy state
5Protein Structure
- Main forces to consider
- Steric complementarity
- Secondary structure preferences (satisfy H bonds)
- Hydrophobic/polar patterning
- Electrostatics
6Rationale for understanding protein structure and function
- Protein sequence: large numbers of sequences, including whole genomes
?
- Protein function:
- rational drug design and treatment of disease
- protein and genetic engineering
- build networks to model cellular pathways
- study organismal function and evolution
7Protein Databases
- SwissProt: protein knowledgebase
- PDB: Protein Data Bank, 3D structure
8View Protein Structure
- Free interactive viewers
- Download 3D coordinate file from PDB
- Quick and dirty
- VRML
- Rasmol
- Chime
- More powerful
- Swiss-PdbViewer
9Compare Protein Structures
- Structure is more conserved than sequence
- Why compare?
- Detect evolutionary relationships
- Identify recurring structural motifs
- Predicting function based on structure
- Assess predicted structures
- Protein structure comparison and classification
- Manual: SCOP
- Automated: DALI
10Compare protein structures
- Need ways to determine whether two protein structures are related, and to compare predicted models to experimental structures
- A commonly used measure is the root mean square deviation (RMSD) of the Cartesian atom coordinates between two structures after optimal superposition (McLachlan, 1979)
- Usually use Cα atoms
- Other measures include contact maps and torsion angle RMSDs
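The RMSD-after-superposition measure above can be sketched in a few lines. This is a minimal illustration, assuming NumPy is available and that the two structures are given as equal-length arrays of matched Cα coordinates. It uses the Kabsch SVD procedure for the optimal rotation, which is one standard choice and not necessarily the exact procedure of the cited reference. The function name `superposed_rmsd` is hypothetical.

```python
import numpy as np

def superposed_rmsd(P, Q):
    """RMSD between two (N, 3) coordinate arrays after optimal superposition."""
    P = np.asarray(P, dtype=float)
    Q = np.asarray(Q, dtype=float)
    # Remove the translation by centering both coordinate sets.
    Pc = P - P.mean(axis=0)
    Qc = Q - Q.mean(axis=0)
    # Kabsch: the SVD of the 3x3 covariance matrix gives the optimal rotation.
    U, _, Vt = np.linalg.svd(Pc.T @ Qc)
    # Flip one axis if needed so the result is a proper rotation, not a reflection.
    d = np.sign(np.linalg.det(U @ Vt))
    R = U @ np.diag([1.0, 1.0, d]) @ Vt
    diff = Pc @ R - Qc
    return float(np.sqrt((diff ** 2).sum() / len(P)))
```

A structure compared with a rotated and translated copy of itself gives an RMSD of essentially zero, which is a quick sanity check before comparing a predicted model with an experimental structure.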
11SCOP
- Compare protein structure, identify recurring structural motifs, predict function
- A. Murzin et al., 1995
- Manual classification
- A few folds are highly populated
- 5 folds contain 20% of all homologous superfamilies
- Some folds are multifunctional
12Determine Protein Structure
- X-ray crystallography (gold standard)
- Grow crystals: rate limiting, relies on the repeating structure of a crystalline lattice
- Collect a diffraction pattern
- Map to real space electron density, build and refine structural model
- Painstaking and time consuming
13Protein Structure Prediction
- Since AA sequence determines structure, can we predict protein structure from its AA sequence?
- Predicting the three backbone torsion angles of every residue: unlimited degrees of freedom (DoF)!
- Physical properties that determine fold
- Rigidity of the protein backbone
- Interactions among amino acids, including
- Electrostatic interactions
- van der Waals forces
- Volume constraints
- Hydrogen, disulfide bonds
- Interactions of amino acids with water
14Protein folding landscape
- Large multi-dimensional space of changing conformations
- Figure axes: free energy, folding reaction
15Protein primary structure
162nd Structure Prediction
- α helix, β sheet, turn/loop
172nd Structure Prediction
- Chou-Fasman 1974
- Based on 15 proteins (2473 AAs) of known conformation, determine Pα and Pβ, which range from about 0.5 to 1.5
- Empirical rules for 2nd struct nucleation
- 4 Hα or hα out of 6 AA, extend in both directions, Pα > 1.03, Pα > Pβ, no α breakers
- 3 Hβ or hβ out of 5 AA, extend in both directions, Pβ > 1.05, Pβ > Pα, no β breakers
- Accuracy of 50-60%
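The helix nucleation rule above (4 formers in a window of 6, then a check on the average propensities) can be sketched as follows. This is a simplified illustration, not the full Chou-Fasman procedure: it omits extension and breaker handling. The residue class labels and propensity values are supplied by the caller, and the function names `helix_nuclei` and `passes_helix_thresholds` are hypothetical.

```python
def helix_nuclei(classes, window=6, min_formers=4):
    """Return start indices of windows that satisfy the nucleation rule.

    classes: one label per residue, "H" (strong former), "h" (former),
    or anything else (indifferent or breaker). No propensity table is
    built into this sketch; the labels are inputs.
    """
    starts = []
    for i in range(len(classes) - window + 1):
        formers = sum(1 for c in classes[i:i + window] if c in ("H", "h"))
        if formers >= min_formers:
            starts.append(i)
    return starts


def passes_helix_thresholds(p_alpha, p_beta, min_mean=1.03):
    """Check the segment averages: mean P-alpha > 1.03 and > mean P-beta."""
    mean_a = sum(p_alpha) / len(p_alpha)
    mean_b = sum(p_beta) / len(p_beta)
    return mean_a > min_mean and mean_a > mean_b


# Toy class string (hypothetical labels, not taken from a real protein).
print(helix_nuclei(list("iHHhHibbHi")))
```

The sheet rule has the same shape with a window of 5, a minimum of 3 formers, and a threshold of 1.05, so the same two functions could be reused with different arguments.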
18Pα and Pβ
192nd Structure Prediction
- Garnier, Osguthorpe, Robson, 1978
- Assumption: each AA is influenced by flanking positions
- GOR scoring tables (problem: limited dataset)
- Add scores, assign the 2nd structure with the highest score
202nd Structure Prediction
- D. Eisenberg, 1986
- Plot hydrophobicity as a function of sequence position, look for periodic repeats
- Period 3-4 AA: α helix (3.6 AA / turn)
- Period 2 AA: β sheet
- Best overall: JPRED by Geoffrey Barton, uses many different approaches and takes a consensus
- Overall accuracy 72.9%
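One way to quantify the periodicity described above is a hydrophobic-moment-style sum, in which each residue's hydrophobicity is placed on a circle at a fixed angular step. A period of 3.6 residues corresponds to 100 degrees per residue and a period of 2 to 180 degrees. This is a sketch under the assumption that a numeric hydrophobicity value per residue is already available. It is not claimed to be the exact procedure referenced on the slide, and the function name `hydrophobic_moment` is hypothetical.

```python
import cmath
import math

def hydrophobic_moment(values, angle_deg):
    """Magnitude of sum_n values[n] * exp(i * n * angle).

    values: hydrophobicity per residue (any scale; supplied by the caller).
    angle_deg: angular step per residue, e.g. 100 for a helix, 180 for a strand.
    """
    delta = math.radians(angle_deg)
    return abs(sum(v * cmath.exp(1j * n * delta) for n, v in enumerate(values)))


# Toy input: strictly alternating hydrophobic/polar pattern (period 2).
alternating = [1.0, -1.0] * 4
print(hydrophobic_moment(alternating, 180), hydrophobic_moment(alternating, 100))
```

For the alternating toy pattern the moment is large at 180 degrees and small at 100 degrees, which is the signature of a strand-like period-2 repeat.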
213D Protein Structure Prediction
- CASP contest: Critical Assessment of Structure Prediction
- Biennial meeting (every two years) since 1994 at Asilomar, CA
- Experimentalists: before CASP, submit the sequence of a to-be-solved structure to a central repository
- Predictors: download sequence and minimal information, make predictions in three categories
- Assessors: automatic programs and experts evaluate prediction quality
22CASP Category I
- Homology modeling (sequences with high homology to sequences of known structure)
- Given a sequence with > 25-30% homology to a known structure in PDB, use the known structure as the starting point to create a model of the 3D structure of the sequence
- Takes advantage of knowledge of a closely related protein. Use sequence alignment techniques to establish correspondences between the known template and the unknown.
23CASP Category II
- Fold recognition (sequences with no sequence identity (< 30%) to sequences of known structure)
- Given the sequence, and a set of folds observed in PDB, see if any of the sequences could adopt one of the known folds
- Takes advantage of knowledge of existing structures, and principles by which they are stabilized (favorable interactions)
24CASP Category III
- Ab initio prediction (no known homology with any sequence of known structure)
- Given only the sequence, predict the 3D structure from first principles, based on energetic or statistical principles
- Secondary structure prediction and multiple alignment techniques are used to predict features of these molecules. Then, some method is necessary for assembling the 3D structure.
25Structure Prediction Evaluation
- Hydrophobic core similar?
- 2nd struct identified?
- Energy minimized? H-bond contacts?
- Compare with solved crystal structure (gold standard)
26Comparative modelling of protein structure
refine
27Homology Modeling Results
- When sequence homology is > 70%, high resolution models are possible (< 3 Å RMSD)
- MODELLER (Sali et al.)
- Find homologous proteins with known structure and align
- Collect distance distributions between atoms in known protein structures
- Use these distributions to compute positions for equivalent atoms in alignment
- Refine using energetics
28Homology Modeling Results
- Many places can go wrong
- Bad template: it doesn't have the same structure as the target after all
- Bad alignment (a very common problem)
- Good alignment to good template still gives wrong local structure
- Bad loop construction
- Bad side chain positioning
29Homology Modeling Results
- Use of sensitive multiple alignment techniques (e.g. PSI-BLAST) helped get the best alignments
- Sophisticated energy minimization techniques do not dramatically improve upon the initial guess
30Fold Recognition Results
- Also called protein threading
- Given a new sequence and a library of known folds, find the best alignment of the sequence to each fold and return the most favorable one
31Fold Recognition with Dynamic Programming
- Environmental class for each AA based on known
folds (buried status, polarity, 2nd struct)
32Protein Folding with Dynamic Programming
- D. Eisenberg 1994
- Align sequence to each fold (a string of environmental classes)
- Advantages: fast and works pretty well
- Disadvantages: does not consider AA contacts
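Aligning a sequence to a fold written as a string of environmental classes can be sketched with a standard global-alignment recurrence. This is a minimal illustration, assuming a linear gap penalty and a caller-supplied table that scores each (amino acid, environment class) pair. The two-class toy table and the names `thread_score` and `best_fold` are hypothetical, not values or names from the original method.

```python
def thread_score(seq, envs, score, gap=-2.0):
    """Best global alignment score of a sequence against a fold's
    environment-class string (Needleman-Wunsch style dynamic programming)."""
    n, m = len(seq), len(envs)
    dp = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = i * gap
    for j in range(1, m + 1):
        dp[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Residue i placed in environment j, or a gap on either side.
            match = dp[i - 1][j - 1] + score.get((seq[i - 1], envs[j - 1]), 0.0)
            dp[i][j] = max(match, dp[i - 1][j] + gap, dp[i][j - 1] + gap)
    return dp[n][m]


def best_fold(seq, folds, score, gap=-2.0):
    """Name of the fold whose environment string scores highest."""
    return max(folds, key=lambda name: thread_score(seq, folds[name], score, gap))


# Toy table: "B" = buried, "E" = exposed (made-up scores for illustration).
toy_score = {("L", "B"): 1.0, ("L", "E"): -1.0, ("K", "E"): 1.0, ("K", "B"): -1.0}
toy_folds = {"alternating": "BEBE", "all_buried": "BBBB"}
print(best_fold("LKLK", toy_folds, toy_score))
```

Because each residue is scored only against the environment of its own position, the recurrence stays quadratic in time, which reflects the speed advantage listed above. It also shows the listed disadvantage: no term depends on which residues end up in contact with each other.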
33Fold Recognition Results
- Each predictor can submit N top hits
- Every predictor does well on something
- Common folds (more examples) are easier to recognize
- Fold recognition was the surprise performer at CASP1. Incremental progress at CASP2, CASP3, CASP4
34Fold Recognition Results
- Alignment (seq to fold) is a big problem
35ab initio
- Predict interresidue contacts and then compute structure (mild success)
- Simplified energy term and reduced search space (phi/psi or lattice) (moderate success)
- Creative ways to memorize sequence-to-structure correlations in short segments from the PDB, and use these to model new structures: ROSETTA
36Ab initio prediction of protein structure
- Sample conformational space such that native-like conformations are found
- Hard to design functions that are not fooled by non-native conformations (decoys)
- Astronomically large number of conformations: 5 states per residue and 100 residues give $5^{100} \approx 10^{70}$
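The size estimate for the conformational space follows from a one-line change of base, assuming 5 discrete states per residue and a chain of 100 residues as stated on the slide:

```latex
\[
5^{100} = 10^{\,100 \log_{10} 5} \approx 10^{\,100 \times 0.699} = 10^{69.9} \approx 10^{70}
\]
```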
37Sampling conformational space: continuous approaches
- Most work in the field
- Molecular dynamics
- Continuous energy minimization (follow a valley)
- Monte Carlo simulation
- Genetic Algorithms
- Like real polypeptide folding process
- Cannot be sure if native-like conformations are
sampled
38Molecular dynamics
- Force $= -dU/dx$ (slope of potential $U$); acceleration from force $= m\,a(t)$
- All atoms are moving, so forces between atoms are complicated functions of time
- Analytical solution for $x(t)$ and $v(t)$ is impossible; numerical solution is trivial
- Atoms move for very short time steps of $10^{-15}$ seconds, or 0.001 picoseconds (ps)
- $x(t+\Delta t) = x(t) + v(t)\,\Delta t + [4a(t) - a(t-\Delta t)]\,\Delta t^2/6$
- $v(t+\Delta t) = v(t) + [2a(t+\Delta t) + 5a(t) - a(t-\Delta t)]\,\Delta t/6$
- $U_{kinetic} = \tfrac{1}{2} \sum_i m_i v_i(t)^2 = \tfrac{1}{2}\, n\, k_B T$
- Total energy ($U_{potential} + U_{kinetic}$) must not change with time
- Equation labels: $x(t+\Delta t)$ new position, $x(t)$ old position, $v(t+\Delta t)$ new velocity, $v(t)$ old velocity, $a$ acceleration
- $n$ is the number of coordinates (not atoms)
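The position and velocity updates above have the form of the Beeman integration scheme. The sketch below applies them to a single coordinate, which is enough to check that total energy stays nearly constant. It assumes unit mass and a caller-supplied acceleration function. At the first step there is no earlier acceleration, so the current one is reused, which is an assumption of this sketch. The function name `beeman` is hypothetical.

```python
def beeman(x, v, accel, dt, steps):
    """Integrate one coordinate with the update rules from the slide.

    accel(x) returns the acceleration, i.e. -(1/m) dU/dx, at position x.
    Returns the list of (position, velocity) pairs, including the start.
    """
    a_now = accel(x)
    a_prev = a_now  # assumption: no history at t = 0, so reuse a(0)
    traj = [(x, v)]
    for _ in range(steps):
        # New position from old position, old velocity and two accelerations.
        x_new = x + v * dt + (4.0 * a_now - a_prev) * dt * dt / 6.0
        a_new = accel(x_new)
        # New velocity also uses the acceleration at the new position.
        v_new = v + (2.0 * a_new + 5.0 * a_now - a_prev) * dt / 6.0
        x, v, a_prev, a_now = x_new, v_new, a_now, a_new
        traj.append((x, v))
    return traj


# Harmonic oscillator with m = k = 1, so a(x) = -x and U = x^2 / 2.
trajectory = beeman(1.0, 0.0, lambda x: -x, dt=0.01, steps=1000)
x_end, v_end = trajectory[-1]
print(0.5 * v_end ** 2 + 0.5 * x_end ** 2)  # stays close to the initial 0.5
```

For the harmonic test case the total energy printed at the end should remain close to its starting value, which is the conservation check named in the last bullets of the slide.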
39Energy minimization
- For a given protein, the energy depends on thousands of x, y, z Cartesian atomic coordinates; reaching a deep minimum is not trivial
- Furthermore, we want to minimize the free energy, not just the potential energy.
40Monte Carlo Simulation
- Propose moves in torsion or Cartesian conformation space
- Evaluate energy after every move, compute ΔE
- Accept the new conformation based on the Metropolis criterion: always if ΔE ≤ 0, otherwise with probability $e^{-\Delta E / k_B T}$
- If run for infinite time, the simulated conformations follow the Boltzmann distribution
- Many variations, including simulated annealing and other heuristic approaches.
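The propose, evaluate, accept loop can be sketched with the Metropolis acceptance rule, which is the standard rule that yields the Boltzmann distribution in the long run. This is a toy on a single coordinate with a caller-supplied energy function, not a protein simulation. The uniform proposal, the default parameters, and the function name `metropolis` are assumptions of this sketch.

```python
import math
import random

def metropolis(energy, x0, steps, kT=1.0, step_size=1.0, seed=0):
    """Sample one coordinate with Metropolis Monte Carlo; returns all samples."""
    rng = random.Random(seed)
    x, e = x0, energy(x0)
    samples = []
    for _ in range(steps):
        # Propose a symmetric random move and compute the energy change.
        x_new = x + rng.uniform(-step_size, step_size)
        e_new = energy(x_new)
        d_e = e_new - e
        # Downhill moves are always accepted; uphill moves with
        # probability exp(-dE / kT).
        if d_e <= 0 or rng.random() < math.exp(-d_e / kT):
            x, e = x_new, e_new
        samples.append(x)
    return samples


# Harmonic energy U = x^2 / 2 at kT = 1: the Boltzmann distribution is a
# standard normal, so the sample mean and variance should approach 0 and 1.
samples = metropolis(lambda x: 0.5 * x * x, 0.0, 50000)
```

Lowering `kT` gradually during the run turns the same loop into simulated annealing, the variation mentioned in the last bullet.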
41Scoring/energy functions
- Need a way to select native-like conformations from non-native ones
- Physics-based functions: electrostatics, van der Waals, solvation, bond/angle terms.
- Knowledge-based scoring functions
- Derive information about atomic properties from a database of experimentally determined conformations
- Common parameters include pairwise atomic distances and amino acid burial/exposure.
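A common way to turn database statistics into a knowledge-based score is the inverse Boltzmann relation, E = -kT ln(P_observed / P_reference), applied per distance bin. The slide does not name this formula, so its use here is an assumption. The counts in the example are made up for illustration, and the function name `knowledge_based_energies` is hypothetical.

```python
import math

def knowledge_based_energies(observed, reference, kT=1.0, pseudocount=1.0):
    """Per-bin energies E = -kT * ln(P_observed / P_reference).

    observed: counts of an atom pair in each distance bin, from known structures.
    reference: counts in the same bins for a reference state (e.g. all pairs).
    pseudocount: added to every bin so that empty bins do not give log(0).
    """
    n_obs = sum(observed) + pseudocount * len(observed)
    n_ref = sum(reference) + pseudocount * len(reference)
    energies = []
    for o, r in zip(observed, reference):
        p_obs = (o + pseudocount) / n_obs
        p_ref = (r + pseudocount) / n_ref
        energies.append(-kT * math.log(p_obs / p_ref))
    return energies


# Made-up counts: the first distance bin is over-represented for this pair.
print(knowledge_based_energies([30, 10, 10], [10, 10, 10], pseudocount=0.0))
```

Bins seen more often than in the reference get a negative (favorable) energy, and bins seen less often get a positive one. A conformation is then scored by summing the bin energies over its atom pairs.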
42Rosetta
- D. Baker, U. Wash
- Break sequence into short segments (7-9 AA)
- Sample 3D from library of known segment structures, parallel computation
- Use simulated annealing (Metropolis-type algorithm) for global optimization
- Propose a change; if the energy is better, take it, otherwise take it with a smaller probability
- Create 1000 structures, cluster, and choose one representative from each cluster to submit
43Manual Improvements and Automation
- Very often manual examination could improve prediction
- Catch errors
- Need domain knowledge
- A. Murzin's success at CASP2
- CAFASP: Critical Assessment of Fully Automated Structure Prediction
- Murzin can't play!!
- MetaServers: combine different methods to get consensus
44CAFASP Evaluation
45Structural Genomics
- With more and more solved structures and novel folds, computational protein structure prediction is going to improve
- Structural genomics
- Worldwide initiative to determine many protein structures in high throughput
- Especially, solve structures that have no homology to known structures
46Summary
- Protein structures: 1st, 2nd, 3rd, 4th
- Different DBs: SwissProt, PDB and SCOP
- Determine structure: X-ray crystallography
- Protein structure prediction
- 2nd structure prediction
- Homology modeling
- Fold recognition
- Ab initio
- Evaluation: energy, RMSD, etc
- CASP and CAFASP contests
- Manual improvement and combination of computational approaches work better
- Structural genomics; still a very difficult problem
47Acknowledgement
- Amy Keating
- Michael Yaffe
- Mark Craven
- Russ Altman