Title: Protein Structure Prediction
1Protein Structure Prediction
2Protein structure
- Most proteins will fold spontaneously in water,
so amino acid sequence alone should be enough to
determine protein structure
- However, the physics are daunting
- 20,000 protein atoms, plus equal amounts of
water
- Many non-local interactions
- Folding can take seconds (most chemical reactions take
place 10^12, i.e. 1,000,000,000,000x, faster)
- Empirical determinations of protein structure are
advancing rapidly.
3Protein review
- Proteins are polymers of amino acids linked by
peptide bonds.
- Properties of proteins are determined by both the
particular sequence of amino acids and by the
conformation (fold) of the protein.
- Flexibility in the bonds around Cα
- φ (phi)
- ψ (psi)
- sidechain
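The backbone angles named above are dihedral (torsion) angles: φ is measured over the atoms C(i-1), N(i), Cα(i), C(i), and ψ over N(i), Cα(i), C(i), N(i+1). As a minimal sketch (the function names and test coordinates are illustrative assumptions, not from the slides), a dihedral angle can be computed from four 3D points like this:

```python
import math

def subtract(a, b):
    return tuple(x - y for x, y in zip(a, b))

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def dihedral_degrees(p0, p1, p2, p3):
    """Torsion angle (degrees) about the p1-p2 bond for four points."""
    b0 = subtract(p0, p1)
    b1 = subtract(p2, p1)
    b2 = subtract(p3, p2)
    norm = math.sqrt(dot(b1, b1))
    axis = tuple(c / norm for c in b1)  # unit vector along the central bond
    # Project the two outer bonds onto the plane perpendicular to the axis.
    v = subtract(b0, tuple(dot(b0, axis) * c for c in axis))
    w = subtract(b2, tuple(dot(b2, axis) * c for c in axis))
    x = dot(v, w)
    y = dot(cross(axis, v), w)
    return math.degrees(math.atan2(y, x))

# Hypothetical coordinates: a planar "cis" and a planar "trans" arrangement.
cis_angle = dihedral_degrees((1, 0, 0), (0, 0, 0), (0, 1, 0), (1, 1, 0))
trans_angle = dihedral_degrees((1, 0, 0), (0, 0, 0), (0, 1, 0), (-1, 1, 0))
```

Feeding in real backbone coordinates (for example from a PDB entry) in the atom orders given above would yield φ and ψ for each residue.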
4Protein Structure Levels
- Protein structure is described in four levels
- Primary structure: amino acid sequence
- Secondary structure: local (in sequence) ordering
into
- (α) Helices: compressed, corkscrew structures
- (β) Strands: extended, nearly straight structures
- (β) Sheets: paired strands, reinforced by hydrogen
bonds
- Parallel (same direction) or antiparallel sheets
- Coils, turns and loops: changes in direction
- Tertiary structure: global ordering (all
angles/atoms)
- Quaternary structure: multiple, disconnected
amino acid chains interacting to form a larger
structure
5Protein structure cartoons
6Protein Structure Representations
- Different visualizations show various aspects
of structure
7Protein Folding
- Proteins are created linearly and then assume
their tertiary structure by folding.
- Exact mechanism is still unknown
- Mechanistic simulations can be illuminating
- Proteins assume the lowest energy structure
- Or sometimes an ensemble of low energy
structures.
- Hydrophobic collapse drives the process
- Local (secondary) structure proclivities
- Internal stabilizers
- Hydrogen bonds, disulphide bonds, salt bridges.
8Empirical structure determination
- Two major experimental methods for determining
protein structure
- X-ray crystallography
- Requires growing a crystal of the protein
(impossible for some, never easy)
- Diffraction pattern can be inverse-Fourier
transformed to characterize electron densities
(phase problem)
- Nuclear Magnetic Resonance (NMR) spectroscopy
- Provides distance constraints, but can be hard to
find a corresponding structure
- Works only for relatively small proteins (so far)
9X-ray crystallography
- X-rays, since wavelength is near the distance
between bonded carbon atoms
- Maps electron density, not atoms directly
- Crystal to get a lot of spatially aligned atoms
- Have to invert Fourier transform to get
structure, but only have amplitudes, not phases
- Guess! or perturb...
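A toy 1D illustration of why missing phases matter. Two different "densities" (hypothetical numbers, one a circular shift of the other) produce identical Fourier amplitudes, so amplitudes alone cannot tell them apart. This is only a sketch with a hand-written discrete Fourier transform, not a model of real diffraction data:

```python
import cmath

def dft(values):
    """Plain discrete Fourier transform of a list of real numbers."""
    n = len(values)
    return [
        sum(v * cmath.exp(-2j * cmath.pi * k * i / n) for i, v in enumerate(values))
        for k in range(n)
    ]

def amplitudes(values):
    """What a detector would record: magnitudes only, phases discarded."""
    return [abs(c) for c in dft(values)]

# Hypothetical 1D electron densities; density_b is density_a shifted by 3 cells.
density_a = [0.0, 3.0, 1.0, 0.0, 0.0, 2.0, 0.0, 0.0]
density_b = density_a[3:] + density_a[:3]

# The shift changes only the phases, so the recorded amplitudes agree.
same_amplitudes = all(
    abs(x - y) < 1e-9 for x, y in zip(amplitudes(density_a), amplitudes(density_b))
)
```

Recovering the density therefore needs phase estimates from elsewhere, which is where the guessing and perturbing come in.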
10NMR structure determination
- NMR can detect certain features of hydrogen
atoms
- NOESY measures distances between non-bonded H's
within about 5 Å
- COSY and TOCSY describe relations through bonds
- Combination of distance and angle constraints,
plus knowledge of covalent bonds (amino acid
sequence) determines a unique (sometimes)
structure.
- Overlapping measurements limit size to about 120 AA
11Why predict protein structure?
- Neither crystallography nor NMR can keep pace
with genome sequencing efforts
- Only 10,566 (3,641 with <90% identity) human
proteins in PDB, although growing fast
- Computer scientists love this problem
- Understandable with minimal biology
- Seems like a good discrimination task
- Understand the mechanisms of folding (?)
- First computational Nobel prize?
12Kinds of Structure Prediction
- Comparative modelling
- Homolog has known structure, which is adjusted
for sequence differences
- Energy minimization and molecular dynamics
- Fold recognition
- Proteins fall into broad fold classes. Models of
folds that recognize compatible sequences.
Inverse problem
- Predict more than fold class?
- Ab initio or new fold prediction
- No homologs, not recognized by any fold model
13Ab Initio predictions
- Three broad approaches
- Molecular dynamics, energy minimization
approaches
- Empirical black box (induce discriminators)
- Mechanistic (follow the actual folding path)
approaches. Hybrid between energy and empirical
methods.
- Secondary structure predictions
- Neither tremendously useful nor accurate, but
simplest.
- Can play a role in tertiary predictors
- Tertiary structure predictions
- Best involve a complex mixture of approaches
14Energy Minimization
- Many forces act on a protein
- Hydrophobic: inside of protein wants to avoid
water
- Packing: atoms can't be too close, nor too far
away
- Bond angle/length constraints
- Long distance, e.g.
- Electrostatics and hydrogen bonds
- Disulphide bonds
- Salt bridges
- Can calculate all of these forces, and minimize
- Intractable in general case, but can be useful
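To make "calculate the forces and minimize" concrete, here is a minimal sketch for a single packing term. It assumes a Lennard-Jones 12-6 potential between two atoms (a common textbook choice, not something the slides specify) and uses plain gradient descent on the inter-atom distance. Parameter values are illustrative:

```python
def lennard_jones(r, epsilon=1.0, sigma=1.0):
    """Packing energy: strongly repulsive when too close, weakly attractive when apart."""
    sr6 = (sigma / r) ** 6
    return 4.0 * epsilon * (sr6 * sr6 - sr6)

def lennard_jones_gradient(r, epsilon=1.0, sigma=1.0):
    """Derivative of the energy with respect to the distance r."""
    sr6 = (sigma / r) ** 6
    return 4.0 * epsilon * (-12.0 * sr6 * sr6 + 6.0 * sr6) / r

def minimize_distance(r_start, step=0.01, iterations=5000):
    """Gradient descent: repeatedly move downhill in energy."""
    r = r_start
    for _ in range(iterations):
        r -= step * lennard_jones_gradient(r)
    return r

# Start too far apart; the atoms settle at the energy minimum.
r_min = minimize_distance(1.5)
```

With one degree of freedom this converges easily. A real protein has thousands of coupled coordinates and many local minima, which is why the general case is intractable.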
15Empirical models
- Pose structure prediction as induction task.
- What are the inputs and outputs?
- Where do we get enough training data?
- Which induction methods work best?
- Long history in bioinformatics
16Initial approaches to secondary structure
prediction
- Input is a "sliding window" of immediately
surrounding sequence assumed to determine
structure (no long distance interactions)
[Figure: sequence window ...mnnstnssnsgla... labeled with output state H]
- Output is one of three possible secondary
structure states: helix, strand, other
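A minimal sketch of the sliding-window input described above. The function name, the padding character, and the window half-width are illustrative assumptions. Each residue gets one fixed-length window centered on it:

```python
def sliding_windows(sequence, half_width, pad="-"):
    """Return one window per residue, centered on that residue.

    Positions beyond the ends of the sequence are filled with `pad`.
    """
    padded = pad * half_width + sequence + pad * half_width
    size = 2 * half_width + 1
    return [padded[i:i + size] for i in range(len(sequence))]

# The 13-residue fragment from the slide, with 6 neighbors on each side.
windows = sliding_windows("mnnstnssnsgla", half_width=6)
```

Each window would then be passed to a predictor that emits one of the three states for the central residue.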
17Why might this work?
- There are local propensities to secondary
structural classes (largely hydropathy)
- Helices: no prolines, sometimes amphipathic (show
alternating hydropathy with period 3.6 residues)
- Strands: either alternating hydropathy, or ends
hydrophilic and center hydrophobic
- Neither: small, polar, flexible residues.
Prolines.
- Minimum lengths for secondary structures (helices
longer than strands)
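One way to quantify the periodicity mentioned above is a hydrophobic moment: treat each residue's hydropathy as a vector rotated by a fixed angle per residue and measure the length of the sum. The sketch below is not from the slides. It assumes 100 degrees per residue for a helix (360/3.6) and 180 degrees for a strand, and uses made-up hydropathy values rather than a published scale:

```python
import math

def hydrophobic_moment(hydropathy, degrees_per_residue):
    """Length of the vector sum of hydropathy values placed around an axis."""
    step = math.radians(degrees_per_residue)
    x = sum(h * math.cos(i * step) for i, h in enumerate(hydropathy))
    y = sum(h * math.sin(i * step) for i, h in enumerate(hydropathy))
    return math.hypot(x, y)

HELIX_ANGLE = 360.0 / 3.6  # one turn every 3.6 residues
STRAND_ANGLE = 180.0       # side chains alternate faces

# Hypothetical hydropathy profiles.
alternating = [1.0, -1.0] * 4  # strand-like: alternates every residue
helix_like = [math.cos(math.radians(100.0 * i)) for i in range(18)]  # period 3.6
```

A profile that alternates every residue gives a large moment at 180 degrees, while one with period 3.6 gives a large moment at 100 degrees. That is the kind of local signal a window-based predictor can pick up.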
18Early methods
- Chou-Fasman method looked at frequency of each
amino acid in window
- GOR defined an information measure
$I(S;R) = \log\left[P(S \mid R)/P(S)\right]$, where S is secondary structure
and R is amino acid. Define information gain as
$I(S;R) - I(\bar{S};R)$, where $\bar{S}$ denotes the states other than S,
and predict the state with highest gain.
- How to combine info gain for each element of
sliding window? Independently (just add) or by
pairs
19How well did they work?
- Not very well: roughly 50-55% accurate on a residue by
residue basis.
- Random prediction that obeyed the observed
distribution of helix/strand/other would be about 40% accurate
- Different ways to calculate "correctness"
- Needs to be unbiased (especially wrt homology)!
- Getting number of helices and strands or order
right is harder than just counting residue by
residue (like the difference between nucleotide
and exon level gene finding).
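A small sketch of the residue-by-residue accuracy and the random baseline discussed above. If states are guessed at random with the observed frequencies p_i, the expected accuracy is the sum of p_i squared. The 30/20/50 composition below is a hypothetical example, not a measured distribution:

```python
from collections import Counter

def residue_accuracy(predicted, observed):
    """Fraction of residues whose predicted state matches the observed one."""
    matches = sum(p == o for p, o in zip(predicted, observed))
    return matches / len(observed)

def random_baseline(observed):
    """Expected accuracy of random guessing that follows the observed state frequencies."""
    counts = Counter(observed)
    total = len(observed)
    return sum((c / total) ** 2 for c in counts.values())

# Hypothetical labels: 30% helix, 20% strand, 50% other.
observed = "HHHEECCCCC"
predicted = "HHHCCCCCCC"  # gets every helix and coil right, misses both strand residues
```

Here the prediction is 80% correct per residue yet finds no strands at all, and the random baseline for this composition is 0.3² + 0.2² + 0.5² = 38%, close to the 40% figure quoted above.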
20Fancier induction techniques
- Same setup as Chou-Fasman or GOR
- Sliding window across amino acid sequence as
input
- Three class output (helix/sheet/other)
- Various induction techniques over the same
data give modest improvements
- LDA/QDA
- Decision trees
- Neural networks
- Best results from neural networks (~62%)
21Add multiple sequence alignment information
- This is helpful in principle
- insertions/deletions more likely to be coil/turn
- conserved hydropathy more important for
prediction than non-conserved.
- GOR method improves 8-9 points (to about 64%
correct residue by residue).
- Similar improvement for NNs (to 68%)
- SVMs gain a bit more, to about 70%
22But the information isn't there
- Prediction quality has not improved much even
with huge growth of training data.
- Secondary structure is not completely determined
by local forces
- Long distance interactions do not appear in
sliding window
- Empirical studies show same amino acid sequences
can assume multiple secondary structures.
23Mechanistic models
- Move from purely empirical to include some
knowledge of folding mechanisms
- Compact nature of conformations
- Hydrophobic packing
- Sequences of secondary structures
- Secondary structure predispositions
- Heuristic global energy minimization
24Hydrophobic packing models
- Dill's HP model
- Two classes of amino acids, hydrophobic (H) and
polar (P)
- Lattice model for position of (point) amino
acids.
- Thread chain of H's and P's through lattice to
maximize number of H-H contacts
[Figures: example HP lattice folds in 3D and 2D]
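A minimal sketch of the 2D HP lattice model. It scores a fold by counting H-H pairs that are lattice neighbors but not chain neighbors, and finds the best fold of a short chain by brute-force enumeration of self-avoiding walks. Function names are illustrative, and exhaustive search is only feasible for very short chains:

```python
MOVES = [(1, 0), (-1, 0), (0, 1), (0, -1)]

def count_hh_contacts(sequence, positions):
    """Count H-H pairs adjacent on the lattice but not consecutive in the chain."""
    index_at = {pos: i for i, pos in enumerate(positions)}
    contacts = 0
    for i, (x, y) in enumerate(positions):
        if sequence[i] != "H":
            continue
        for dx, dy in MOVES:
            j = index_at.get((x + dx, y + dy))
            # j > i + 1 counts each non-consecutive pair exactly once.
            if j is not None and j > i + 1 and sequence[j] == "H":
                contacts += 1
    return contacts

def best_fold(sequence):
    """Exhaustively thread the chain through the lattice; return (contacts, positions)."""
    best = (-1, None)

    def extend(positions, occupied):
        nonlocal best
        if len(positions) == len(sequence):
            score = count_hh_contacts(sequence, positions)
            if score > best[0]:
                best = (score, list(positions))
            return
        x, y = positions[-1]
        for dx, dy in MOVES:
            nxt = (x + dx, y + dy)
            if nxt not in occupied:  # self-avoiding: one residue per lattice site
                positions.append(nxt)
                occupied.add(nxt)
                extend(positions, occupied)
                positions.pop()
                occupied.remove(nxt)

    extend([(0, 0)], {(0, 0)})
    return best

best_contacts, best_positions = best_fold("HHPPHH")
```

The number of walks grows exponentially with chain length, which is why this direct search does not scale.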
25But...
- Even the 2D HP packing problem (which is easier
than the 3D one) turns out to be NP-complete!
- Good approximation results exist.
- Approximation to within 3/8 of optimal (3D)
- In triangular lattice, algorithm for >60% of
optimal packing
- Other interesting results in the model, e.g.
- Which sequences have a single optimal fold?
26CASP changed the landscape
- Critical Assessment of Structure Prediction
competition. Even numbered years since 1994
- Solved, but unpublished structures are posted in
May, predictions due in September, evaluations in
December
- Various categories
- Relation to existing structures, ab initio,
homology, fold, etc.
- Partial vs. fully automated approaches
- Produces lots of information about what aspects
of the problem are hard, and ends arguments
about test sets.
- Results showing steady improvement, and the value
of integrative approaches.
27CASP 6 Categories
- Human intervention versus fully automated
predictions
- Comparative modeling
- A structure exists for a good homolog
- Looking for mutations, bond rotations, etc.
- Fold recognition (Homologs)
- Distant homolog recognition and adaptation
- Looking at loop placement, domain boundaries
- Fold recognition (Analogous)
- No homolog, but similar structures in DB
- Finding the right model structure
- Ab Initio
- No similar structures in DB. Most fundamental
problem.
- Other issues
- Domain boundaries, disordered regions,
residue-residue contacts
28CASP Results
- Fully automated methods now nearly as good as
ones with human intervention
- Consensus methods (looking for agreement among
servers) do best overall, but not by much and not
all the time.
- Consistently the best approach is Rosetta from David
Baker's lab
29CASP performance improving
30Baker best strategy so far
- Two step process
- Generate a good sized collection of plausible
structures and near-miss bad structures
- Requires a good energy function, good
optimization approach
- Quality of decoys (incorrect, but plausible
folds) is important
- Build discriminators to separate correct from
decoy structures.
- Rosetta (Baker lab) and fully automated Robetta.
- Ran away with CASP4, still the best at CASP5 and 6
- Robetta almost as good as Rosetta
- Outstanding at ab initio, competitive at the
rest.
31Rosetta approach
- Integrated method
- I-sites: much finer grained substructures than
secondary structures. A library of all
consistent structures of short polypeptides is
defined (taken from PDB)
- Build initial models by assigning I-sites to new
amino acid sequence (many possibilities)
- Monte Carlo search through assignments of I-sites
to minimize energy function.
- Use of sophisticated global energy function
- Take good scoring structures, and test them on a
decoy detector, which looks for high scoring
but non-native structure patterns.
32I-Sites
- I-sites are a set of sequence patterns that
strongly correlate with protein structure at the
local level.
- Ungapped amino acid sequence motifs
- Length 3-9 (now longer)
- Originally 82 classes (now more)
- Defined by amino acid log odds matrix and phi/psi
angles
- Far more detailed than the 3 state
helix/sheet/other local structure models
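A sketch of how an amino acid log odds matrix can score a sequence segment against a motif. The member sequences are invented, the background is assumed uniform over the 20 amino acids, and pseudocounts are added. None of this is the actual I-sites library or its parameters:

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def log_odds_matrix(members, pseudocount=1.0):
    """One column per motif position: log(observed frequency / background)."""
    background = 1.0 / len(AMINO_ACIDS)  # assumed uniform background
    denominator = len(members) + pseudocount * len(AMINO_ACIDS)
    matrix = []
    for position in range(len(members[0])):
        column = [member[position] for member in members]
        matrix.append({
            aa: math.log(((column.count(aa) + pseudocount) / denominator) / background)
            for aa in AMINO_ACIDS
        })
    return matrix

def motif_score(segment, matrix):
    """Sum of per-position log odds for an ungapped segment of motif length."""
    return sum(column[aa] for aa, column in zip(segment, matrix))

# Hypothetical aligned members of one proline-containing motif.
toy_members = ["LSPEE", "LTPEE", "ISPED", "LSPDE"]
toy_matrix = log_odds_matrix(toy_members)
```

A segment resembling the members scores above zero and an unrelated one scores below. In the full I-site definition each motif also carries the φ/ψ angles of its paradigm structure, which this sketch leaves out.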
33Example I-Site
- Proline containing alpha helix C-cap
[Figure: φ/ψ angles, cartoon and member structures for the motif, with amino acid log odds by motif position]
34How I-sites are defined
- Starting from all sequences in PDB at the time
- Remove sequences with 25% or greater sequence
identity to compensate for oversampling of
certain families
- Cluster all possible subsequences of these
structures of length 3-15.
- For each cluster, define paradigm structure.
- Remove members that are too far away structurally
- Add new members that are structurally similar
- If can't distinguish well (bimodal scores)
between members and non-members, drop the cluster
35I-sites are not unique
- One amino acid subsequence may be compatible with
several I-sites
- I-sites are not defined to be mutually exclusive
over sequence.
- Slightly different starting positions or lengths
may yield quite different (even incompatible)
I-sites for the same sequence region.
- This has biological relevance
- Local predispositions are not determinative or
unique
- Multiple predispositions are more informative
than none.
36I-sites pro and con
- Lots of ad hoc fiddling to get I-site library
- Distance measure on sequence has two free
parameters
- Many different structure distance measures tried
- K-means clustering (K is a free parameter)
- Test for bimodal scoring (two more parameters)
- Occasional subdivision of an I-site that seemed
to have two good structures associated with it
- Corresponds reasonably well to existing
crystallographic concepts (e.g. Type II β turns)
- They are more predictable than H/S/C
37HMMSTR
- I-sites often overlap (sequences of sites
correspond to traditional local structures)
- Basic idea: Hidden Markov Model for sequences of
I-sites
- No in/dels. States specify distributions of
- Amino acids
- Secondary structures
- φ/ψ angles (discretized)
- structural context
38Simple HMMSTR
- Simple example for well known structural motif
- Combination of two I-sites which overlap
- States defined by positions in an I-site
- Alternative paths for different I-sites
39- Whole HMMSTR model
- Each node has a start probability
- Specifies transitions between types of local
structure as well as within them
40Training of HMMSTR
- Many ad hoc approaches based on biological
intuitions
- When to merge overlapping states?
- Dynamic programming to find likely transitions
between I-sites
- Null transition state to connect otherwise
disconnected subtrees.
- Model surgery: adding, splitting and deleting
states.
- Structure predictions by voting rather than
most probable parse.
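One plausible reading of "voting rather than most probable parse" is posterior decoding: instead of taking the single best state path, compute each state's posterior probability at every position with the forward-backward algorithm, then let states sharing a structural label pool their probability. The sketch below uses a tiny invented three-state model, not the HMMSTR topology or its parameters, and omits the scaling needed for long sequences:

```python
def posterior_state_probabilities(observations, start, transition, emission):
    """Forward-backward: P(state at position t | whole observation sequence)."""
    states = list(start)
    n = len(observations)
    forward = [{} for _ in range(n)]
    for s in states:
        forward[0][s] = start[s] * emission[s][observations[0]]
    for t in range(1, n):
        for s in states:
            forward[t][s] = emission[s][observations[t]] * sum(
                forward[t - 1][r] * transition[r][s] for r in states)
    backward = [{} for _ in range(n)]
    for s in states:
        backward[n - 1][s] = 1.0
    for t in range(n - 2, -1, -1):
        for s in states:
            backward[t][s] = sum(
                transition[s][r] * emission[r][observations[t + 1]] * backward[t + 1][r]
                for r in states)
    likelihood = sum(forward[n - 1][s] for s in states)
    return [{s: forward[t][s] * backward[t][s] / likelihood for s in states}
            for t in range(n)]

def vote_labels(posteriors, label_of):
    """At each position, states with the same label pool their posterior mass."""
    result = []
    for column in posteriors:
        votes = {}
        for state, probability in column.items():
            votes[label_of[state]] = votes.get(label_of[state], 0.0) + probability
        result.append(max(votes, key=votes.get))
    return "".join(result)

# Hypothetical model: two helix states and one coil state over a two-letter alphabet.
toy_start = {"h1": 0.4, "h2": 0.1, "c": 0.5}
toy_transition = {
    "h1": {"h1": 0.1, "h2": 0.8, "c": 0.1},
    "h2": {"h1": 0.6, "h2": 0.2, "c": 0.2},
    "c": {"h1": 0.3, "h2": 0.0, "c": 0.7},
}
toy_emission = {
    "h1": {"a": 0.8, "g": 0.2},
    "h2": {"a": 0.7, "g": 0.3},
    "c": {"a": 0.2, "g": 0.8},
}
toy_labels = {"h1": "H", "h2": "H", "c": "C"}

toy_posteriors = posterior_state_probabilities("aaggga", toy_start, toy_transition, toy_emission)
toy_prediction = vote_labels(toy_posteriors, toy_labels)
```

Voting can choose a label that many moderately likely paths agree on, even when the single most probable path says otherwise.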
41Beating HMMSTR
- OK, but not great results in predictive accuracy.
- Too many alternative paths through the model, and
difficulty choosing between them on the basis of
sequence alone.
- Only local information; no global measures used.
- Rosetta adds global information to I-site
assignments and gets a big improvement
42Rosetta prediction method
- Define global scoring function that estimates
probability of a structure given a sequence
- Generate version of I-sites with fixed length
subsequences (9 amino acids)
- Calculate P(I-site | sequence) for all sequences
and I-sites
- Generate structures by Monte Carlo sampling of
assignments of fixed size I-sites to subsequences
- End up with ensemble of plausible structures
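A toy sketch of Monte Carlo search over fragment assignments, using the Metropolis acceptance rule. It is not the Rosetta implementation. The energy is an invented stand-in that adds a per-position cost (playing the role of a local fit such as -log P(I-site | sequence)) to a penalty whenever neighboring positions pick different fragment types (playing the role of a global term). All names and numbers are hypothetical:

```python
import math
import random

def toy_energy(assignment, local_cost, clash_penalty=2.0):
    """Local fit per position plus a crude global consistency penalty."""
    energy = sum(local_cost[pos][choice] for pos, choice in enumerate(assignment))
    energy += clash_penalty * sum(
        1 for a, b in zip(assignment, assignment[1:]) if a != b)
    return energy

def monte_carlo_search(local_cost, steps=2000, temperature=1.0, seed=0):
    """Metropolis sampling: always accept downhill moves, sometimes accept uphill ones."""
    rng = random.Random(seed)
    n_choices = len(local_cost[0])
    current = [rng.randrange(n_choices) for _ in local_cost]
    current_energy = toy_energy(current, local_cost)
    best, best_energy = list(current), current_energy
    for _ in range(steps):
        candidate = list(current)
        # Propose swapping the fragment at one randomly chosen position.
        candidate[rng.randrange(len(candidate))] = rng.randrange(n_choices)
        candidate_energy = toy_energy(candidate, local_cost)
        delta = candidate_energy - current_energy
        if delta <= 0 or rng.random() < math.exp(-delta / temperature):
            current, current_energy = candidate, candidate_energy
            if current_energy < best_energy:
                best, best_energy = list(current), current_energy
    return best, best_energy

# Hypothetical costs: 4 sequence positions, 3 candidate fragments each.
toy_costs = [[0.0, 1.0, 3.0], [2.0, 0.5, 1.0], [0.2, 0.1, 4.0], [1.0, 1.0, 0.0]]
toy_best, toy_best_energy = monte_carlo_search(toy_costs)
```

Running the search from many seeds and keeping all low-energy results, rather than only the single best, gives the ensemble of plausible structures mentioned above.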
43Rosetta Scoring Function
- Global scoring function issues
- Distinguish native-like structures from non-native ones.
Generation methods unlikely to produce exact
native structure.
- Decoy testing. Create many structures that are
plausible and not too far from native fold, and
try to distinguish these
- Bayesian approach
- Sequence dependent and sequence independent
evaluation of predicted structure.
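One common way to write a Bayesian score of this kind, given here as a reading of the bullet points rather than the exact function used, is Bayes' rule applied to structure and sequence:

```latex
P(\text{structure} \mid \text{sequence})
  = \frac{P(\text{sequence} \mid \text{structure}) \, P(\text{structure})}
         {P(\text{sequence})}
  \;\propto\;
  \underbrace{P(\text{sequence} \mid \text{structure})}_{\text{sequence dependent}}
  \;\times\;
  \underbrace{P(\text{structure})}_{\text{sequence independent}}
```

Since P(sequence) is the same for every candidate structure of a given protein, it can be dropped when ranking candidates. The sequence independent term rewards generally protein-like structures (for example compactness), while the sequence dependent term asks how well this particular sequence fits the structure.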
44Score Decomposition
45Good Performance
- An ab initio target
- Red: correct; grey: incorrect
- Missed a sheet
- Good overall topology
46And bad
- Hardest structure for all prediction methods
- Central sheet region contains loops and two small
helices
- Single hydrogen bond extends and alters two
substructures