Protein Structure Prediction

Slides: 47

Provided by: compbi

1
Protein Structure Prediction
2
Protein structure

Most proteins will fold spontaneously in water,
so amino acid sequence alone should be enough to
determine protein structure
However, the physics are daunting
20,000 protein atoms, plus equal amounts of
water
Many non-local interactions
Can take seconds (most chemical reactions take
place 10^12, i.e. 1,000,000,000,000x, faster)
Empirical determinations of protein structure are
advancing rapidly.

3
Protein review

Proteins are polymers of amino acids linked by
peptide bonds.
Properties of proteins are determined by both the
particular sequence of amino acids and by the
conformation (fold) of the protein.
Flexibility in the bonds around Cα
φ (phi)
ψ (psi)
sidechain

4
Protein Structure Levels

Protein structure is described in four levels
Primary structure: amino acid sequence
Secondary structure: local (in sequence) ordering
into
(α) Helices: compressed, corkscrew structures
(β) Strands: extended, nearly straight structures
(β) Sheets: paired strands, reinforced by hydrogen
bonds
parallel (same direction) or antiparallel sheets
Coils, turns and loops: changes in direction
Tertiary structure: global ordering (all
angles/atoms)
Quaternary structure: multiple, disconnected
amino acid chains interacting to form a larger
structure

5
Protein structure cartoons
6
Protein Structure Representations

Different visualizations show various aspects
of structure

7
Protein Folding

Proteins are created linearly and then assume
their tertiary structure by folding.
Exact mechanism is still unknown
Mechanistic simulations can be illuminating
Proteins assume the lowest energy structure
Or sometimes an ensemble of low energy
structures.
Hydrophobic collapse drives process
Local (secondary) structure proclivities
Internal stabilizers
Hydrogen bonds, disulphide bonds, salt bridges.

8
Empirical structure determination

Two major experimental methods for determining
protein structure
X-ray Crystallography
Requires growing a crystal of the protein
(impossible for some, never easy)
Diffraction pattern can be inverse-Fourier
transformed to characterize electron densities
(Phase problem)
Nuclear Magnetic Resonance (NMR) spectroscopy
Provides distance constraints, but can be hard to
find a corresponding structure
Works only for relatively small proteins (so far)

9
X-ray crystallography

X-rays, since wavelength is near the distance
between bonded carbon atoms
Maps electron density, not atoms directly
Crystal to get a lot of spatially aligned atoms
Have to invert Fourier transform to get
structure, but only have amplitudes, not phases
Guess! or perturb...

10
NMR structure determination

NMR can detect certain features of hydrogen
atoms
NOESY measures distances between non-bonded H's
within about 5 Å
COSY and TOCSY describe relations through bonds
Combination of distance and angle constraints,
plus knowledge of covalent bonds (amino acid
sequence) determines a unique (sometimes)
structure.
Overlapping measurements limit size (about 120 AA)

11
Why predict protein structure?

Neither crystallography nor NMR can keep pace
with genome sequencing efforts
Only 10566 (3641 with <90% identity) human
proteins in PDB, although growing fast
Computer scientists love this problem
Understandable with minimal biology
Seems like a good discrimination task
Understand the mechanisms of folding (?)
First computational Nobel prize?

12
Kinds of Structure Prediction

Comparative modelling
Homolog has known structure, which is adjusted
for sequence differences
Energy minimization and molecular dynamics
Fold recognition
Proteins fall into broad fold classes. Models of
folds that recognize compatible sequences.
Inverse problem
Predict more than fold class?
Ab initio or new fold prediction
No homologs, not recognized by any fold model

13
Ab Initio predictions

Three broad approaches
Molecular dynamics, energy minimization
approaches
Empirical black box (induce discriminators)
Mechanistic (follow the actual folding path)
approaches. Hybrid between energy and empirical
methods.
Secondary structure predictions
Not tremendously useful nor accurate, but
simplest.
Can play a role in tertiary predictors
Tertiary structure predictions
Best involve a complex mixture of approaches

14
Energy Minimization

Many forces act on a protein
Hydrophobic: inside of protein wants to avoid
water
Packing: atoms can't be too close, nor too far
away
Bond angle/length constraints
Long distance, e.g.
Electrostatics, hydrogen bonds
Disulphide bonds
Salt bridges
Can calculate all of these forces, and minimize
Intractable in general case, but can be useful

15
Empirical models

Pose structure prediction as induction task.
What are the inputs and outputs?
Where do we get enough training data?
Which induction methods work best?
Long history in bioinformatics

16
Initial approaches to secondary structure
prediction

Input is a "sliding window" of immediately
surrounding sequence assumed to determine
structure (no long distance interactions)
...mnnstnssnsgla...
H
Output is one of three possible secondary
structure states: helix, strand, other
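
A minimal sketch of the sliding-window input described above. The function name, the padding character and the window half-width are assumptions made for illustration; the slide fixes none of them.

```python
def sliding_windows(sequence, half_width=6, pad="-"):
    """Yield (center_index, window) pairs, one window per residue.

    Each window holds the residue plus `half_width` neighbours on each side;
    positions past either end of the chain are filled with `pad`.
    """
    padded = pad * half_width + sequence + pad * half_width
    size = 2 * half_width + 1
    for i in range(len(sequence)):
        yield i, padded[i:i + size]


if __name__ == "__main__":
    seq = "mnnstnssnsgla"  # fragment shown on the slide
    for i, window in sliding_windows(seq, half_width=3):
        print(i, seq[i], window)
```

Each window would be the input to a predictor whose output is the state (helix, strand, other) of the central residue only.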

17
Why might this work?

There are local propensities to secondary
structural classes (largely hydropathy)
Helices: no prolines, sometimes amphipathic (show
alternating hydropathy with period 3.6 residues)
Strands: either alternating hydropathy or ends
hydrophilic and center hydrophobic
Neither: small, polar, flexible residues.
Prolines.
Minimum lengths for secondary structures (helices
longer than strands)
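
One way to make the 3.6-residue periodicity concrete is a hydrophobic moment: place each residue's hydropathy on a wheel at 100 degrees per residue (360 / 3.6) and measure how lopsided the wheel is. This is a sketch under assumptions: the hydrophobic moment and the Kyte-Doolittle scale are standard published tools brought in for illustration, not methods named on the slide, and 180 degrees per residue is used here as an idealised strand.

```python
import math

# Kyte-Doolittle hydropathy values (choice of scale is this sketch's assumption).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydrophobic_moment(sequence, degrees_per_residue=100.0):
    """Length of the vector sum of hydropathies placed around a helical wheel."""
    x = y = 0.0
    for n, residue in enumerate(sequence.upper()):
        h = KYTE_DOOLITTLE[residue]
        angle = math.radians(n * degrees_per_residue)
        x += h * math.cos(angle)
        y += h * math.sin(angle)
    return math.hypot(x, y)


if __name__ == "__main__":
    print(hydrophobic_moment("LKKLLKKLLKLLKKLLKK"))   # helix-like pattern
    print(hydrophobic_moment("LK" * 9))               # strict alternation, as a helix
    print(hydrophobic_moment("LK" * 9, 180.0))        # strict alternation, as a strand
```

A sequence with one hydrophobic face scores high at 100 degrees, while strict alternation scores high only at the strand-like angle, which is the local signal a window-based predictor can pick up.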

18
Early methods

Chou-Fasman method looked at frequency of each
amino acid in window
GOR defined an information measure
$I(S;R) = \log \frac{P(S \mid R)}{P(S)}$, where S is secondary structure
and R is amino acid. Define information gain as
$I(S;R) - I(\bar{S};R)$, where $\bar{S}$ is any state other than S, and
predict state with highest gain.
How to combine info gain for each element of
sliding window? Independently (just add) or by
pairs
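
A hedged sketch of a GOR-style predictor that combines window positions independently (the "just add" option). The function names, the H/E/C labels, the pseudocounts and the default window size are assumptions of this sketch, and it is not the published GOR parameterisation.

```python
import math
from collections import Counter

STATES = "HEC"  # helix, strand, other; the letters are this sketch's convention


def train_gor(examples, half_width=8, pseudocount=1.0):
    """Return gain[(state, offset, residue)] = I(S;R) - I(not S;R).

    `examples` is a list of (sequence, structure) strings of equal length.
    """
    state_counts, joint, seen = Counter(), Counter(), Counter()
    for seq, ss in examples:
        for i, state in enumerate(ss):
            state_counts[state] += 1
            for d in range(-half_width, half_width + 1):
                if 0 <= i + d < len(seq):
                    joint[(state, d, seq[i + d])] += 1
                    seen[(d, seq[i + d])] += 1
    total = sum(state_counts.values())
    k = len(STATES)
    gain = {}
    for (d, residue), n in seen.items():
        for state in STATES:
            p_s = (state_counts[state] + pseudocount) / (total + k * pseudocount)
            p_s_r = (joint[(state, d, residue)] + pseudocount) / (n + k * pseudocount)
            gain[(state, d, residue)] = (
                math.log(p_s_r / p_s) - math.log((1 - p_s_r) / (1 - p_s))
            )
    return gain


def predict_gor(seq, gain, half_width=8):
    """Sum the gains over the window and pick the best state per residue."""
    predicted = []
    for i in range(len(seq)):
        totals = {
            state: sum(
                gain.get((state, d, seq[i + d]), 0.0)
                for d in range(-half_width, half_width + 1)
                if 0 <= i + d < len(seq)
            )
            for state in STATES
        }
        predicted.append(max(STATES, key=totals.get))
    return "".join(predicted)
```

Usage would be `gain = train_gor(labelled_chains)` followed by `predict_gor(new_sequence, gain)`, with the same `half_width` in both calls. Combining by pairs would replace the single-residue table with one indexed by two window positions.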

19
How well did they work?

Not very: roughly 50-55% accurate on a residue by
residue basis.
Random prediction that obeyed the observed
distribution of helix/strand/other would be 40%
Different ways to calculate "correctness"
Needs to be unbiased (especially wrt homology)!
Getting number of helices and strands or order
right is harder than just counting residue by
residue (like the difference between nucleotide
and exon level gene finding).
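
A small sketch of the residue-by-residue score and of the random baseline mentioned above. The function names are hypothetical, and the 30/20/50 helix/strand/other split in the example is an assumed distribution chosen only to show how a figure near 40% can arise.

```python
def q3(predicted, observed):
    """Fraction of residues whose predicted state matches the observed state."""
    if len(predicted) != len(observed):
        raise ValueError("predicted and observed must have the same length")
    matches = sum(p == o for p, o in zip(predicted, observed))
    return matches / len(observed)


def random_baseline(observed):
    """Expected accuracy of guessing states at random with the observed frequencies.

    A guess of state s is made with probability p_s and is right with
    probability p_s, so the expected accuracy is the sum of p_s squared.
    """
    n = len(observed)
    return sum((observed.count(s) / n) ** 2 for s in set(observed))


if __name__ == "__main__":
    observed = "HHHEECCCCC"          # assumed 30% helix, 20% strand, 50% other
    print(q3("HHEEECCCCH", observed))
    print(random_baseline(observed))
```

Segment-level measures (number and order of helices and strands) need a different score; this one only counts residues.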

20
Fancier induction techniques

Same setup as Chou-Fasman or GOR
Sliding window across amino acid sequence as
input
Three class output (helix/sheet/other)
Various different induction techniques over same
data, give modest improvements
LDA/QDA
Decision trees
Neural networks
Best results from neural networks (~62%)

21
Add multiple sequence alignment information

This is helpful in principle
insertions/deletions more likely to be coil/turn
conserved hydropathy more important for
prediction than non-conserved.
GOR method improves 8-9 points (to about 64%
correct residue by residue).
Similar improvement for NNs (to 68%)
SVMs gain a bit more, to about 70%

22
But the information isn't there

Prediction quality has not improved much even
with huge growth of training data.
Secondary structure is not completely determined
by local forces
Long distance interactions do not appear in
sliding window
Empirical studies show same amino acid sequences
can assume multiple secondary structures.

23
Mechanistic models

Move from purely empirical to include some
knowledge of folding mechanisms
Compact nature of conformations
Hydrophobic packing
Sequences of secondary structures
Secondary structure predispositions
Heuristic global energy minimization

24
Hydrophobic packing models

Dill's HP model
Two classes of amino acids, hydrophobic (H) and
polar (P)
Lattice model for position of (point) amino
acids.
Thread chain of H's and P's through lattice to
maximize number of H-H contacts
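
A minimal sketch of the quantity the HP model maximizes, on a 2D square lattice. The move encoding (U/D/L/R) and the function names are assumptions of this sketch; it scores one given fold and does not search for the best one.

```python
MOVES = {"U": (0, 1), "D": (0, -1), "L": (-1, 0), "R": (1, 0)}


def fold_coordinates(moves):
    """Lattice coordinates of each residue, starting at the origin."""
    x, y = 0, 0
    coords = [(x, y)]
    for move in moves:
        dx, dy = MOVES[move]
        x, y = x + dx, y + dy
        coords.append((x, y))
    return coords


def hh_contacts(hp_sequence, moves):
    """Count H-H pairs that are lattice neighbours but not chain neighbours."""
    coords = fold_coordinates(moves)
    if len(coords) != len(hp_sequence):
        raise ValueError("need exactly len(hp_sequence) - 1 moves")
    if len(set(coords)) != len(coords):
        raise ValueError("fold is not self-avoiding")
    index_at = {c: i for i, c in enumerate(coords)}
    contacts = 0
    for i, (x, y) in enumerate(coords):
        if hp_sequence[i] != "H":
            continue
        # Look only right and up so that each pair is counted once.
        for dx, dy in ((1, 0), (0, 1)):
            j = index_at.get((x + dx, y + dy))
            if j is not None and hp_sequence[j] == "H" and abs(i - j) > 1:
                contacts += 1
    return contacts


if __name__ == "__main__":
    print(hh_contacts("HPPH", "RUL"))  # chain bent into a square
    print(hh_contacts("HPPH", "RRR"))  # fully extended chain
```

Threading the chain to maximize this count means searching over all self-avoiding move strings.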

3D
2D
25
But...

Even the 2D HP packing problem (which is easier
than the 3D one) turns out to be NP-complete!
Good approximation results exist.
3/8 of optimal approximation (3D)
In triangular lattice, algorithm for >60% of
optimal packing
Other interesting results in the model, e.g.
Which sequences have a single optimal fold?

26
CASP changed the landscape

Critical Assessment of Structure Prediction
competition. Even numbered years since 1994
Solved, but unpublished structures are posted in
May, predictions due in September, evaluations in
December
Various categories
Relation to existing structures, ab initio,
homology, fold, etc.
Partial vs. Fully automated approaches
Produces lots of information about what aspects
of the problems are hard, and ends arguments
about test sets.
Results showing steady improvement, and the value
of integrative approaches.

27
CASP 6 Categories

Human intervention versus fully automated
predictions
Comparative modeling
A structure exists for a good homolog
Looking for mutations, bond rotations, etc.
Fold recognition (Homologs)
Distant homolog recognition and adaptation
Looking at loop placement, domain boundaries
Fold recognition (Analogous)
No homolog, but similar structures in DB
Finding the right model structure
Ab Initio
No similar structures in DB. Most fundamental
problem.
Other issues
Domain boundaries, disordered regions,
residue-residue contacts

28
CASP Results

Fully automated methods now nearly as good as
ones with human intervention
Consensus methods (looking for agreement among
servers) do best overall, but not by much and not
all the time.
Consistent best approach is Rosetta from David
Baker's lab

29
CASP performance improving
30
Baker best strategy so far

Two step process
Generate a good sized collection of plausible
structures and near-miss bad structures
Requires a good energy function, good
optimization approach
Quality of decoy (incorrect, but plausible
folds) is important
Build discriminators to separate correct from
decoy structures.
Rosetta (Baker lab) and fully automated Robetta.
Ran away with CASP4, still the best at CASP5 and 6
Robetta almost as good as Rosetta
Outstanding at ab initio, competitive at the
rest.

31
Rosetta approach

Integrated method
I-Sites: much finer grained substructures than
secondary structures. A library of all
consistent structures of short polypeptides is
defined (taken from PDB)
Build initial models by assigning I-sites to new
amino acid sequence (many possibilities)
Monte Carlo search through assignments of I-Sites
to minimize energy function.
Use of sophisticated global energy function
Take good scoring structures, and test them on a
decoy detector, which looks for high scoring
but non-native structure patterns.

32
I-Sites

I-sites are a set of sequence patterns that
strongly correlate with protein structure at the
local level.
Ungapped amino acid sequence motifs
Length 3-9 (now longer)
Originally 82 classes (now more)
Defined by amino acid log odds matrix and phi/psi
angles
Far more detailed than the 3 state
helix/sheet/other local structure models

33
Example I-Site

Proline containing alpha helix C-cap

[Figure labels: φ/ψ angles, cartoon, AA log odds, member structures, motif position]
34
How I-sites are defined

Starting from all sequences in PDB at the time
Remove sequences with 25% or greater sequence
identity to compensate for oversampling of
certain families
Cluster all possible subsequences of these
structures of length 3-15.
For each cluster, define paradigm structure.
Remove members that are too far away structurally
Add new members that are structurally similar
If can't distinguish well (bimodal scores)
between members and non-members, drop the cluster

35
I-sites are not unique

One amino acid subsequence may be compatible with
several I-sites
I-sites are not defined to be mutually exclusive
over sequence.
Slightly different starting positions or lengths
may yield quite different (even incompatible)
I-sites for the same sequence region.
This has biological relevance
Local predispositions are not determinative or
unique
Multiple predispositions are more informative
than none.

36
I-sites pro and con

Lots of ad hoc fiddling to get I-site library
Distance measure on sequence has two free
parameters
Many different structure distance measures tried
K-means clustering (K is free parameter)
Test for bimodal scoring (two more parameters)
Occasional subdivision of an I-Site that seemed
to have two good structures associated with it
Corresponds reasonably well to existing
crystallographic concepts (e.g. Type II β turns)
They are more predictable than H/S/C

37
HMMSTR

I-sites often overlap (sequences of sites
correspond to traditional local structures)
Basic idea: Hidden Markov Model for sequences of
I-sites
No in/dels. States specify distributions of
Amino acids
secondary structures
φ/ψ angles (discretized)
structural context

38
Simple HMMSTR

Simple example for well known structural motif
Combination of two I-sites which overlap
States defined by positions in an I-site
Alternative paths for different I-sites

39
Whole HMMSTR model

Each node has start probability
Specifies transitions between types of local
structure as well as within them

40
Training of HMMSTR

Many ad hoc approaches based on biological
intuitions
When to merge overlapping states?
Dynamic programming to find likely transitions
between I-sites
Null transition state to connect otherwise
disconnected subtrees.
Model surgery: adding, splitting and deleting
states.
Structure predictions by voting rather than
most probable parse.
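
A sketch of what "voting rather than most probable parse" can look like for a generic discrete HMM: compute each state's posterior probability at every position (forward-backward), then let states that share a structural label pool their probability. The function names and the tiny three-state model are assumptions for illustration and are not the HMMSTR model or its code; no scaling is applied, so this version is only suitable for short sequences.

```python
def forward_backward(obs, start, trans, emit):
    """Posterior probability of each state at each position.

    start[k], trans[j][k] and emit[k][symbol] are plain probabilities.
    """
    n, k = len(obs), len(start)
    fwd = [[0.0] * k for _ in range(n)]
    bwd = [[1.0] * k for _ in range(n)]
    for s in range(k):
        fwd[0][s] = start[s] * emit[s][obs[0]]
    for t in range(1, n):
        for s in range(k):
            incoming = sum(fwd[t - 1][p] * trans[p][s] for p in range(k))
            fwd[t][s] = emit[s][obs[t]] * incoming
    for t in range(n - 2, -1, -1):
        for s in range(k):
            bwd[t][s] = sum(
                trans[s][q] * emit[q][obs[t + 1]] * bwd[t + 1][q] for q in range(k)
            )
    likelihood = sum(fwd[n - 1])
    return [[fwd[t][s] * bwd[t][s] / likelihood for s in range(k)] for t in range(n)]


def vote_labels(posteriors, state_labels):
    """Per position, sum posteriors of states sharing a label; pick the top label."""
    prediction = []
    for gamma in posteriors:
        votes = {}
        for s, p in enumerate(gamma):
            votes[state_labels[s]] = votes.get(state_labels[s], 0.0) + p
        prediction.append(max(votes, key=votes.get))
    return "".join(prediction)


if __name__ == "__main__":
    third = 1.0 / 3.0
    start = [third] * 3
    trans = [[third] * 3 for _ in range(3)]
    emit = [
        {"A": 0.6, "G": 0.1, "X": 0.3},   # helix-like state 1
        {"A": 0.5, "G": 0.2, "X": 0.3},   # helix-like state 2
        {"A": 0.1, "G": 0.5, "X": 0.4},   # coil-like state
    ]
    print(vote_labels(forward_backward("AGX", start, trans, emit), "HHC"))
```

In the toy model the symbol X is most likely under the single coil-like state, but the two helix-like states together outvote it, which is the difference between voting and following one best path.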

41
Beating HMMSTR

OK, but not great results in predictive accuracy.
Too many alternative paths through the model, and
difficulty choosing between them on the basis of
sequence alone.
Only local information; no global measures used.
Rosetta adds global information to I-site
assignments and gets a big improvement

42
Rosetta prediction method

Define global scoring function that estimates
probability of a structure given a sequence
Generate version of I-sites with fixed length
subsequences (9 amino acids)
Calculate P(I-site | sequence) for all sequences
and I-sites
Generate structures by Monte Carlo sampling of
assignments of fixed size I-sites to subsequences
End up with ensemble of plausible structures
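
A toy sketch of Monte Carlo sampling over fragment assignments, in the spirit of the steps above. Everything named here is hypothetical: `fragment_monte_carlo`, the `library` layout (start position mapped to candidate fragments) and the caller-supplied `score` function are assumptions of this sketch, and the Metropolis acceptance rule is a standard choice rather than something the slide specifies. It is not Rosetta's code or API.

```python
import math
import random


def fragment_monte_carlo(initial, library, score, steps=2000, temperature=1.0, seed=0):
    """Metropolis search over fragment insertions; lower score is better.

    `initial` is a list with one conformational state per residue.
    `library[start]` lists candidate fragments (lists of states) that may be
    spliced in at `start`; each must fit inside the chain.
    Returns the best conformation seen and its score.
    """
    rng = random.Random(seed)
    current = list(initial)
    current_score = score(current)
    best, best_score = list(current), current_score
    positions = sorted(library)
    for _ in range(steps):
        start = rng.choice(positions)
        fragment = rng.choice(library[start])
        candidate = list(current)
        candidate[start:start + len(fragment)] = fragment
        candidate_score = score(candidate)
        delta = candidate_score - current_score
        # Always accept improvements; accept worse moves with Boltzmann probability.
        if delta <= 0 or rng.random() < math.exp(-delta / temperature):
            current, current_score = candidate, candidate_score
            if current_score < best_score:
                best, best_score = list(current), current_score
    return best, best_score


if __name__ == "__main__":
    library = {0: [list("HHH"), list("CCC")], 3: [list("HHH"), list("EEE")]}
    toy_score = lambda states: sum(1 for s in states if s != "H")
    print(fragment_monte_carlo(["C"] * 6, library, toy_score, steps=200))
```

Running it from many seeds and keeping each run's result gives an ensemble of candidate structures rather than a single answer.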

43
Rosetta Scoring Function

Global scoring function issues
Distinguish native-like structures from not.
Generation methods unlikely to produce exact
native structure.
Decoy testing. Create many structures that are
plausible and not too far from native fold, and
try to distinguish these
Bayesian approach
Sequence dependent and sequence independent
evaluation of predicted structure.
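
One standard way to read the Bayesian split into sequence dependent and sequence independent parts is Bayes' rule applied to structure and sequence. This is offered as a sketch of the general idea; the slide does not give the actual terms of the Rosetta score.

```latex
P(\text{structure} \mid \text{sequence})
  = \frac{P(\text{sequence} \mid \text{structure}) \; P(\text{structure})}
         {P(\text{sequence})}
```

Here $P(\text{sequence} \mid \text{structure})$ is the sequence dependent term, $P(\text{structure})$ is the sequence independent term, and $P(\text{sequence})$ is the same for every candidate structure of a given target, so it can be ignored when ranking candidates.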

44
Score Decomposition
45
Good Performance