Title: Predicting Structural Features
1. Predicting Structural Features
2. Structural Features
- Phosphorylation sites
- Transmembrane helices
- Protein flexibility
3. Accuracy Measures Revisited
- Level of assessment
- Individual residues
- Complete helix or strand
4. Residue-Level Measures
- Q3
- Percentage of residues predicted correctly
- If one state (e.g., Coil) is very common (e.g., 50%), blind guessing can give a large Q3!
- Matthews correlation coefficient (see the sketch below)
- C = (TP×TN − FP×FN) / √((TP+FP)(TP+FN)(TN+FP)(TN+FN))
- Defined for each state
- More balanced than Q3; ranges from −1 to 1
- Random prediction gives C ≈ 0
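To make the definitions concrete, here is a minimal Python sketch of Q3 and the per-state Matthews coefficient. The function names and example strings are illustrative, not from the slides.

```python
import math

def q3(predicted, observed):
    """Fraction of residues whose predicted H/E/C state matches the observed one."""
    return sum(p == o for p, o in zip(predicted, observed)) / len(observed)

def matthews(predicted, observed, state):
    """Matthews correlation coefficient for one state (e.g. 'H')."""
    pairs = list(zip(predicted, observed))
    tp = sum(p == state and o == state for p, o in pairs)
    tn = sum(p != state and o != state for p, o in pairs)
    fp = sum(p == state and o != state for p, o in pairs)
    fn = sum(p != state and o == state for p, o in pairs)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0  # C = 0 in degenerate cases

pred = "HHHHCCCCEEEECCC"
obs  = "HHHCCCCCEEECCCC"
print(q3(pred, obs), matthews(pred, obs, "H"))
```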
5. Structural Element-Level Measures
- SOV (segment overlap)
- Based on the overlap of predicted segments of helix, strand, etc. with the observed segments of the same type (a simplified sketch follows below)
- The N-score
- Specialized for transmembrane protein predictors
- Should TMHMM2 be changed? Should your model?
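A simplified sketch of the idea behind segment-level overlap scoring. This is not the published SOV formula (which adds a tolerance term and normalization); it only shows the overlap intuition in code.

```python
def segments(states, state):
    """Yield (start, end) runs of `state` in a state string (end exclusive)."""
    runs, start = [], None
    for i, s in enumerate(states):
        if s == state and start is None:
            start = i
        elif s != state and start is not None:
            runs.append((start, i))
            start = None
    if start is not None:
        runs.append((start, len(states)))
    return runs

def overlap_score(predicted, observed, state="H"):
    """For each observed segment, credit the best-overlapping predicted segment."""
    total, score = 0, 0.0
    for o1, o2 in segments(observed, state):
        total += o2 - o1
        best = 0
        for p1, p2 in segments(predicted, state):
            best = max(best, min(o2, p2) - max(o1, p1))
        score += max(best, 0)
    return score / total if total else 0.0
```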
6. Predicting Helices
- Residue propensities
- A score, for a given structure class, for each residue type a
- P(H|a) is proportional to P(a|H) / P(a)
- Why? Bayes' rule is your friend!
- P(H|a) = P(a|H) P(H) / P(a)
- P(H) doesn't depend on a, so
- P(H|a) is proportional to P(a|H) / P(a)
Can this be used to see how to group helix states? (A sketch of estimating propensities from counts follows below.)
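A sketch of how the propensity P(a|H)/P(a) could be estimated from a labeled training set; the parallel `sequences`/`labels` data structures are hypothetical stand-ins.

```python
import math
from collections import Counter

def helix_propensities(sequences, labels):
    """sequences/labels: parallel strings; labels use H/E/C states."""
    all_counts, helix_counts = Counter(), Counter()
    for seq, lab in zip(sequences, labels):
        for aa, state in zip(seq, lab):
            all_counts[aa] += 1
            if state == "H":
                helix_counts[aa] += 1
    n_all = sum(all_counts.values())
    n_helix = sum(helix_counts.values())
    # log propensity: log P(a|H) - log P(a); positive values favor helix
    return {aa: math.log((helix_counts[aa] / n_helix) / (all_counts[aa] / n_all))
            for aa in all_counts if helix_counts[aa] > 0}
```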
7. Identical Short Segments Rarely Fold Differently
- Local sequence is highly important to secondary structure.
- But this sequence occurs in two proteins and takes very different forms: KGVVPQLVK
- There is significant information about structure in local sequence.
8. I-sites Sequence Database
- About 250 short segments (3-19 residues) that show strong correlation between sequence and structure
- Example shows
- phi and psi angles, log-odds matrix
- superimposed backbones
- representative structure
9. Nearest Neighbor Prediction Methods
- Predict secondary structure based on
- Local alignments of the query sequence to a database of sequences of known structure
- Alignment score functions are often special-purpose, and may include helix/sheet/coil propensity information
- Homologous sequences are often included in the database
- Prediction is based on weighted votes of nearest neighbors (usually only the central residue of the alignment is predicted); see the voting sketch below
- 73.5% accuracy (Q3)
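A toy version of the voting step, assuming ungapped windows of equal length and a caller-supplied pair-score function. Real methods use local alignments and the special-purpose score functions described above; the score-weighted vote here is just one simple choice.

```python
def predict_central(query_window, database, score_pair, k=25):
    """database: (window, central_state) pairs; windows same length as query."""
    scored = sorted(
        ((sum(score_pair(a, b) for a, b in zip(query_window, w)), state)
         for w, state in database),
        reverse=True)
    votes = {}
    for s, state in scored[:k]:                          # k nearest neighbors
        votes[state] = votes.get(state, 0.0) + max(s, 0.0)  # score-weighted vote
    return max(votes, key=votes.get)                     # predicted central state
```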
10. A Different Application: Prediction of Misfolding
- Diseases such as Alzheimer's involve protein misfolding.
- Usually, the misfolded region ends up as beta-strands.
- How could we use secondary structure information to predict which proteins will potentially misfold?
11. HβP: Hidden Beta Propensity
- Key idea: tertiary contacts (TC)
- TC is the number of contacts a residue has with others at least 4 residues away (a counting sketch follows below)
- Alpha helices tend to be in regions of HIGH TC
- Beta strands tend to be in regions of LOW TC
- Look for query residues whose nearest neighbors are strange with respect to TC and alpha/beta state
- Low TC regions with lots of alphas
- High TC regions with lots of betas
- Performance results?
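A sketch of the TC count itself, computed from C-alpha coordinates. The 8 Å distance cutoff is an assumed value for illustration, not one taken from the HβP work.

```python
import math

def tertiary_contacts(ca_coords, cutoff=8.0, min_separation=4):
    """ca_coords: list of (x, y, z) C-alpha positions, one per residue."""
    n = len(ca_coords)
    tc = [0] * n
    for i in range(n):
        for j in range(i + min_separation, n):   # at least 4 residues apart
            if math.dist(ca_coords[i], ca_coords[j]) <= cutoff:
                tc[i] += 1
                tc[j] += 1
    return tc
```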
12. Neural Nets
- Each node computes a simple function of its inputs.
- The weighted sum of the inputs is added to a bias term and squashed:
- I_j = Σ_i w_ji x_i + b_j
- o_j = σ(I_j)
- The output, o_j, is then propagated to nodes in the next layer (see the sketch below).
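The node computation in a few lines of Python; σ here is the logistic function, a common choice for the "squashing" step.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def node_output(inputs, weights, bias):
    total = sum(w * x for w, x in zip(weights, inputs)) + bias  # I_j
    return sigmoid(total)                                       # o_j = sigma(I_j)
```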
13. Training Neural Nets
- Back-propagation
- Optimizes the weights and bias terms
- Minimizes the error function (difference between predicted and observed)
- RMS
- Relative entropy
- Iterative process (a single-node sketch follows below)
- Final weights shown for a secondary structure NN alpha helix output layer.
- Over-fitting can be reduced by training for fewer iterations
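A minimal sketch of gradient-descent training for a single sigmoid node minimizing squared error; full back-propagation applies the same chain rule layer by layer. The learning rate and iteration count are illustrative, and stopping after few iterations (early stopping) is the over-fitting control the slide mentions.

```python
import math

def forward(inputs, weights, bias):
    total = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1.0 / (1.0 + math.exp(-total))            # sigmoid output

def train_node(samples, weights, bias, rate=0.1, iterations=100):
    """samples: list of (inputs, target) pairs; fewer iterations = less over-fitting."""
    for _ in range(iterations):
        for inputs, target in samples:
            out = forward(inputs, weights, bias)
            delta = (out - target) * out * (1.0 - out)   # chain rule at the node
            for i, x in enumerate(inputs):
                weights[i] -= rate * delta * x
            bias -= rate * delta
    return weights, bias
```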
14. Adaptive Encoding and Weight Sharing
- Orthogonal encoding
- Each residue feeds three hidden nodes
- The weights for all red nodes are tied together
- Each group of three nodes learns the same encoding of the 20 amino acids (sketched below)
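A sketch of the shared encoding: every window position converts its residue to a 20-dimensional one-hot (orthogonal) vector and maps it through the same three tied weight vectors, so all positions learn one common low-dimensional encoding.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def one_hot(residue):
    """Orthogonal encoding: a 20-dimensional 0/1 vector per residue."""
    return [1.0 if aa == residue else 0.0 for aa in AMINO_ACIDS]

def encode(residue, shared_weights):
    """shared_weights: three tied lists of 20 weights, reused at every position."""
    v = one_hot(residue)
    return [sum(w * x for w, x in zip(ws, v)) for ws in shared_weights]
```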
15. Engineering Intuition Into NNs
- Alpha helices have a period of 3.6 residues per turn
- A NN can be specially designed to reflect that
- Using this, plus adaptive encoding: Q3 = 66%
- Adding homology: Q3 = 73%
16. HMMs and Transmembrane Proteins (again)
17. HMMTOP Architecture
- TMHs: 17-25 residues
- Tails: 1-15 residues
- Blue letters show structural state labels
18. TMHMM Architecture
- Helices are 5-25 residues
- Caps follow helices
- Cytoplasmic side
- Loop: 0-20 residues
- Globular: 1 state
- Extra-cellular side
- Long loop: 0-100 residues
- Globular: 3 states
19. Predicting Globular Proteins with Hidden Neural Networks
- YASPIN
- Neural net predicts seven classes (He, H, Hb, C, Ee, E, Eb) using a 15-residue window of PSSM input
- HMM filters this output
- Can you imagine how this is done? (One possibility is sketched below.)
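One plausible answer, sketched below: run Viterbi over the network's per-residue class probabilities, treating them as emission scores, so the HMM's transition structure filters out grammatically impossible state sequences (e.g., a one-residue helix). This is a generic hidden-neural-network sketch, not YASPIN's published algorithm.

```python
import math

def viterbi_filter(nn_probs, states, log_trans):
    """nn_probs: per-residue dicts state -> NN probability.
    log_trans: dict (s_prev, s) -> log transition probability."""
    v = [{s: math.log(nn_probs[0][s] + 1e-12) for s in states}]
    back = []
    for probs in nn_probs[1:]:
        row, ptr = {}, {}
        for s in states:
            prev = max(states, key=lambda p: v[-1][p] + log_trans[(p, s)])
            row[s] = v[-1][prev] + log_trans[(prev, s)] + math.log(probs[s] + 1e-12)
            ptr[s] = prev
        v.append(row)
        back.append(ptr)
    path = [max(states, key=lambda s: v[-1][s])]   # best final state
    for ptr in reversed(back):                     # follow back-pointers
        path.append(ptr[path[-1]])
    return list(reversed(path))
```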
20. Coiled-coil HMM: MARCOIL
The design lets you start and end in any phase of the heptad repeat.
21. Support Vector Machines (SVMs)
- Classifiers
- Basic machine is a 2-class classifier
- Training data
- Set of labeled vectors: <x1, x2, ..., xn, C>
- Class C = 1 or C = -1
- Supervised learning (like neural nets)
- Learn from positive and negative examples
- Output
- Function predicting class of unlabeled vectors
22. SVM Example
- Alpha helix predictor
- 15-residue window
- 21 numbers per residue
- PSI-BLAST PSSM: 20 numbers
- Spacer flag indicating off the end of the protein
- 315 numbers total per window (15 × 21)
- Training samples
- Non-helix samples: <x1, x2, ..., x315, -1>
- Helix samples: <x1, x2, ..., x315, 1>
- Training finds the function of X that best separates the non-helix from the helix samples (feature construction is sketched below)
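A sketch of assembling the 315-number vector for one window. The exact spacer-flag encoding (flag set to 1 when a position falls off the end of the protein, with zeroed PSSM scores) is an assumption for illustration.

```python
def window_features(pssm, center, half=7):
    """pssm: one 20-number row per residue of the protein.
    Returns 15 positions x 21 numbers = 315 features for the window."""
    features = []
    for i in range(center - half, center + half + 1):
        if 0 <= i < len(pssm):
            features.extend(pssm[i])        # 20 PSSM numbers
            features.append(0.0)            # spacer flag: inside the protein
        else:
            features.extend([0.0] * 20)     # no PSSM scores off the end
            features.append(1.0)            # spacer flag: off the end
    return features
```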
23. SVMs vs. NNs as Classifiers
- Similarities
- Compute a function on their inputs
- Trained to minimize error
- Differences
- NNs find any hyperplane that separates the two classes
- SVMs find the maximum-margin hyperplane
- NNs can be engineered by designing their topology
- SVMs can be tailored by designing the kernel function
24. SVM Details
Separating hyperplanes
- Primal form: choose w, b to minimize ||w||² subject to y_i (w·x_i + b) ≥ 1 for every training sample i
- Dual form (support vectors): maximize Σ_i α_i − ½ Σ_i Σ_j α_i α_j y_i y_j (x_i·x_j) subject to α_i ≥ 0 and Σ_i α_i y_i = 0, where w = Σ_i α_i y_i x_i
- Kernel trick: replace the dot products x_i·x_j by a non-linear kernel function K(x_i, x_j) (a usage sketch follows below)
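A usage sketch with scikit-learn (an assumed tool; the slides name no library), showing the kernel choice and the support vectors that end up carrying non-zero weight, which also anticipates the next slide. The random data are stand-ins for real window features and labels.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 315))               # stand-in window feature vectors
y = np.where(X[:, 0] + X[:, 1] > 0, 1, -1)    # stand-in +1/-1 helix labels

clf = SVC(kernel="rbf", C=1.0)                # kernel trick: RBF kernel
clf.fit(X, y)
print(len(clf.support_), "support vectors out of", len(y), "training samples")
```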
25. Dubious Statement
- In marked contrast to NNs, SVMs have few explicit parameters to fit
- The vector of weights (the α_i of the dual form) is as long as the number of training samples
- But the maximum-margin hyperplane will have most of the weights equal to zero; only the support vectors will have non-zero weights.