Title: Fold Prediction
1 Fold Prediction
- Huzefa Rangwala
- (rangwala_at_cs.umn.edu)
2 Problem Definition
- Hierarchy
- According to the SCOP database, proteins are classified to reflect both structural and evolutionary relationships
- Same family => clear evolutionary relationship
- Same super-family => probably common evolutionary origin
- Same fold => major structural similarity
- Fold Prediction
- To recognize similarity at all three of the above levels by defining independent problems.
- Sequence-Structure Homology Recognition
- To recognize similarity at only the family and super-family levels
- Does this imply that at the fold prediction level structural information will play a significant part compared to just sequence information?
3 Techniques (Groups)
- Pairwise sequence comparisons recognize close homologs
- Profiles/HMMs relying only on sequence information are able to capture distant homolog relationships
- Second scheme: some structural information improves the accuracy
- E.g., FUGUE, HMAP, 3D-PSSM (today's focus)
- Threading methods (sequence comparisons with a structural template) show significant performance at the fold level
4 Schemes to be discussed
5 Evaluation: Ranking
- Each of the schemes gives us a similarity score between two protein domains whose relationship we are trying to determine, i.e., whether they are part of the same family, superfamily, or fold.
- We sort the domain pairs by similarity score to get a ranked list, from which we count true positives and false positives. From these two statistics we compute the ROC, coverage, specificity, or sensitivity.
- Defining True Positives and False Positives
- TP: At the superfamily level, domain pairs that share the same superfamily but belong to different families. (Same-family relationships are too trivial to predict.)
- FP: Pairs from different folds are considered FPs. Pairs with the same fold but different superfamilies are ignored.
- Similar definitions hold for the other benchmark levels
- Different schemes use this same strategy but restrict the pairs they compare. HMAP and 3D-PSSM compute the statistics over one overall ranking, whereas FUGUE defines these statistics on a per-test-domain basis
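The superfamily-level labelling and ranking described above can be sketched as follows. This is a minimal illustration, not any of the papers' benchmark code: the `(fold, superfamily, family)` tuple layout and the function names are assumptions made for the example.

```python
from typing import List, Optional, Tuple

# Hypothetical SCOP-style label: (fold, superfamily, family).
Scop = Tuple[str, str, str]


def label_pair_superfamily(a: Scop, b: Scop) -> Optional[bool]:
    """Label a domain pair for the superfamily-level benchmark.

    True  -> true positive (same superfamily, different family)
    False -> false positive (different fold)
    None  -> ignored (same family, or same fold but different superfamily)
    """
    same_fold = a[0] == b[0]
    same_sf = same_fold and a[1] == b[1]
    same_fam = same_sf and a[2] == b[2]
    if same_fam:
        return None
    if same_sf:
        return True
    if not same_fold:
        return False
    return None


def ranked_counts(scored_pairs: List[Tuple[float, Scop, Scop]]) -> List[Tuple[int, int]]:
    """Sort pairs by decreasing score and return cumulative (TP, FP) counts."""
    tp = fp = 0
    curve = []
    ranked = sorted(scored_pairs, key=lambda p: p[0], reverse=True)
    for _score, a, b in ranked:
        label = label_pair_superfamily(a, b)
        if label is None:
            continue  # neglected pairs do not affect either count
        if label:
            tp += 1
        else:
            fp += 1
        curve.append((tp, fp))
    return curve
```

The cumulative `(TP, FP)` list is the raw material for an ROC or coverage curve. A per-query variant (as FUGUE does) would call `ranked_counts` once per test domain instead of once over all pairs.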
6 Evaluation: Alignment Accuracy
- HMAP and AGAPE discuss evaluation based on comparison to structural alignments.
- HMAP uses the PrISM structural alignment between pairs as the gold standard.
- Two measures are evaluated: full accuracy and core accuracy (positions within the secondary structure regions).
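One common way to compute such measures is the fraction of reference aligned pairs that a test alignment reproduces. The sketch below assumes that definition (the slides do not spell it out), and the names `alignment_accuracy`, `core_accuracy`, and `core_positions` are illustrative.

```python
from typing import Set, Tuple

Pair = Tuple[int, int]  # (query position, template position)


def alignment_accuracy(test: Set[Pair], reference: Set[Pair]) -> float:
    """Fraction of reference aligned pairs that the test alignment reproduces."""
    if not reference:
        return 0.0
    return len(test & reference) / len(reference)


def core_accuracy(test: Set[Pair], reference: Set[Pair],
                  core_positions: Set[int]) -> float:
    """Same measure, restricted to reference pairs whose template position
    lies inside a secondary structure element (assumed definition of 'core')."""
    core_ref = {p for p in reference if p[1] in core_positions}
    return alignment_accuracy(test & core_ref, core_ref)
```

For example, with a reference of four diagonal pairs and a test alignment that gets only the first two right, full accuracy is 0.5, while core accuracy is 1.0 if only those two template positions are in secondary structure.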
7 SS 3D-PSSM (Kelley et al.)
- Goal: enhance fold detection by using profiles that leverage structural similarity.
- Overview
- Other fold-detection methods compute the distance between a query sequence (unknown fold) and a library sequence (known fold) using a standard PSSM.
- 3D-PSSM uses extra information
- Solvation potentials per residue
- Secondary structure
- 3D-PSSM obtained from a super PSSM calculated from a structural alignment between sequences in the super-family
8 SS 3D-PSSM (residue features)
- Residue features and scoring functions between query and library sequences
- Degree of burial: the score is the solvation potential per residue (frequency of occurrence of an amino acid with a specific degree of burial relative to all amino acid types with this degree of burial).
- Secondary structure: as calculated by STRIDE for the library, PSIPRED for the query. Score is 1 for matching elements, -1 for mismatches.
- 1D / 3D-PSSM: (next slide). Score is a profile-to-profile score, as in PSI-BLAST.
9 3D Profile: What's going on?
10 SS 3D-PSSM (building 3D-PSSM)
- 3D-PSSM for a master sequence (built offline)
- Perform a 3D alignment using SAP (Orengo, 1992) between the master sequence and all sequences in the superfamily.
- Select similar structures (RMS < 6 Å)
- Create a structural MSA
- Start with the closest sequence to the master
- Iteratively align the sequence structurally closest to any of the sequences already in the MSA
- The 3D-PSSM is the cumulative PSSM made by combining all aligned PSSMs.
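The order in which structures join the structural MSA can be sketched as a greedy selection. This is only an illustration of the ordering step: the `dist` table of pairwise structural distances is a hypothetical input, and applying the 6 Å cutoff relative to the master is an assumption.

```python
from typing import Dict, FrozenSet, List


def msa_join_order(master: str, members: List[str],
                   dist: Dict[FrozenSet[str], float],
                   rms_cutoff: float = 6.0) -> List[str]:
    """Return the order in which structures would be added to the MSA."""
    def d(a: str, b: str) -> float:
        return dist[frozenset((a, b))]

    # Keep only structures similar enough to the master (assumed reading of the cutoff).
    remaining = [s for s in members if d(master, s) < rms_cutoff]
    in_msa = [master]
    order = []
    while remaining:
        # Next is the structure closest to ANY structure already in the MSA;
        # on the first pass this is simply the one closest to the master.
        nxt = min(remaining, key=lambda s: min(d(s, t) for t in in_msa))
        remaining.remove(nxt)
        in_msa.append(nxt)
        order.append(nxt)
    return order
```

The function returns only the joining order; the structural alignment of each new member and the accumulation of its PSSM would happen at each step of the loop.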
11 SS 3D-PSSM (searching)
- For a given query, compute the score to every library sequence
- Use a global DP alignment with pairwise scores
- S(Xi, Yj) = Spssm(Xi, Yj) + Ssecondary(Xi, Yj) + Ssolvation(Xi, Yj)
- Perform three different DP passes
- Library sequence is matched to query PSSM
- Query sequence is matched to library 1D-PSSM
- Query sequence is matched to library 3D-PSSM
- Take the maximum score from the three passes
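A minimal sketch of the search step, assuming the three per-position terms are summed and using a plain Needleman-Wunsch recursion with a linear gap cost. The gap value and the scorer callables are placeholders, not parameters taken from 3D-PSSM.

```python
from typing import Callable, Iterable

Scorer = Callable[[int, int], float]  # score for aligning position i with position j


def global_align_score(n: int, m: int, pair_score: Scorer,
                       gap: float = -2.0) -> float:
    """Needleman-Wunsch score for sequences of length n and m (linear gap cost)."""
    prev = [j * gap for j in range(m + 1)]
    for i in range(1, n + 1):
        cur = [i * gap] + [0.0] * m
        for j in range(1, m + 1):
            cur[j] = max(prev[j - 1] + pair_score(i - 1, j - 1),  # align i with j
                         prev[j] + gap,                           # gap in sequence 2
                         cur[j - 1] + gap)                        # gap in sequence 1
        prev = cur
    return prev[m]


def combined_score(s_pssm: Scorer, s_secondary: Scorer, s_solvation: Scorer) -> Scorer:
    """S(Xi, Yj) as the sum of the three terms (assumed combination)."""
    return lambda i, j: s_pssm(i, j) + s_secondary(i, j) + s_solvation(i, j)


def best_of_passes(n: int, m: int, pass_scorers: Iterable[Scorer],
                   gap: float = -2.0) -> float:
    """Run one DP pass per scorer and keep the maximum score."""
    return max(global_align_score(n, m, s, gap) for s in pass_scorers)
```

In use, each of the three passes would supply its own `s_pssm` (query PSSM, library 1D-PSSM, or library 3D-PSSM) wrapped by `combined_score`, and `best_of_passes` would return the final score for the query/library pair.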
12 SS 3D-PSSM (evaluation)
- Now that we have a score for a sequence pair, we need to determine its significance.
- We do this by fitting a linear function to
- log(number of hits up to a score) vs.
- log(sum of all scores)
- Effectiveness
- 136 sequences in the test set met the criteria
- PSI-BLAST assignment failed
- There was a homology to a protein with a 3D profile
- Only kept one test sequence per 3D profile.
- 18 of the 136 sequences were correctly classified
13 CK FUGUE: Remote Homology Detection using Known Structure Information
- Big Picture: FUGUE constructs structural profiles for families of known structures. Each profile is converted into a scoring matrix (like PSI-BLAST), which is used to align the sequence. (z-scores) Random alignments are conducted to obtain mean scores for a structural profile, and a query sequence is considered well aligned if its score is significantly higher than the mean.
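The z-score idea can be sketched as below. Two things are assumptions here: that the random alignments come from shuffling the query sequence, and the `align_score` callable, which stands in for "align this sequence against the profile and return the score" and is not a FUGUE API.

```python
import random
import statistics
from typing import Callable, Optional


def shuffled_z_score(query: str,
                     align_score: Callable[[str], float],
                     n_shuffles: int = 100,
                     rng: Optional[random.Random] = None) -> float:
    """z-score of the query's alignment score against a shuffled background."""
    rng = rng or random.Random(0)
    observed = align_score(query)
    background = []
    for _ in range(n_shuffles):
        residues = list(query)
        rng.shuffle(residues)  # same composition, random order
        background.append(align_score("".join(residues)))
    mean = statistics.mean(background)
    sd = statistics.stdev(background)  # needs n_shuffles >= 2
    if sd == 0:
        raise ValueError("background scores have zero variance")
    return (observed - mean) / sd
```

A large positive z-score means the real alignment scores well above what sequences of the same composition achieve by chance against the same profile.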
14 CK Basics of FUGUE Alignment
- Start with a collection of structure alignments; the source is a pruned version of HOMSTRAD: 177 families (aligned), 706 total structures
- Want to build a scoring matrix based on the structures in a family
- Count substitutions between amino acids in (structurally) aligned positions in the families
- Catch: when residues A and B appear in aligned positions, don't just increment the counts; also consider the environment in which the two appear.
15 CK Environment
- Three categories of environments were considered, each with several classes.
- Secondary structure: alpha-helix, beta-strand, irregular structure (coil), or positive phi main-chain angle (does this mean main chain, i.e. no secondary structure??) (4 classes)
- Solvent accessibility: 7% or greater is accessible, otherwise inaccessible (2 classes)
- Hydrogen bonds: combination of 3 T/F conditions: side-chain to side-chain or not, side-chain to main-chain NH or not, side-chain to main-chain CO or not (8 classes)
- An environment is a specification of a class for each of these 3 categories, for a total of 64 environments. Each residue in the structural families fits into one of these environments
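The count of 64 follows from 4 x 2 x 8, which a short enumeration makes explicit. The class names below are illustrative labels, not identifiers from FUGUE.

```python
from itertools import product

SECONDARY_STRUCTURE = ["helix", "strand", "coil", "positive_phi"]  # 4 classes
ACCESSIBILITY = ["accessible", "inaccessible"]                     # 2 classes
# Three true/false hydrogen-bond conditions:
# (side-chain to side-chain, side-chain to main-chain NH, side-chain to main-chain CO)
HBOND_FLAGS = list(product([False, True], repeat=3))               # 8 classes

ENVIRONMENTS = list(product(SECONDARY_STRUCTURE, ACCESSIBILITY, HBOND_FLAGS))


def environment_index(ss: str, acc: str, hbonds) -> int:
    """Map one residue's (secondary structure, accessibility, H-bond flags) to 0..63."""
    return ENVIRONMENTS.index((ss, acc, tuple(hbonds)))
```

Each residue in a structural family gets exactly one index, which can then key the environment-specific substitution counts.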
16 CK Substitution Matrix Formulation
- Construct a substitution matrix of log-odds scores; each entry is log(P(B|A,E) / q_B), where q_B is the background probability of B
- The background probability is not taken from a BLOSUM matrix; it is calculated from the occurrence of a residue in the structure family (eqn 3??)
- Entries are actually a weighted combination of the above log-odds score and the score from an 'a priori' distribution calculated based on another study
- Some residues are excluded (masked) from the substitution calculation: domain-domain interaction residues and residues interacting with heteroatoms (??)
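A minimal sketch of the log-odds step, assuming the entry is log(P(B|A,E) / q_B) with P estimated from raw counts. It uses natural logarithms and omits the 'a priori' mixing and the residue masking mentioned above; the `counts` layout is a hypothetical one chosen for the example.

```python
import math
from collections import Counter
from typing import Dict, Tuple


def log_odds_scores(counts: Dict[Tuple[str, str], Counter],
                    background: Dict[str, float]) -> Dict[Tuple[str, str, str], float]:
    """counts[(A, E)][B] = times residue A in environment E is aligned with B.

    Returns scores[(A, E, B)] = log(P(B | A, E) / q_B) for observed substitutions.
    """
    scores = {}
    for (a, env), row in counts.items():
        total = sum(row.values())
        for b, n in row.items():
            if n > 0:  # unobserved substitutions are skipped (no pseudocounts here)
                p = n / total  # estimate of P(B | A, E)
                scores[(a, env, b)] = math.log(p / background[b])
    return scores
```

A positive score means B replaces A in environment E more often than B's background frequency would predict; a negative score means less often.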
17 CK Example of Substitution Matrix Calc
18 CK Profile Calculation
- Calculate a weight for each structure: the sum of its dissimilarity to all other structures divided by the total dissimilarity of all structures in the family (dissimilarity is fraction of identical residues (??))
- The scoring matrix has as many columns as the longest structure and as many rows as there are structures in the family alignment. Entries in the scoring matrix are calculated as follows
- To calculate the entry at column P (position) and row B (amino acid), do the following for every structure: its amino acid at position P is A in environment E; add to the scoring matrix entry the weight of the structure multiplied by S(A,E->B), a score from the substitution matrix.
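The weighting and the per-entry sum can be sketched as follows. The normalization (each pair counted twice, so the weights sum to 1) and the data layouts are assumptions; gaps in a column are not handled.

```python
from typing import Dict, List, Tuple


def structure_weights(dissim: List[List[float]]) -> List[float]:
    """Weight of structure k = its summed dissimilarity to the others,
    divided by the total over all structures (symmetric matrix, zero diagonal)."""
    row_sums = [sum(row) for row in dissim]
    total = sum(row_sums)
    return [s / total for s in row_sums]


def profile_entry(column: List[Tuple[str, str]],
                  weights: List[float],
                  subst: Dict[Tuple[str, str, str], float],
                  b: str) -> float:
    """Entry for amino acid b at one alignment position.

    column[k] = (A, E): residue and environment of structure k at this position.
    subst[(A, E, b)] = environment-specific substitution score for A replaced by b.
    """
    return sum(w * subst[(a, env, b)] for (a, env), w in zip(column, weights))
```

Calling `profile_entry` for every alignment position and every amino acid `b` fills the scoring matrix, indexed by position and amino acid as in the entry rule above.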
19 CK Gap Penalty Calculations
- Gap penalties are modified based on the secondary structure of the position
- Insertions/deletions are highly penalized in the middle of a secondary structure region and lightly penalized in unordered (coil) regions
- The overall penalty at a position is determined by a weighted sum of the penalties for each structure at that position
- Penalty scores (high, low, very low, etc.) are determined numerically from random simulation
- Position-specific gap scores are calculated for each family, modified by solvent accessibility (??)
20 CK Evaluation
- Used a benchmark defined in Shi et al.: 976 domains from SCOP.
- For domain D (out of the test set), run FUGUE with D as the query sequence to generate a list of scores for each other domain. Sort these scores. Calculate specificity and sensitivity curves. Each domain D has a PSI-BLAST profile.
- Compared performance at finding relatives at the SCOP family, super-family, and fold levels
- For the family level, count TPs as family members. For the super-family level, count TPs as super-family members minus family members. Likewise for the fold level (fold members minus super-family and family members)
- Outperformed other methods at the family and super-family levels; beaten by THREADER at the fold level.
- Several other methods were used to evaluate various aspects of FUGUE
21 HMAP: Hybrid multidimensional alignment profiles
- Big Picture: A series of hybrid multidimensional alignment profiles that combine sequence, secondary and tertiary structure information were developed. Their evaluation concludes that secondary structure plays a more significant part in remote homology detection than 3D structure.
22 HMAP: The setting
23 KD,NW HMAP Details
- Selecting the correct subset of sequences and structures
- No sequence should have greater than 40% sequence identity
- Structures of at least 2.5 Angstrom resolution
- More than 50 SSE are excluded
- Sequence-based profiles are generated using PSI-BLAST
- Database is restricted by CD-HIT to 65% identity
24 KD,NW HMAP Details
- Multiple structure-based sequence alignments and structure-based profiles
- For each template structure, get the closest neighbors based on the PSD score (< 1).
- The sequence of each of these neighbors is used as a seed for PSI-BLAST.
- Keep only those sequences above a certain threshold
- Purged sequence alignments are combined with a sequence using multiple structure alignment as a guide. (SEMSA)
- Get a similar PSSM -> 3D Profile
25 KD,NW HMAP Details
- Secondary Structure Profiles
- Just like in the previous step, from the multiple structure alignment create a secondary structure profile where, instead of the 20 amino acids, we use the three secondary structure elements
- Motifs
- In regions where they are confident (core/motif) about the structure from the alignments, they incorporate structural information. In regions where they are not confident (loops), they revert to 1D info.
26 KD,NW HMAP(1,2), HMAP(1,2,3)
- HMAP(1,2): 1D profile + secondary structure profiles
- HMAP(1,2,3): interleaving of the 1D profile with the 3D profile based on motif information + secondary structure information.
27 KD,NW Query Profiles & Alignments
- Query profiles are generated using the PSI-BLAST profile and predicted secondary structure information (predicted using JNET, PSIPRED and PHD, averaged).
- Simple DP, but the scoring is a dot product of the profiles weighted by the secondary structure as well as the conservation factors. (NEXT SLIDE)
- Structure-specific gap penalties like FUGUE's, but here the penalties are based on the secondary structure profiles.
28 Similarity score for HMAP
29 KD,NW PSD Details
30 KD,NW PSD Details
- a: number of SSEs for protein A
- S(A,B): the score from the double dynamic programming algorithm aligning SSEs of the pair of proteins A and B.
- x and y are experimentally determined parameters
31 HMAP Evaluation
- HMAP evaluation was done on the ranking to compute the coverage vs. accuracy defined earlier; the 1D profile ranked better than PSI-BLAST
- The performance of HMAP(1,2) was better than HMAP(1,2,3) at the super-family level and comparable at the fold level. Can we conclude whether 3D information adds anything significant for the remote homology detection problem?
32 DR AGAPE: Predicting Without Folds
- Big Picture: Appending secondary structure and solvent accessibility to profile information. They claim that the explicit 3D structural information used in other schemes may not be very helpful
- They build generalized profiles based on 1D structure predictions (i.e., prediction of secondary structure and solvent accessibility using PROFphd). Some preliminary testing is done on known structures using DSSP
- Six dimensions are added to the profile based on the predictions, i.e., either buried or exposed and helix, strand or other
33 DR Generalized PSSM
- GPSSM(i, j, 1D(i), 1D(j)) = (r) PSSM(i, j) + (1-r) SM(1D(i), 1D(j))
- GPSSM: Generalized Position-Specific Scoring Matrix
- PSSM: from PSI-BLAST
- SM: 1D substitution matrix (learnt by aligning predicted-predicted, known-known, or combinations of 1D structures)
- i: query position; j: database position
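Read as a convex combination of a sequence term and a 1D-structure term, the generalized score can be sketched as below. The data layouts (a list of per-position dicts for the PSSM, `(secondary structure, accessibility)` tuples for the 1D states) and the lookup of PSSM row i by the database residue at j are assumptions for illustration.

```python
from typing import Dict, List, Tuple

State = Tuple[str, str]  # e.g. ("helix", "buried"); six possible states


def gpssm(i: int, j: int,
          pssm: List[Dict[str, float]],
          db_seq: str,
          query_1d: List[State],
          db_1d: List[State],
          sm: Dict[Tuple[State, State], float],
          r: float) -> float:
    """r * PSSM(i, j) + (1 - r) * SM(1D(i), 1D(j)) for 0 <= r <= 1."""
    if not 0.0 <= r <= 1.0:
        raise ValueError("r must lie in [0, 1]")
    seq_term = pssm[i][db_seq[j]]             # query profile row i vs. database residue j
    struct_term = sm[(query_1d[i], db_1d[j])]  # 1D-structure substitution score
    return r * seq_term + (1.0 - r) * struct_term
```

With r = 1 this reduces to an ordinary PSSM score; with r = 0 only the 1D structure states are compared.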
34 DR Methods: Fold recognition (SW-PSSM)
- Obtaining and comparing 1D structures
- O2O: Observed against Observed (DSSP)
- P2O: Predicted (PROFphd) against Observed
- P2P
- P2P-Bis
- Predicted profile against predicted DB sequence AND
- Predicted profile sequence against DB profile
35 DR A simple DP solution
Image from "Protein fold recognition by prediction-based threading", Rost, Journal of Molecular Biology (ISSN 0022-2836), 1997, vol. 270, iss. 3, p. 471
36 DR Statistical Significance
- Fitting the scores to an extreme value distribution and modifying the scores to get a significance estimate, thereby improving the ranking.
- They introduced bidirectional scoring, which shows improved performance
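A minimal sketch of turning raw scores into a significance estimate with an extreme value (Gumbel) distribution. The method-of-moments fit used here is an assumption chosen for brevity; the slides do not say how AGAPE estimates the parameters.

```python
import math
import statistics
from typing import List, Tuple

EULER_GAMMA = 0.5772156649015329


def fit_gumbel(scores: List[float]) -> Tuple[float, float]:
    """Method-of-moments estimate of Gumbel location mu and scale beta.

    Uses mean = mu + gamma * beta and variance = (pi * beta)^2 / 6.
    """
    beta = statistics.stdev(scores) * math.sqrt(6.0) / math.pi
    mu = statistics.mean(scores) - EULER_GAMMA * beta
    return mu, beta


def p_value(score: float, mu: float, beta: float) -> float:
    """P(S >= score) under the fitted Gumbel distribution."""
    z = (score - mu) / beta
    if z < -30.0:
        return 1.0  # exp(-exp(30)) is numerically zero; also avoids overflow
    return 1.0 - math.exp(-math.exp(-z))
```

Fitting on background (unrelated-pair) scores and ranking hits by `p_value` instead of raw score is one way to make scores from different queries comparable.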
37 DR (Good?) Prediction Errors
- P2P does better without using observed structures because the errors correlate
- i.e., 1D prediction mistakes correlate between proteins with dissimilar sequences but similar structures
- Meanings/implications (?)
38 DR Results
- Using 1D data is
- Always better than sequence-only methods (BLAST)
- Currently better than some methods that use 3D information
- However, in the long run, 3D-including methods will probably be better
- Slow: uses DP
- Why? All the other methods also use some sort of DP
39 Statistical Significance
- FUGUE: z-score
- HMAP: Mixed Distribution Fit
- The scores are fit to a mixed distribution, which was estimated empirically by shuffling amino acid profiles and aligning the shuffled sequences to the original with the positions of secondary structure elements fixed.
- 3D-PSSM
- The significance of a match was evaluated by fitting a linear relationship between log(number of hits up to a score) and log(total score)
- AGAPE: Extreme Value Distribution
- The significance of a match was evaluated by fitting it to an extreme value distribution.