Fold Prediction


1
Fold Prediction

- Huzefa Rangwala
(rangwala_at_cs.umn.edu)

2
Problem Definition

Hierarchy
According to the SCOP database, proteins are
classified to reflect both structural and
evolutionary relationships:
Same family => clear evolutionary relationship
Same super-family => probably common evolutionary
origin
Same fold => major structural similarity
Fold Prediction
To recognize similarity at all three of the above
levels by defining independent problems.
Sequence-Structure Homology Recognition
To recognize similarity at only the family and
super-family levels.
Does this imply that at the fold prediction level
structural information will play a more significant
part than sequence information alone?

3
Techniques (Groups)

Pairwise sequence comparisons recognize close
homologs.
Profiles/HMMs relying only on sequence
information are able to capture distant homolog
relationships.
Second scheme: adding some structural information
improves the accuracy.
E.g. FUGUE, HMAP, 3D-PSSM (today's focus)
Threading methods (sequence comparisons with a
structural template) show significant
performance at the fold level.

4
Schemes to be discussed

3D-PSSM
FUGUE
HMAP
AGAPE

5
Evaluation: Ranking

Each scheme gives a similarity score between two
protein domains whose relationship we are trying
to determine, i.e. whether they belong to the
same family, superfamily or fold.
We sort the domain pairs by similarity score to
get a ranked list, from which we count true
positives and false positives. From these two
statistics we compute the ROC, coverage,
specificity or sensitivity.
Defining True Positives and False Positives
TP: at the superfamily level, domain pairs that
share the same superfamily but belong to different
families. (Same-family relationships are too
trivial to predict.)
FP: pairs from different folds are considered
FPs. Pairs that have the same fold but different
superfamilies are neglected.
Similar definitions hold for the other benchmark
levels.
Different schemes use this same strategy but
restrict the pairs they compare: HMAP and 3D-PSSM
rank all pairs together to compute the
statistics, whereas FUGUE defines these statistics
per test domain.
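
The ranking evaluation above can be sketched as follows. This is a minimal illustration, not code from any of the tools: the `Domain` record, the function names and the toy SCOP-style labels are assumptions made for the example. Labels are assumed to be full SCOP-style identifiers, so a shared superfamily implies a shared fold.

```python
from collections import namedtuple

# Hypothetical record holding SCOP-style labels for one domain.
Domain = namedtuple("Domain", "name fold superfamily family")


def label_pair_superfamily(a, b):
    """Label a domain pair at the superfamily level: 'TP', 'FP' or None (ignored)."""
    if a.fold != b.fold:
        return "FP"          # different folds count as false positives
    if a.superfamily != b.superfamily:
        return None          # same fold, different superfamily: neglected
    if a.family == b.family:
        return None          # same family: too trivial, not counted
    return "TP"              # same superfamily, different family


def ranked_counts(scored_pairs):
    """scored_pairs: list of (score, a, b). Returns cumulative (TP, FP)
    counts while walking down the ranking from the best score."""
    tp = fp = 0
    curve = []
    for _, a, b in sorted(scored_pairs, key=lambda t: t[0], reverse=True):
        label = label_pair_superfamily(a, b)
        if label is None:
            continue
        if label == "TP":
            tp += 1
        else:
            fp += 1
        curve.append((tp, fp))
    return curve


d1 = Domain("d1", "a.1", "a.1.1", "a.1.1.1")
d2 = Domain("d2", "a.1", "a.1.1", "a.1.1.2")
d3 = Domain("d3", "b.2", "b.2.1", "b.2.1.1")
d4 = Domain("d4", "a.1", "a.1.2", "a.1.2.1")
curve = ranked_counts([(9.0, d1, d2), (5.0, d1, d3), (7.0, d1, d4)])
```

Coverage, ROC or specificity curves can then be derived from the cumulative counts. A per-query evaluation in the FUGUE style would call `ranked_counts` once per test domain instead of once over all pairs.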

6
Evaluation: Alignment Accuracy

HMAP and AGAPE discuss evaluation based on
comparison to structural alignments.
HMAP uses the PrISM structural alignment between
pairs as the gold standard.
Two measures are evaluated: full accuracy and core
accuracy (positions within the secondary structure
regions).
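
One common way to compute such accuracies is the fraction of residue pairs in the reference structural alignment that the predicted alignment reproduces. The slides do not give the exact formula, so the sketch below is an assumed definition; the function name and the representation of alignments as `(i, j)` index pairs are hypothetical.

```python
def alignment_accuracy(predicted, reference, core_positions=None):
    """Fraction of reference aligned pairs (i, j) reproduced by the prediction.

    If core_positions is given, only reference pairs whose query index i
    lies in that set (e.g. inside secondary structure elements) are counted.
    """
    ref = set(reference)
    if core_positions is not None:
        ref = {(i, j) for (i, j) in ref if i in core_positions}
    if not ref:
        return 0.0
    return len(ref & set(predicted)) / len(ref)


reference = [(0, 0), (1, 1), (2, 2), (3, 4)]
predicted = [(0, 0), (1, 1), (2, 3), (3, 4)]
full = alignment_accuracy(predicted, reference)
core = alignment_accuracy(predicted, reference, core_positions={2, 3})
```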

7
SS 3D-PSSM (Kelley et al.)

Goal: enhance fold detection by using profiles
that leverage structural similarity.
Overview
Other fold-detection methods compute the distance
between a query sequence (unknown fold) and a
library sequence (known fold) using a standard
PSSM.
3D-PSSM uses extra information:
Solvation potentials per residue
Secondary structure
3D-PSSM, obtained from a super-PSSM calculated
from a structural alignment between sequences in
the super-family

8
SS 3D-PSSM (residue features)

Residue features and scoring functions between
query and library sequences:
Degree of burial - the score is the solvation
potential per residue (frequency of occurrence
of an amino acid with a specific degree of burial
relative to all amino acid types with this degree
of burial).
Secondary structure - as calculated by STRIDE
for the library, PSIPRED for the query. Score is 1
for matching elements, -1 for mismatches.
1D / 3D-PSSM - (next slide). Score is a
profile-to-profile score, as in PSI-BLAST.

9
3D Profile: What's going on?
10
SS 3D-PSSM (building 3D-PSSM)

3D-PSSM for a master sequence (built offline):
Perform a 3D alignment using SAP (Orengo, 1992)
between the master sequence and all sequences in
the superfamily.
Select similar structures (RMS < 6 Å).
Create a structural MSA:
Start with the closest sequence to the master.
Iteratively align the sequence that is
structurally closest to any of the sequences
already in the MSA.
The 3D-PSSM is the cumulative PSSM made by
combining all aligned PSSMs.

11
SS 3D-PSSM (searching)

For a given query, compute the score to every
library sequence.
Use a global DP alignment with pairwise scores
S(Xi, Yj) = Spssm(Xi, Yj) + Ssecondary(Xi, Yj)
+ Ssolvation(Xi, Yj)
Perform three different DP passes:
Library sequence is matched to query PSSM
Query sequence is matched to library 1D-PSSM
Query sequence is matched to library 3D-PSSM
Take the maximum score from the three passes.
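
A minimal sketch of this search step, assuming the three score terms are simply summed and using a linear gap penalty whose value is made up for the example. The function names and the way the PSSM scorers are passed in (one callable per pass) are illustrative, not the 3D-PSSM implementation.

```python
def global_align_score(n, m, pair_score, gap=-2.0):
    """Needleman-Wunsch score for sequences of length n and m.
    pair_score(i, j) scores query position i against library position j.
    The linear gap penalty is an assumed placeholder value."""
    prev = [j * gap for j in range(m + 1)]
    for i in range(1, n + 1):
        cur = [i * gap] + [0.0] * m
        for j in range(1, m + 1):
            cur[j] = max(prev[j - 1] + pair_score(i - 1, j - 1),  # match
                         prev[j] + gap,                           # gap in library
                         cur[j - 1] + gap)                        # gap in query
        prev = cur
    return prev[m]


def secondary_score(ss_q, ss_l):
    """+1 for matching secondary structure elements, -1 for mismatches."""
    return 1.0 if ss_q == ss_l else -1.0


def best_of_passes(pssm_scorers, ss_query, ss_lib, solvation):
    """Run one DP pass per PSSM scorer and keep the maximum score."""
    n, m = len(ss_query), len(ss_lib)
    scores = []
    for s_pssm in pssm_scorers:
        def pair_score(i, j, s_pssm=s_pssm):
            return (s_pssm(i, j)
                    + secondary_score(ss_query[i], ss_lib[j])
                    + solvation(i, j))
        scores.append(global_align_score(n, m, pair_score))
    return max(scores)


# Toy example with two passes and no solvation contribution.
passes = [lambda i, j: 1.0 if i == j else -1.0,
          lambda i, j: 2.0 if i == j else -1.0]
best = best_of_passes(passes, "HH", "HH", lambda i, j: 0.0)
```

In the scheme described on the slide there would be three scorers: the query PSSM against the library sequence, and the library 1D-PSSM and 3D-PSSM against the query sequence.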

12
SS 3D-PSSM (evaluation)

Now that we have a score for a sequence pair, we
need to determine its significance.
We do this by fitting a linear function to
log(number of hits up to a score) vs.
log(sum of all scores).
Effectiveness
136 sequences in the test set met the criteria:
PSI-BLAST assignment failed
There was a homology to a protein with a 3D
profile
Only one test sequence was kept per 3D profile.
18 of the 136 sequences were correctly classified.

13
CK FUGUE: Remote
Homology Detection using Known Structure
Information

Big Picture: FUGUE constructs structural profiles
for families of known structures. Each profile is
converted into a scoring matrix (like PSI-BLAST)
which is used to align the sequence.
Z-scores: random alignments are conducted to
obtain mean scores for a structural profile, and a
query sequence is considered well aligned if its
score is significantly higher than the mean.
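
The z-score idea can be illustrated with a short sketch. It assumes the random alignments are produced by shuffling the query sequence, which the slides do not state explicitly; `score_fn` stands in for a hypothetical function that aligns a sequence to the profile and returns the alignment score.

```python
import random
import statistics


def z_score(query_score, random_scores):
    """How many standard deviations the query score lies above the
    mean score of the random alignments."""
    mean = statistics.mean(random_scores)
    sd = statistics.stdev(random_scores)
    return (query_score - mean) / sd


def shuffled_scores(sequence, score_fn, n=100, seed=0):
    """Score n shuffled copies of the sequence (assumed randomization).
    score_fn is a hypothetical sequence-to-profile alignment scorer."""
    rng = random.Random(seed)
    residues = list(sequence)
    scores = []
    for _ in range(n):
        rng.shuffle(residues)
        scores.append(score_fn("".join(residues)))
    return scores


z = z_score(10.0, [2.0, 4.0, 6.0])
```

A larger z-score means the observed alignment stands out more clearly from the background of random alignments.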

14
CK Basics of FUGUE
Alignment

Start with a collection of structure alignments.
The source is a pruned version of HOMSTRAD: 177
families (aligned), 706 total structures.
Want to build a scoring matrix based on the
structures in a family.
Count substitutions between amino acids in
(structurally) aligned positions in the families.
Catch: when residues A and B appear in aligned
positions, don't just increment the counts; also
consider the environment in which the two appear.

15
CK Environment

Three categories of environments were considered,
each with several classes.
Secondary structure: alpha-helix, beta-strand,
irregular structure (coil), or positive phi
main-chain angle (does this mean main chain, i.e.
no secondary structure??) - 4 classes
Solvent accessibility: 7% or greater is
accessible, otherwise inaccessible - 2 classes
Hydrogen bonds: combination of 3 T/F conditions
(side-chain to side-chain or not, side-chain to
main-chain NH or not, side-chain to main-chain CO
or not) - 8 classes
An environment is a specification of a class for
each of these 3 categories - a total of 4 x 2 x 8 =
64 environments. Each residue in the structural
families fits into one of these environments.
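
The 4 x 2 x 8 = 64 count can be checked by enumerating the environments directly. The class names and the `environment` helper below are illustrative; the cutoff of 7 is taken from the slide and assumed to be a relative accessibility in percent.

```python
from itertools import product

SECONDARY = ("helix", "strand", "coil", "positive_phi")  # 4 classes
ACCESS = ("accessible", "inaccessible")                   # 2 classes
# Three true/false hydrogen-bond conditions: side-chain to side-chain,
# side-chain to main-chain NH, side-chain to main-chain CO -> 8 classes.
HBONDS = tuple(product((False, True), repeat=3))

ENVIRONMENTS = list(product(SECONDARY, ACCESS, HBONDS))
ENV_INDEX = {env: k for k, env in enumerate(ENVIRONMENTS)}


def environment(ss, rel_accessibility, hb_side, hb_nh, hb_co, cutoff=7.0):
    """Map one residue's structural features to an environment index 0..63."""
    access = "accessible" if rel_accessibility >= cutoff else "inaccessible"
    return ENV_INDEX[(ss, access, (hb_side, hb_nh, hb_co))]
```

For example, `environment("helix", 12.0, False, True, False)` returns the index of an accessible helical residue whose side chain bonds only to a main-chain NH.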

16
CK Substitution Matrix Formulation

Construct a substitution matrix of log-odds
scores: each entry is log(P(B|A,E) / q_B), where
q_B is the background probability of B.
The background probability is not taken from a
BLOSUM matrix; it is calculated from the
occurrence of a residue in the structure family
(eqn 3??).
Entries are actually a weighted combination of
the above log-odds score and the score from an 'a
priori' distribution calculated based on another
study.
Some residues are excluded (masked) from the
substitution calculation: domain-domain
interaction residues and residues interacting
with heteroatoms(??).
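
A minimal sketch of the environment-specific log-odds calculation, under simplifying assumptions: the background probability q_B is estimated as the frequency of B among the observed substitution targets, no pseudocounts are used, and the weighting with the a priori distribution and the masking of residues are left out. The input format `(a, env, b)` is hypothetical.

```python
import math
from collections import Counter, defaultdict


def log_odds_tables(observations):
    """observations: iterable of (a, env, b) events, meaning residue a in
    environment env was seen structurally aligned with residue b.
    Returns {(a, env): {b: log(P(b | a, env) / q_b)}}."""
    observations = list(observations)
    background = Counter(b for _, _, b in observations)
    total = sum(background.values())

    by_context = defaultdict(Counter)
    for a, env, b in observations:
        by_context[(a, env)][b] += 1

    tables = {}
    for ctx, counts in by_context.items():
        n = sum(counts.values())
        tables[ctx] = {
            b: math.log((c / n) / (background[b] / total))
            for b, c in counts.items()
        }
    return tables


obs = [("A", 0, "A"), ("A", 0, "A"), ("A", 0, "G"), ("G", 1, "G")]
tables = log_odds_tables(obs)
```

Positive entries mark substitutions seen more often in that environment than the background frequency would suggest; negative entries mark substitutions that are under-represented.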

17
CK Example of Substitution Matrix Calc
18
CK Profile Calculation

Calculate a weight for each structure based on
the sum of its dis-similarity to all other
structures divided by the total dis-similarity of
all structures in the family (dis-similarity is
fraction of identical residues (??)).
The scoring matrix has as many columns as the
longest structure and as many rows as there are
structures in the family alignment. Entries in
the scoring matrix are calculated as follows.
To calculate the entry at column P (position) and
row B (amino acid), do the following for every
structure: its amino acid at position P is A in
environment E. Add to the scoring matrix entry
the weight of the structure multiplied by
S(A,E->B), a score from the substitution matrix.
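
The weighting and accumulation steps can be sketched as below. The data layout is an assumption for illustration: `dissimilarity` is a symmetric matrix of pairwise values, `column` lists the `(amino acid, environment)` of each structure at one alignment position (or `None` for a gap), and `subst` is a nested dictionary of substitution scores.

```python
def structure_weights(dissimilarity):
    """Weight of each structure: its summed dissimilarity to all others
    divided by the total over all structures."""
    row_sums = [sum(row) for row in dissimilarity]
    total = sum(row_sums)
    return [r / total for r in row_sums]


def profile_entry(column, weights, subst, b):
    """Profile score for amino acid b at one alignment position.
    column: per-structure (a, env) at this position, or None for a gap.
    subst[(a, env)][b]: substitution score for a in env replaced by b."""
    score = 0.0
    for w, cell in zip(weights, column):
        if cell is None:  # this structure has a gap here
            continue
        a, env = cell
        score += w * subst[(a, env)][b]
    return score


weights = structure_weights([[0, 1, 3], [1, 0, 2], [3, 2, 0]])
subst = {("A", 0): {"G": 1.0}, ("G", 1): {"G": 4.0}}
entry = profile_entry([("A", 0), ("G", 1), None], weights, subst, "G")
```

Structures that are more divergent from the rest of the family get larger weights, so a cluster of near-identical structures does not dominate the profile.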

19
CK Gap Penalty Calculations

Gap penalties are modified based on the secondary
structure of the position.
Insertions/deletions are highly penalized in the
middle of a secondary structure region and lightly
penalized in unordered (coil) regions.
The overall penalty at a position is determined by
a weighted sum of the penalties for each structure
at that position.
Penalty scores (high, low, very low,
etc.) are determined numerically from random
simulation.
Position-specific gap scores are calculated for
each family, modified by solvent accessibility (??).
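
As a rough sketch of the weighted-sum step only: the penalty values below are invented placeholders (the slides give only "high", "low", "very low"), and the solvent accessibility adjustment is omitted.

```python
# Hypothetical penalty values per secondary structure class.
GAP_PENALTY = {"helix": 3.0, "strand": 3.0, "coil": 1.0}


def position_gap_penalty(ss_column, weights, table=GAP_PENALTY):
    """Weighted sum over structures of the penalty implied by each
    structure's secondary structure at this alignment position."""
    return sum(w * table[ss] for ss, w in zip(ss_column, weights))


penalty = position_gap_penalty(["helix", "coil", "coil"], [0.5, 0.25, 0.25])
```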

20
CK Evaluation

Used a benchmark defined in Shi et al.: 976
domains from SCOP.
For each domain D in the test set, run FUGUE
with D as the query sequence to generate a list
of scores against every other domain. Sort these
scores. Calculate specificity and sensitivity
curves. Each domain D has a PSI-BLAST profile.
Compared performance at finding relatives at the
SCOP family, super-family, and fold levels.
For the family level, count TPs as family members.
For the super-family level, count TPs as
super-family members minus family members.
Likewise for the fold level
(fold members - SF - F).
Outperformed other methods at the family and
super-family levels; beaten by THREADER at the
fold level.
Several other methods were used to evaluate various
aspects of FUGUE.
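
The per-level true-positive sets are plain set differences. The sketch below assumes each domain carries full SCOP-style identifiers (so family and superfamily labels are unique across folds); the function name and the `labels` dictionary are made up for the example.

```python
def true_positive_set(query, labels, level):
    """labels: {name: (fold, superfamily, family)}.
    Returns the domains counted as true positives for this query."""
    q_fold, q_sf, q_fam = labels[query]
    others = {name for name in labels if name != query}
    fam = {n for n in others if labels[n][2] == q_fam}
    sf = {n for n in others if labels[n][1] == q_sf}
    fold = {n for n in others if labels[n][0] == q_fold}
    return {
        "family": fam,
        "superfamily": sf - fam,   # super-family members minus family members
        "fold": fold - sf,         # fold members minus super-family members
    }[level]


labels = {
    "q": ("a.1", "a.1.1", "a.1.1.1"),
    "x": ("a.1", "a.1.1", "a.1.1.1"),
    "y": ("a.1", "a.1.1", "a.1.1.2"),
    "z": ("a.1", "a.1.2", "a.1.2.1"),
    "w": ("b.1", "b.1.1", "b.1.1.1"),
}
```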

21
HMAP: Hybrid multidimensional alignment profiles

Big Picture: a series of hybrid multidimensional
alignment profiles that combine sequence,
secondary and tertiary structure information were
developed. The authors conclude from their
evaluation that secondary structure plays a more
significant part in remote homology detection than
3D structure.

22
HMAP: The setting
23
KD,NW HMAP Details

Selecting the correct subset of sequences and
structures:
No sequence should have greater than 40% sequence
identity to another
Structures with a resolution of at least 2.5 Angstrom
Structures with more than 50 SSEs are excluded
Sequence-based profiles are generated using
PSI-BLAST
Database is restricted by CD-HIT to 65% identity

24
KD,NW HMAP Details

Multiple structure-based sequence alignments and
structure-based profiles
For each template structure, get the closest
neighbors based on the PSD score (< 1).
The sequence of each of these neighbors is used as
a seed for PSI-BLAST.
Keep only those sequences above a certain
threshold.
Purged sequence alignments are combined with a
sequence using the multiple structure alignment as
a guide (SEMSA).
Get a similar PSSM -> 3D profile

25
KD,NW HMAP Details

Secondary Structure Profiles
Just like in the previous step, from the multiple
structure alignment create a secondary structure
profile where, instead of the 20 amino acids, we use
the three secondary structure elements.
Motifs
In regions where they are confident (core/motif)
about the structure from the alignments, they
incorporate structural information. In regions
where they are not confident (loops), they revert
to 1D info.

26
KD,NW HMAP(1,2) and HMAP(1,2,3)

HMAP(1,2): 1D profile + secondary structure
profiles
HMAP(1,2,3): interleaving of the 1D profile with the
3D profile based on motif information, + secondary
structure information.

27
KD,NW Query Profiles and Alignments

Query profiles are generated using the PSI-BLAST
profile and predicted secondary structure
information (predicted using JNET, PSIPRED and
PHD, averaged).
Simple DP, but the scoring is a dot product of the
profiles weighted by the secondary structure as
well as the conservation factors. (NEXT SLIDE)
Structure-specific gap penalties like FUGUE's,
but in this implementation the penalties are
based on the secondary structure profiles.

28
Similarity score for HMAP
29
KD,NW PSD Details
30
KD,NW PSD Details

a: number of SSEs for protein A
S(A,B): the score from the double dynamic
programming algorithm aligning the SSEs of the pair
of proteins A and B.
x and y are experimentally determined parameters.

31
HMAP Evaluation

HMAP was evaluated on the ranking, computing the
coverage vs. accuracy defined earlier; the 1D
profile was seen to rank better than PSI-BLAST.
The performance of HMAP(1,2) was better than
HMAP(1,2,3) at the super-family level and
comparable at the fold level. Can we conclude
whether 3D information adds anything significant
to the remote homology detection problem?

32
DR AGAPE: Predicting Without
Folds

Big Picture: augmenting profile information with
secondary structure and solvent accessibility.
They claim that the explicit 3D structural
information used in other schemes may not be
very helpful.
They build generalized profiles based on 1D
structure predictions (i.e. predictions of
secondary structure and solvent accessibility
using PROFphd). Some preliminary testing is done
on known structures using DSSP.
Six dimensions are added to the profile based on
the predictions, i.e. either buried or exposed and
helix, strand or other.

33
DR Generalized PSSM

GPSSM(i, j, 1D(i), 1D(j))
= r * PSSM(i, j) + (1 - r) * SM(1D(i), 1D(j))
GPSSM: Generalized Position-Specific Scoring Matrix
PSSM: from PSI-BLAST
SM: 1D substitution matrix (learnt by aligning
predicted-predicted or known-known or
combinations of 1D structures)
i: query position; j: database position
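
Read as a weighted mix of a sequence term and a 1D-structure term, the generalized score can be sketched as follows. The six state codes, the toy +1/-1 substitution function and the default value of r are assumptions for illustration; in AGAPE the SM matrix is learnt from alignments, and here PSSM(i, j) is taken to mean the query PSSM row i looked up at the database residue at position j.

```python
from itertools import product

# Six 1D states: (buried, exposed) x (helix, strand, other).
STATES = [b + s for b, s in product("be", "HEL")]


def toy_sm(state_a, state_b):
    """Placeholder 1D substitution score: +1 if the states agree, else -1."""
    return 1.0 if state_a == state_b else -1.0


def gpssm(pssm, db_seq, q_states, db_states, i, j, r=0.7, sm=toy_sm):
    """r * PSSM(i, j) + (1 - r) * SM(1D(i), 1D(j))."""
    seq_term = pssm[i][db_seq[j]]
    struct_term = sm(q_states[i], db_states[j])
    return r * seq_term + (1.0 - r) * struct_term


pssm = [{"A": 4.0, "G": -2.0}]  # one query position, toy scores
match = gpssm(pssm, "AG", ["bH"], ["bH", "eL"], 0, 0, r=0.5)
mismatch = gpssm(pssm, "AG", ["bH"], ["bH", "eL"], 0, 1, r=0.5)
```

Setting r = 1 recovers a plain PSSM score, while smaller values of r give more influence to agreement between the 1D structure states.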

34
DR Methods: Fold recognition
(SW-PSSM)

Obtaining and comparing 1D structures
O2O: Observed against Observed (DSSP)
P2O: Predicted (PROFphd) against Observed
P2P
P2P-Bis:
Predicted profile against predicted DB sequence
AND
Predicted profile sequence against DB profile

35
DR A simple DP solution
Image from: Rost, "Protein fold recognition by
prediction-based threading", Journal of Molecular
Biology (ISSN 0022-2836), 1997, vol. 270, iss. 3,
p. 471.
36
DR Statistical Significance

The scores are fitted to an extreme value
distribution and converted into a significance
estimate, which gives an improvement in the
ranking.
They introduced bidirectional scoring, which shows
improved performance.
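
As an illustration of turning raw scores into significance estimates, the sketch below fits a Gumbel (type I extreme value) distribution by the method of moments and computes the tail probability of a score. The fitting method is a simple stand-in chosen for the example, not necessarily the procedure used in AGAPE, and the function names are hypothetical.

```python
import math
import statistics

EULER_GAMMA = 0.5772156649015329


def fit_gumbel_moments(scores):
    """Method-of-moments estimates of the Gumbel location (mu) and scale
    (beta): var = (pi^2 / 6) * beta^2 and mean = mu + gamma * beta."""
    sd = statistics.stdev(scores)
    beta = sd * math.sqrt(6.0) / math.pi
    mu = statistics.mean(scores) - EULER_GAMMA * beta
    return mu, beta


def gumbel_p_value(score, mu, beta):
    """P(S >= score) under the fitted Gumbel distribution."""
    z = (score - mu) / beta
    return -math.expm1(-math.exp(-z))  # equals 1 - exp(-exp(-z))


mu, beta = fit_gumbel_moments([1.0, 2.0, 3.0, 4.0, 5.0])
```

Ranking hits by `gumbel_p_value` instead of the raw score makes scores from different queries more comparable, which is the motivation given on the slide for the improved ranking.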

37
DR (Good?) Prediction Errors

P2P does better without using observed structure
because the errors correlate,
i.e. 1D prediction mistakes correlate between
proteins with dissimilar sequences but similar
structures.
Meanings/implications (?)

38
DR Results

Using 1D data is:
Always better than sequence-only methods (BLAST)
Currently better than some methods that use 3D
information
However, in the long run, 3D-including methods
will probably be better
Slow: uses DP
Why? All the other methods also use some
sort of DP.

39
Statistical Significance

FUGUE: z-score
HMAP: mixed distribution fit
The scores are fitted to a mixed distribution
which was estimated empirically by
shuffling amino acid profiles and aligning
shuffled sequences to the original with the
positions of secondary structure elements fixed.
3D-PSSM
The significance of a match was evaluated by
fitting a linear relationship between log(number
of hits up to a score) and log(total score).
AGAPE: Extreme Value Distribution
The significance of a match was evaluated by
fitting it to an extreme value distribution.