Fold Prediction - PowerPoint PPT Presentation

Provided by: Huz4

Transcript and Presenter's Notes

Title: Fold Prediction


1
Fold Prediction
  • Huzefa Rangwala
  • (rangwala_at_cs.umn.edu)

2
Problem Definition
  • Hierarchy
  • According to the SCOP database, proteins are
    classified to reflect both structural and
    evolutionary relationships
  • Same family -> clear evolutionary relationship
  • Same super-family -> probably common evolutionary
    origin
  • Same fold -> major structural similarity
  • Fold Prediction
  • To recognize similarity at all three of the above
    levels by defining independent problems.
  • Sequence-Structure Homology Recognition
  • To recognize similarity at only the family and
    super-family levels
  • Does this imply that at the fold prediction level
    structural information will play a significant
    part compared to just sequence information?

3
Techniques (Groups)
  • Pairwise sequence comparisons recognize close
    homologs
  • Profile/HMM methods that rely only on sequence
    information are able to capture distant homolog
    relationships
  • Second scheme: adding some structural information
    improves the accuracy
  • E.g. FUGUE, HMAP, 3D-PSSM (today's focus)
  • Threading methods (sequence comparisons with a
    structural template) show significant
    performance at the fold level

4
Schemes to be discussed
  • 3D-PSSM
  • FUGUE
  • HMAP
  • AGAPE

5
Evaluation: Ranking
  • Each of the schemes gives us a similarity score
    between two protein domains whose relationship we
    are trying to determine, i.e. whether they are
    part of the same family, superfamily or fold.
  • We sort the domain pairs by similarity score and
    use the resulting ranking to count true positives
    and false positives. From these two statistics we
    compute the ROC, coverage, specificity or
    sensitivity.
  • Defining true positives and false positives
  • TP: at the superfamily level, domain pairs that
    share the same superfamily but belong to
    different families. (Same-family relationships
    are too trivial to predict.)
  • FP: pairs from different folds are considered
    FPs. Pairs that share the same fold but have
    different superfamilies are ignored.
  • Similar definitions hold for the other benchmark
    levels
  • Different schemes use this same strategy but
    restrict the pairs they compare. HMAP and 3D-PSSM
    compute the statistics over one overall ranking,
    whereas FUGUE defines these statistics per test
    domain.
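
The superfamily-level labelling and ranking described above can be sketched in Python. This is a minimal illustration, not code from any of the four methods: the (fold, superfamily, family) tuple encoding and the function names are assumptions made for the example.

```python
def label_superfamily_pair(a, b):
    """Label a domain pair for the superfamily-level benchmark.

    a, b: (fold, superfamily, family) tuples (hypothetical encoding).
    """
    same_fold = a[0] == b[0]
    same_superfamily = same_fold and a[1] == b[1]
    same_family = same_superfamily and a[2] == b[2]
    if same_superfamily and not same_family:
        return "TP"
    if not same_fold:
        return "FP"
    # Same family, or same fold but different superfamily: left out.
    return "ignore"


def tp_fp_curve(scored_pairs):
    """Walk the pairs from best to worst score, counting (FP, TP) so far."""
    tp = fp = 0
    curve = []
    for _, label in sorted(scored_pairs, key=lambda p: p[0], reverse=True):
        if label == "ignore":
            continue
        if label == "TP":
            tp += 1
        else:
            fp += 1
        curve.append((fp, tp))
    return curve


d1 = ("a.1", "a.1.1", "a.1.1.1")
d2 = ("a.1", "a.1.1", "a.1.1.2")  # same superfamily, different family
d3 = ("b.1", "b.1.1", "b.1.1.1")  # different fold
d4 = ("a.1", "a.1.2", "a.1.2.1")  # same fold, different superfamily

curve = tp_fp_curve([(0.9, "TP"), (0.5, "FP"), (0.7, "ignore"), (0.3, "TP")])
```

Each point on the curve is one step of a ROC-style plot; coverage and specificity follow from the same running counts.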

6
Evaluation: Alignment Accuracy
  • HMAP and AGAPE discuss evaluation based on
    comparison to structural alignments.
  • HMAP uses the PrISM structural alignment between
    pairs as the gold standard.
  • Two measures are evaluated: full accuracy and core
    accuracy (positions within the secondary
    structure regions).

7
SS 3D-PSSM (Kelley et al.)
  • Goal: enhance fold detection by using profiles
    that leverage structural similarity.
  • Overview
  • Other fold-detection methods compute the distance
    between a query sequence (unknown fold) and a
    library sequence (known fold) using a standard
    PSSM.
  • 3D-PSSM uses extra information:
  • Solvation potentials per residue
  • Secondary structure
  • A 3D-PSSM obtained from a super-PSSM calculated
    from a structural alignment between sequences in
    the super-family

8
SS 3D-PSSM (residue features)
  • Residue features and scoring functions between
    query and library sequences
  • Degree of burial: the score is the per-residue
    solvation potential (frequency of occurrence of
    an amino acid with a specific degree of burial
    relative to all amino acid types with this degree
    of burial).
  • Secondary structure: as calculated by STRIDE
    for the library, PSIPRED for the query. Score is
    1 for matching elements, -1 for mismatches.
  • 1D / 3D-PSSM (next slide): the score is a profile
    to profile score, as in PSI-BLAST.

9
3D Profile: What's going on?

10
SS 3D-PSSM (building 3D-PSSM)
  • 3D-PSSM for a master sequence (built offline)
  • Perform a 3D alignment using SAP (Orengo, 1992)
    between the master sequence and all sequences in
    the superfamily.
  • Select similar structures (RMS < 6 Å)
  • Create a structural MSA
  • Start with the sequence closest to the master
  • Iteratively align the sequence structurally
    closest to any of the sequences already in the
    MSA
  • The 3D-PSSM is the cumulative PSSM made by
    combining all aligned PSSMs.
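
The order in which structures join the structural MSA can be sketched as a greedy loop. This is only an illustration of the ordering rule on the slide: the pairwise RMS matrix, the function name and the choice to apply the 6 Å cutoff against the master are assumptions, and the structural alignment itself (SAP) is not modelled.

```python
def msa_join_order(rms, master=0, cutoff=6.0):
    """Return the order in which structures are added to the MSA.

    rms: symmetric matrix of pairwise RMS deviations in Angstrom
         (hypothetical input; SAP would supply these in practice).
    """
    # Keep only structures similar enough to the master.
    candidates = {
        s for s in range(len(rms)) if s != master and rms[master][s] < cutoff
    }
    in_msa = [master]
    order = []
    while candidates:
        # Next: the candidate closest to ANY structure already in the MSA.
        # On the first pass the MSA holds only the master, so this picks
        # the structure closest to the master.
        nxt = min(candidates, key=lambda s: (min(rms[s][t] for t in in_msa), s))
        order.append(nxt)
        in_msa.append(nxt)
        candidates.remove(nxt)
    return order


rms = [
    [0.0, 2.0, 5.5, 4.0, 8.0],
    [2.0, 0.0, 1.0, 5.0, 7.0],
    [5.5, 1.0, 0.0, 3.0, 6.5],
    [4.0, 5.0, 3.0, 0.0, 7.0],
    [8.0, 7.0, 6.5, 7.0, 0.0],
]
order = msa_join_order(rms)
```

In this toy matrix structure 2 joins before structure 3 even though 3 is closer to the master, because 2 is very close to structure 1, which is already aligned. Structure 4 is dropped by the cutoff.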

11
SS 3D-PSSM (searching)
  • For a given query, compute the score to every
    library sequence
  • Use a global DP alignment with pairwise scores:
  • S(X_i, Y_j) = S_pssm(X_i, Y_j)
    + S_secondary(X_i, Y_j) + S_solvation(X_i, Y_j)
  • Perform three different DP passes
  • Library sequence is matched to query PSSM
  • Query sequence is matched to library 1D-PSSM
  • Query sequence is matched to library 3D-PSSM
  • Take the maximum score from the three passes
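
A global DP over a combined pairwise score, followed by taking the best of several passes, might look like the sketch below. The linear gap penalty, its value, and the precomputed score matrices are assumptions for illustration; the slide does not say how 3D-PSSM treats gaps.

```python
def combined_score(pssm, ss_query, ss_library, solvation):
    """Build S(X_i, Y_j) = S_pssm + S_secondary + S_solvation.

    pssm, solvation: matrices indexed [i][j] with precomputed scores
    (hypothetical inputs). Secondary structure scores +1 / -1.
    """
    def score(i, j):
        secondary = 1.0 if ss_query[i] == ss_library[j] else -1.0
        return pssm[i][j] + secondary + solvation[i][j]
    return score


def global_alignment_score(n, m, pair_score, gap=-2.0):
    """Needleman-Wunsch score with a linear gap penalty (assumed)."""
    dp = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = i * gap
    for j in range(1, m + 1):
        dp[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            dp[i][j] = max(
                dp[i - 1][j - 1] + pair_score(i - 1, j - 1),  # match
                dp[i - 1][j] + gap,                           # gap in library
                dp[i][j - 1] + gap,                           # gap in query
            )
    return dp[n][m]


def best_of_passes(n, m, scorers, gap=-2.0):
    """Run one DP pass per scoring function and keep the maximum."""
    return max(global_alignment_score(n, m, s, gap) for s in scorers)


zeros = [[0.0] * 3 for _ in range(3)]
same = combined_score(zeros, "HHE", "HHE", zeros)
diff = combined_score(zeros, "HHE", "HEE", zeros)
best = best_of_passes(3, 3, [same, diff])
```

With all PSSM and solvation terms set to zero, only the secondary structure term contributes, which makes the toy scores easy to check by hand.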

12
SS 3D-PSSM (evaluation)
  • Now that we have a score for a sequence pair, we
    need to determine its significance.
  • We do this by fitting a linear function to
    log(number of hits up to a score) vs.
    log(sum of all scores)
  • Effectiveness
  • 136 sequences in the test set met the criteria:
  • PSI-BLAST assignment failed
  • There was homology to a protein with a 3D
    profile
  • Only one test sequence was kept per 3D profile.
  • 18 of the 136 sequences were correctly classified

13
CK FUGUE: Remote
Homology Detection using Known Structure
Information
  • Big Picture: FUGUE constructs structural profiles
    for families of known structures. Each profile is
    converted into a scoring matrix (like PSI-BLAST)
    which is used to align the sequence. Z-scores:
    random alignments are conducted to obtain mean
    scores for a structural profile, and a query
    sequence is considered well aligned if its
    score is significantly higher than the mean.
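
A Z-score of this kind can be sketched as follows. How FUGUE generates its random alignments is not stated on the slide; shuffling the query sequence is an assumption here, and score_fn stands in for "align the sequence to the profile and return the score".

```python
import random
import statistics


def z_score(query, score_fn, n_shuffles=100, seed=0):
    """Z-score of the query's score against shuffled versions of itself.

    score_fn: any function mapping a sequence string to an alignment
    score (hypothetical stand-in for a profile alignment).
    """
    rng = random.Random(seed)
    real = score_fn(query)
    residues = list(query)
    random_scores = []
    for _ in range(n_shuffles):
        rng.shuffle(residues)
        random_scores.append(score_fn("".join(residues)))
    mean = statistics.mean(random_scores)
    sd = statistics.pstdev(random_scores)
    # No spread in the random scores means no basis for a Z-score.
    return (real - mean) / sd if sd > 0 else 0.0


# Toy scoring function: number of positions identical to a template.
template = "ACDEFGHIKL"


def matches(seq):
    return sum(a == b for a, b in zip(seq, template))


z = z_score(template, matches)
```

The toy query is identical to the template, so its score sits far above the shuffled scores and the Z-score comes out large and positive.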

14
CK Basics of FUGUE
Alignment
  • Start with a collection of structure alignments:
    the source is a pruned version of HOMSTRAD with
    177 families (aligned), 706 total structures
  • Want to build a scoring matrix based on the
    structures in a family
  • Count substitutions between amino acids in
    (structurally) aligned positions in the families
  • Catch: when residues A and B appear in aligned
    positions, don't just increment the counts; also
    consider the environment in which the two appear.

15
CK Environment
  • Three categories of environments were considered,
    each with several classes.
  • Secondary structure: alpha-helix, beta-strand,
    irregular structure (coil), or positive phi
    main-chain angle (does this mean main chain, i.e.
    no secondary structure??) - 4 classes
  • Solvent accessibility: 7% or greater is
    accessible, otherwise inaccessible - 2 classes
  • Hydrogen bonds: combination of 3 true/false
    conditions (side-chain to side-chain or not,
    side-chain to main-chain NH or not, side-chain to
    main-chain CO or not) - 8 classes
  • An environment is a specification of a class for
    each of these 3 categories, for a total of 64
    environments. Each residue in the structural
    families fits into one of these environments.
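
The count of 64 environments is the product 4 x 2 x 8, which can be checked by enumerating the combinations. The class labels below are illustrative names, not identifiers from FUGUE, and the 7% cutoff is read as a relative accessibility percentage.

```python
from itertools import product

SECONDARY_STRUCTURE = ("helix", "strand", "coil", "positive_phi")  # 4 classes
ACCESSIBILITY = ("accessible", "inaccessible")                     # 2 classes
# Three independent yes/no hydrogen-bond conditions: side-chain to
# side-chain, side-chain to main-chain NH, side-chain to main-chain CO.
HYDROGEN_BONDS = list(product((False, True), repeat=3))            # 8 classes

ENVIRONMENTS = list(product(SECONDARY_STRUCTURE, ACCESSIBILITY, HYDROGEN_BONDS))


def accessibility_class(relative_accessibility_percent):
    """7% or greater counts as accessible, otherwise inaccessible."""
    if relative_accessibility_percent >= 7:
        return "accessible"
    return "inaccessible"
```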

16
CK Substitution Matrix Formulation
  • Construct a substitution matrix of log-odds
    scores: each entry is log(P(B|A,E) / q_B), where
    q_B is the background probability of B
  • The background probability is not taken from a
    BLOSUM matrix; it is calculated from the
    occurrence of a residue in the structure family
    (eqn 3??)
  • Entries are actually a weighted combination of
    the above log-odds score and the score from an 'a
    priori' distribution calculated based on another
    study
  • Some residues are excluded (masked) from the
    substitution calculation: domain-domain
    interaction residues and residues interacting
    with heteroatoms(??)
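
The plain log-odds part of such a table can be sketched as below. This is a simplified illustration: it omits the weighting with the a priori distribution and the masking of residues, and the (A, E, B) observation format and function name are assumptions.

```python
import math
from collections import Counter, defaultdict


def log_odds_table(observations, background):
    """Environment-specific log-odds scores log(P(B|A,E) / q_B).

    observations: iterable of (A, E, B) triples, one per aligned pair
                  where residue A in environment E is aligned with B.
    background:   dict mapping residue B to its background probability.
    """
    counts = defaultdict(Counter)
    for a, env, b in observations:
        counts[(a, env)][b] += 1

    table = {}
    for (a, env), row in counts.items():
        total = sum(row.values())
        for b, n in row.items():
            p_b_given_a_env = n / total
            table[(a, env, b)] = math.log(p_b_given_a_env / background[b])
    return table


obs = [("A", "helix", "A")] * 3 + [("A", "helix", "G")]
table = log_odds_table(obs, {"A": 0.5, "G": 0.5})
```

In the toy data, A in a helix is conserved three times out of four, so the A-to-A score is positive and the A-to-G score is negative.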

17
CK Example of Substitution Matrix Calc

18
CK Profile Calculation
  • Calculate a weight for each structure based on
    the sum of its dissimilarity to all other
    structures divided by the total dissimilarity of
    all structures in the family (dissimilarity is
    fraction of identical residues (??))
  • The scoring matrix has as many columns as the
    longest structure and one row per amino acid
    type. Entries in the scoring matrix are
    calculated as follows:
  • To calculate the entry at column P (position) and
    row B (amino acid), do the following for every
    structure: its amino acid at position P is A in
    environment E; add to the scoring matrix entry
    the weight of the structure multiplied by
    S(A,E->B), a score from the substitution matrix.
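
The weighting and the per-column accumulation can be sketched as follows. The normalisation (weights summing to 1), the gap handling and the input formats are assumptions made for the example; the substitution scores are toy values.

```python
def structure_weights(dissimilarity):
    """Weight of each structure: its summed dissimilarity to the others
    divided by the total over all structures (assumed normalisation)."""
    row_sums = [sum(row) for row in dissimilarity]
    total = sum(row_sums)
    return [r / total for r in row_sums]


def profile_column(column, weights, subst, alphabet):
    """Scores for one alignment position P, one value per residue B.

    column: one entry per structure, either (A, E) for residue A in
            environment E, or None for a gap (gaps are skipped here).
    subst:  dict mapping (A, E, B) to the substitution score S(A,E->B).
    """
    scores = {b: 0.0 for b in alphabet}
    for entry, weight in zip(column, weights):
        if entry is None:
            continue
        a, env = entry
        for b in alphabet:
            scores[b] += weight * subst[(a, env, b)]
    return scores


weights = structure_weights([[0, 1, 1], [1, 0, 2], [1, 2, 0]])
subst = {
    ("A", "h", "A"): 2.0, ("A", "h", "G"): -1.0,
    ("G", "h", "A"): -1.0, ("G", "h", "G"): 3.0,
}
col = profile_column([("A", "h"), ("G", "h"), None], weights, subst, "AG")
```

Structures that are more dissimilar to the rest of the family get larger weights, so a family is not dominated by a cluster of near-identical members.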

19
CK Gap Penalty Calculations
  • Gap penalties are modified based on the secondary
    structure of the position
  • Insertions/deletions are highly penalized in the
    middle of a secondary structure region and
    lightly penalized in unordered (coil) regions
  • The overall penalty at a position is determined
    by a weighted sum of the penalties for each
    structure at that position
  • Penalty scores (high, low, very low, etc.) are
    determined numerically from random simulation
  • Position-specific gap scores are calculated for
    each family, modified by solvent accessibility
    (??)

20
CK Evaluation
  • Used a benchmark defined in Shi et al.: 976
    domains from SCOP.
  • For each domain D in the test set, run FUGUE
    with D as the query sequence to generate a list
    of scores against every other domain. Sort these
    scores. Calculate specificity and sensitivity
    curves. Each domain D has a PSI-BLAST profile.
  • Compared performance at finding relatives at the
    SCOP family, super-family, and fold levels
  • For family level, count TPs as family members.
    For super-family level, count TPs as super-family
    members minus family members. Likewise for fold
    level (fold members minus super-family and family
    members)
  • Outperformed other methods on family and
    super-family, beaten by THREADER at fold level.
  • Several other methods used to evaluate various
    aspects of FUGUE

21
HMAP: Hybrid multidimensional alignment profiles
  • Big Picture: a series of hybrid multidimensional
    alignment profiles that combine sequence,
    secondary and tertiary structure information were
    developed. From their evaluation they conclude
    that secondary structure plays a more significant
    part in remote homology detection than 3D
    structure.

22
HMAP: The setting

23
KD,NW HMAP Details
  • Selecting the correct subset of sequences and
    structures
  • No two sequences should share more than 40%
    sequence identity
  • Structures of at least 2.5 Å resolution
  • Structures with more than 50 SSEs are excluded
  • Sequence-based profiles are generated using
    PSI-BLAST
  • Database is restricted by CD-HIT to 65% identity

24
KD,NW HMAP Details
  • Multiple structure-based sequence alignments and
    structure-based profiles
  • For each template structure, get the closest
    neighbors based on the PSD score (< 1).
  • The sequence of each of these neighbors is used
    as a seed for PSI-BLAST.
  • Keep only those sequences above a certain
    threshold
  • Purged sequence alignments are combined with a
    sequence using the multiple structure alignment
    as a guide (SEMSA).
  • Get a similar PSSM -> 3D profile

25
KD,NW HMAP Details
  • Secondary Structure Profiles
  • Just like in the previous step from the multiple
    structure alignment, create a secondary profile
    where instead of the 20 amino acids we use the
    three secondary structure elements
  • Motifs
  • In regions where they are confident (core/motif)
    about the structure from the alignments, they
    incorporate structural information. In regions
    where they are not confident (loops), they revert
    to 1D info.
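
A secondary structure profile over three states can be sketched as per-column frequencies. The H/E/C state letters, the "-" gap symbol and the use of raw frequencies are assumptions for illustration; HMAP's actual profile construction may differ.

```python
def secondary_structure_profile(aligned_ss, states="HEC"):
    """Per-column state frequencies from aligned secondary structure
    strings (one string per structure, '-' for gaps)."""
    length = len(aligned_ss[0])
    profile = []
    for pos in range(length):
        # Ignore gaps and any symbol outside the three states.
        column = [s[pos] for s in aligned_ss if s[pos] in states]
        n = len(column)
        profile.append(
            {state: (column.count(state) / n if n else 0.0) for state in states}
        )
    return profile


ss_profile = secondary_structure_profile(["HHE", "HEE", "H-E"])
```

Each column is the three-state analogue of a 20-letter amino acid profile column.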

26
KD,NW HMAP(1,2) and HMAP(1,2,3)
  • HMAP(1,2): 1D profile + secondary structure
    profiles
  • HMAP(1,2,3): interleaving of the 1D profile with
    the 3D profile based on motif information, +
    secondary structure information.

27
KD,NW Query Profiles and Alignments
  • Query profiles are generated using the PSI-BLAST
    profile and predicted secondary structure
    information (predicted using JNET, PSIPRED and
    PHD, averaged).
  • Simple DP, but the scoring is a dot product of the
    profiles weighted by the secondary structure as
    well as the conservation factors. (NEXT SLIDE)
  • Structure-specific gap penalties like FUGUE's,
    but implemented so that the penalties are based
    on the secondary structure profiles.

28
Similarity score for HMAP

29
KD,NW PSD Details

30
KD,NW PSD Details
  • a: number of SSEs in protein A
  • S(A,B): the score from the double dynamic
    programming algorithm aligning the SSEs of the
    pair of proteins A and B.
  • x and y are experimentally determined parameters

31
HMAP Evaluation
  • HMAP was evaluated on the ranking, computing the
    coverage vs. accuracy defined earlier; the 1D
    profile ranked better than PSI-BLAST
  • The performance of HMAP(1,2) was better than
    HMAP(1,2,3) at the super-family level and
    comparable at the fold level. Can we conclude
    whether 3D information adds anything significant
    to the remote homology detection problem?

32
DR AGAPE: Predicting Without
Folds
  • Big Picture: augmenting profile information with
    secondary structure and solvent accessibility.
    They claim that the explicit 3D structural
    information used in other schemes may not be
    very helpful
  • They build generalized profiles based on 1D
    structure predictions (i.e. prediction of
    secondary structure and solvent accessibility
    using PROFphd). Some preliminary testing is done
    on known structures using DSSP
  • Six dimensions are added to the profile based on
    the predictions, i.e. either buried or exposed,
    and helix, strand or other

33
DR Generalized PSSM
  • GPSSM(i, j, 1D(i), 1D(j))
  • = r * PSSM(i, j) + (1 - r) * SM(1D(i), 1D(j))
  • GPSSM: Generalized Position-Specific Scoring
    Matrix
  • PSSM: from PSI-BLAST
  • SM: 1D substitution matrix (learnt by aligning
    predicted-predicted, known-known, or
    combinations of 1D structures)
  • i: query position; j: database position
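
The mixing of the two terms is a simple weighted sum, sketched below. The six-state encoding and the state labels are illustrative assumptions based on the "six dimensions" mentioned earlier; the numeric scores are toy values, not values from AGAPE.

```python
# Assumed 1D states: (secondary structure, accessibility) pairs.
STATES_1D = [
    (ss, acc)
    for ss in ("helix", "strand", "other")
    for acc in ("buried", "exposed")
]


def gpssm_score(pssm_value, sm_value, r):
    """GPSSM = r * PSSM(i, j) + (1 - r) * SM(1D(i), 1D(j)).

    pssm_value: sequence profile score for query position i against
                database position j.
    sm_value:   1D substitution score for the two positions' 1D states.
    r:          mixing weight between 0 and 1.
    """
    return r * pssm_value + (1.0 - r) * sm_value


mixed = gpssm_score(3.0, 2.0, 0.5)
```

Setting r to 1 recovers the plain PSSM score, and r equal to 0 scores on 1D structure alone.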

34
DR Methods: Fold recognition
(SW-PSSM)
  • Obtaining and comparing 1D structures
  • O2O: Observed against Observed (DSSP)
  • P2O: Predicted (PROFphd) against Observed
  • P2P
  • P2P-Bis
  • Predicted profile against predicted DB sequence
    AND
  • Predicted profile sequence against DB profile

35
DR A simple DP solution
Image from: Rost (1997), Protein fold recognition by
prediction-based threading. Journal of Molecular
Biology (ISSN 0022-2836), vol. 270, iss. 3, p. 471

36
DR Statistical Significance
  • Fitting the scores to an extreme value
    distribution and modifying the scores to get a
    significance estimate thereby gaining an
    improvement in the ranking.
  • They introduced bidirectional scoring which shows
    improved performance
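
Fitting scores to an extreme value (Gumbel) distribution and converting a score into a significance estimate can be sketched as below. The method-of-moments fit is a stand-in chosen for brevity; the slide does not say which fitting procedure AGAPE uses.

```python
import math
import statistics

EULER_GAMMA = 0.5772156649015329


def fit_gumbel(scores):
    """Method-of-moments estimate of Gumbel location mu and scale beta."""
    beta = statistics.stdev(scores) * math.sqrt(6) / math.pi
    mu = statistics.mean(scores) - EULER_GAMMA * beta
    return mu, beta


def p_value(score, mu, beta):
    """P(S >= score) under the fitted Gumbel distribution.

    Uses expm1 so that very small p-values keep their precision.
    """
    z = (score - mu) / beta
    return -math.expm1(-math.exp(-z))


background_scores = [10, 12, 11, 13, 9, 14, 10, 11]
mu, beta = fit_gumbel(background_scores)
p_high = p_value(30, mu, beta)
p_typical = p_value(12, mu, beta)
```

Ranking hits by p-value rather than raw score is one way a significance estimate can change the ordering, since the fitted parameters differ from query to query.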

37
DR (Good?) Prediction Errors
  • P2P does better without using observed structure
    because the errors correlate
  • i.e., 1D prediction mistakes correlate between
    proteins with dissimilar sequences but similar
    structures
  • Meanings/implications (?)

38
DR Results
  • Using 1D data is:
  • Always better than sequence-only methods (BLAST)
  • Currently better than some methods that use 3D
    information
  • However, in the long run, 3D-including methods
    will probably be better
  • Slow: uses DP
  • Why? All the other methods also use some sort of
    DP

39
Statistical Significance
  • FUGUE: Z-score
  • HMAP: Mixed Distribution Fit
  • The scores are fitted to a mixed distribution
    which was estimated empirically by shuffling
    amino acid profiles and aligning the shuffled
    sequences to the original, with the positions of
    secondary structure elements fixed.
  • 3D-PSSM
  • The significance of a match was evaluated by
    fitting a linear relationship between log(number
    of hits up to a score) and log(total score)
  • AGAPE: Extreme Value Distribution
  • The significance of a match was evaluated by
    fitting it to an extreme value distribution.