Using the Fisher kernel method to detect remote protein homologies


1
Using the Fisher kernel method to detect remote
protein homologies
  • Tommi Jaakkola, Mark Diekhans, David Haussler
  • ISMB 99
  • Summarized by O, Jangmin (2001/09/22)

2
Abstract
  • Detecting remote protein homologies
  • Fisher kernel method
  • Variant of Support Vector Machines using a new
    kernel function
  • Derived from Hidden Markov Models

3
Introduction (1)
  • Detecting protein homologies (sequence-based
    algorithms)
  • BLAST, FASTA, PROBE, templates, profiles,
    position-specific weight matrices, HMMs
  • Comparisons by (Brenner 1996; Park et al. 1998)
  • SCOP classification of protein structures
  • Remote protein homologies exist between
    protein domains in the same structural
    superfamily.
  • Statistical models like PSI-BLAST and HMMs are
    better than simple pairwise comparison methods,
    but still miss many important remote homologies.

4
Introduction (2)
  • Generative statistical models (HMMs)
  • Extracting features from protein sequences
  • Mapping all protein sequences to points in a
    Euclidean feature space of fixed dimension.
  • A general discriminative statistical method
    classifies the points.
  • Improvement obtained over HMMs alone

5
Methods
  • How generative models (HMMs) work
  • Training examples: sequences known to be members
    of the protein family (positive examples)
  • Parameters are tuned with a priori knowledge.
  • The model assigns a probability to any given protein
    sequence.
  • Sequences from the family yield a higher
    probability than sequences from outside the family.
  • The log-likelihood ratio against a null model is
    used as the score.

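
The log-likelihood ratio score can be illustrated with a minimal sketch. It assumes a toy three-letter alphabet and replaces the HMM by a simple independent-residue model, so the probabilities and the function names (log_likelihood, log_odds_score) are illustrative only and not taken from the paper.

```python
import math

def log_likelihood(seq, probs):
    """Log-probability of seq under an independent-residue model.

    This stands in for the HMM likelihood; a real profile HMM would
    sum over state paths with the forward algorithm instead.
    """
    return sum(math.log(probs[residue]) for residue in seq)

def log_odds_score(seq, family_probs, null_probs):
    """Log-likelihood ratio of the family model against the null model."""
    return log_likelihood(seq, family_probs) - log_likelihood(seq, null_probs)

# Hypothetical residue distributions over a toy alphabet.
family_model = {"A": 0.6, "C": 0.3, "G": 0.1}
null_model = {"A": 1 / 3, "C": 1 / 3, "G": 1 / 3}

member_score = log_odds_score("AACA", family_model, null_model)
outsider_score = log_odds_score("GGCG", family_model, null_model)
```

A positive score means the family model explains the sequence better than the null model; a negative score means the opposite.
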
6
Discriminative approaches
  • Use both positive and negative examples.
  • Parameters are tuned so that the model
    optimally discriminates members of the family from
    nonmembers.
  • The likelihood ratio is optimal if the generative
    model fits the data perfectly, but when training
    examples are few, discriminative methods often
    perform better.

7
Kernel methods
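  • Discriminant function
    L(X) = \sum_{i: X_i \in H_1} \lambda_i K(X_i, X) - \sum_{i: X_i \in H_2} \lambda_i K(X_i, X)
  • where X_i, i = 1, ..., n, are the training examples
    and H_1, H_2 are the hypothesis classes
  • H_1 (+): sequences of the family; H_2 (-): sequences
    outside of the family
  • Contribution of the kernel
  • \lambda_i: overall importance of the example X_i
  • K(X_i, X): measure of pairwise similarity
  • The user supplies the type of kernel for the
    application area.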
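
A kernel discriminant of this form can be sketched in a few lines. The sketch assumes examples are already fixed-length vectors, uses a plain dot product as the user-supplied kernel, and sets all weights to 1.0 by hand; in practice the weights would be learned, for example by an SVM. All names and numbers are hypothetical.

```python
def dot_kernel(x, y):
    """A simple user-supplied kernel: the dot product of two vectors."""
    return sum(a * b for a, b in zip(x, y))

def discriminant(x, examples, labels, weights, kernel=dot_kernel):
    """L(X) = sum_i y_i * lambda_i * K(X_i, X), with y_i = +1 or -1.

    The sign of the result assigns x to the family (+) or to the
    outside class (-).
    """
    return sum(y * lam * kernel(x_i, x)
               for x_i, y, lam in zip(examples, labels, weights))

# Toy training set: two family members and two non-members.
examples = [(1.0, 0.0), (0.9, 0.2), (-1.0, 0.0), (-0.8, -0.1)]
labels = [+1, +1, -1, -1]
weights = [1.0, 1.0, 1.0, 1.0]  # assumed; not learned here

score_near_family = discriminant((1.0, 0.1), examples, labels, weights)
score_near_others = discriminant((-0.9, 0.0), examples, labels, weights)
```

Passing the kernel as an argument mirrors the point that the kernel is the part chosen for the application area.
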

8
The Fisher kernel (1)
  • Deriving the kernel function from generative models
  • Advantage 1: handles variable-length protein
    sequences
  • Advantage 2: encodes prior knowledge about
    protein sequences
  • Difference from HMMs
  • The kernel function specifies a similarity score for
    any pair of sequences.
  • The likelihood score from an HMM only measures the
    closeness of the sequence to the model itself.

9
The Fisher kernel (2)
  • Sufficient statistics
  • For each parameter in the HMM: posterior frequencies
  • of taking a particular transition
  • of generating one of the residues of the query
    sequence
  • They reflect the process of generating the query
    sequence from the HMM.
  • Alternative to the sufficient statistics: the Fisher
    score, U_X = \nabla_\theta \log P(X \mid H_1, \theta)
  • Magnitude of the components: how much each parameter
    contributes to generating the query sequence
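
To make the Fisher score concrete, here is a minimal sketch for the simplest possible case: a one-state model that emits residues independently. This is an assumption made for brevity, not the profile HMM of the paper. In that case the posterior frequencies reduce to plain residue counts, and each gradient component takes the form count / probability minus the sequence length, which is one common way of accounting for the sum-to-one constraint on the emission probabilities.

```python
from collections import Counter

ALPHABET = "ACG"  # toy alphabet, assumed for illustration

def fisher_score(seq, theta):
    """Gradient of log P(seq | theta) for a one-state emission model.

    theta maps each residue to its emission probability. The result
    has one component per parameter, so its length depends on the
    model and not on the length of seq.
    """
    counts = Counter(seq)
    n = len(seq)
    return [counts[a] / theta[a] - n for a in ALPHABET]

theta = {"A": 0.5, "C": 0.25, "G": 0.25}  # hypothetical trained parameters

short_score = fisher_score("AAC", theta)
long_score = fisher_score("AACGGAAC", theta)
```

Both sequences map to vectors with three components, which is what allows sequences of different lengths to be compared in one feature space.
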

10
The Fisher kernel (3)
  • Kernel function used in this paper
  • Note that the Fisher score U_X is a fixed-length vector.
  • Summary
  • Train an HMM with positive examples.
  • Map each new protein sequence X into a fixed-length
    vector, its Fisher score.
  • Calculate the kernel function.
  • Get the resulting discriminant function (SVM-Fisher).
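
The kernel step of the summary can be sketched as follows, assuming a Gaussian kernel on the Fisher score vectors. The slide does not reproduce the exact kernel formula, so the Gaussian form, the width parameter sigma and the example vectors are assumptions for illustration.

```python
import math

def gaussian_kernel(u_x, u_y, sigma):
    """Similarity of two Fisher score vectors: exp(-||u_x - u_y||^2 / (2 sigma^2))."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(u_x, u_y))
    return math.exp(-squared_distance / (2.0 * sigma ** 2))

def gram_matrix(score_vectors, sigma):
    """Pairwise kernel values for a list of Fisher score vectors."""
    return [[gaussian_kernel(u, v, sigma) for v in score_vectors]
            for u in score_vectors]

# Hypothetical Fisher scores of three sequences under one model.
fisher_scores = [
    [1.0, 1.0, -3.0],
    [1.5, 0.5, -3.0],
    [-3.0, 1.0, 5.0],
]
K = gram_matrix(fisher_scores, sigma=2.0)
```

A matrix like K is the input an SVM trainer needs in order to find the example weights of the discriminant function.
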

11
The Fisher kernel (4)
  • Combination of scores
  • There might be more than one HMM for the
    family or superfamily of interest.
  • Average score
  • Maximum score
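
The two combination rules are simple enough to show directly. The sketch below assumes each model has already produced one score for the query sequence and that higher scores are better; the score values are made up.

```python
def combine_average(scores):
    """Average of the scores given by the individual models."""
    return sum(scores) / len(scores)

def combine_maximum(scores):
    """Best score given by any single model."""
    return max(scores)

# Hypothetical scores of one query sequence under three models.
model_scores = [0.2, 1.4, -0.6]

average_score = combine_average(model_scores)
maximum_score = combine_maximum(model_scores)
```

The maximum rule lets a single well-matching model decide, while the average rule rewards sequences that score reasonably under all models.
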

12
Experimental Methods
  • Methods
  • SVM-Fisher (this paper)
  • BLAST (Altschul et al. 1990; Gish & States 1993)
  • HMMs using the SAM-T98 methodology (Park et al. 1998;
    Karplus, Barrett & Hughey 1998; Hughey & Krogh
    1995, 1996)
  • Measurement of the recognition rate for members of
    superfamilies of the SCOP protein structure
    classification (Hubbard et al. 1997)
  • Withhold all members of one SCOP family.
  • Train with the remaining members of the SCOP
    superfamily.
  • Test with the withheld data.
  • Question: could the method discover a new family
    of a known superfamily?
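
The withholding scheme for positive examples can be sketched as a small split function. The record layout (name, superfamily, family) and all labels below are hypothetical and only meant to show the idea of holding out one family of a superfamily.

```python
def split_positives(domains, superfamily, test_family):
    """Split the domains of one superfamily into training and test sets.

    domains is a list of (name, superfamily, family) tuples. Members of
    test_family are withheld for testing; the remaining members of the
    superfamily are used as positive training examples.
    """
    in_superfamily = [d for d in domains if d[1] == superfamily]
    train = [d for d in in_superfamily if d[2] != test_family]
    test = [d for d in in_superfamily if d[2] == test_family]
    return train, test

# Toy database with made-up labels.
domains = [
    ("d1", "sf1", "famA"),
    ("d2", "sf1", "famA"),
    ("d3", "sf1", "famB"),
    ("d4", "sf1", "famC"),
    ("d5", "sf2", "famD"),
]
train_pos, test_pos = split_positives(domains, "sf1", "famA")
```

Because no member of the test family is seen during training, a correct detection corresponds to discovering a new family of a known superfamily.
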

13
Overview of experiments
  • Database
  • SCOP version 1.37 PDB90, consisting of protein
    domains, no two of which have 90% or more residue
    identity
  • PDB90 eliminates redundant sequences.
  • Generative models
  • SAM-T98 HMMs (alignment of the domain sequence
    and the final set of homologs)
  • Data selection
  • 33 test families from 16 superfamilies
  • Evaluation strategy
  • Assess to what extent each method gives better scores
    to the positive test examples than to the
    negative test examples.

14
SCOP: a Structural Classification of Proteins
database
  • Hierarchical levels
  • Family: proteins clustered by common evolutionary
    origin; residue identities of 30% or above, or lower
    sequence identities but very similar functions
    and structures
  • Superfamily: low sequence identities but probably
    a common evolutionary origin
  • Fold: same major secondary structures in the same
    arrangement and with the same topological
    connections

15
Figure 1: Separation of the SCOP PDB90 database
into training and test sequences, shown for the G
proteins test family

16
Multiple models used
  • Modeling a superfamily
  • SAM-T98 starts with a single sequence (the
    guide sequence for the domain) and builds a model.
  • A subset of PDB90 is used for building models of
    other families (too many sequences).
  • Train the SVM-Fisher method using each of the models
    in turn.

17
Details on the training and test sets
  • All PDB90 sequences outside the fold of the test
    family were used as either negative training or
    negative test examples.
  • The test/training allocation of negative
    examples was reversed and the experiments repeated.
  • Negative examples were split on a fold-by-fold basis.
  • For positive examples
  • PDB90 sequences in the superfamily of the test
    family are used.
  • Homologs found by each individual SAM-T98 model
    are used.

18
BLAST methods
  • WU-BLAST version 2.0a16 (Altschul & Gish 1996)
  • The PDB90 database was queried with each positive
    training example, and E-values were recorded.
  • BLAST:SCOP-only
  • BLAST:SCOP+SAM-T98-homologs
  • Scores were combined by the maximum method.

19
Generative HMM models
  • SAM-T98 method
  • Null model: reverse sequence model
  • Same data and same set of models as in
    SVM-Fisher
  • Scores combined with the maximum method

20
Results
  • Metric: the rate of false positives (RFP)
  • RFP for a positive test sequence: the fraction
    of negative test sequences that score as well as or
    better than that positive sequence
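
The RFP definition translates directly into code. The sketch below assumes that higher scores are better and uses made-up score values; the median and maximum over the positive test sequences are computed with the standard library.

```python
import statistics

def rate_of_false_positives(positive_score, negative_scores):
    """Fraction of negatives scoring as well as or better than the positive."""
    as_good_or_better = sum(1 for s in negative_scores if s >= positive_score)
    return as_good_or_better / len(negative_scores)

# Hypothetical scores; higher means more family-like.
negative_scores = [0.1, 0.5, 0.9, 0.3]
positive_scores = [0.5, 1.0, 0.2]

rfps = [rate_of_false_positives(p, negative_scores) for p in positive_scores]
median_rfp = statistics.median(rfps)
maximum_rfp = max(rfps)
```

An RFP of 0 means the positive sequence scored better than every negative test sequence.
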

21
G-proteins
  • Results for the G proteins family of the nucleotide
    triphosphate hydrolases SCOP superfamily
  • Tests the ability to distinguish 8 PDB90 G
    proteins from 2439 sequences in other SCOP folds.
  • Table 1
  • In SVM-Fisher
  • 5 of the 8 G proteins score better than all 2439
    negative test sequences.
  • Maximum RFP
  • Median RFP
  • Figure 2
  • RFP curve

23
Table 1. Rate of false positives for the G proteins
family. BLAST = BLAST:SCOP-only, B-Hom =
BLAST:SCOP+SAMT-98-homologs, S-T98 = SAMT-98, and
SVM-F = SVM-Fisher method

24
Figure 2: Four methods on the 33 test families.
Curve of median RFP

25
Discussion
  • New approach
  • to the recognition of remote protein homologies: a
    discriminative method built on top of a
    generative model (HMMs)
  • Discriminative method on top of HMM methods
  • Significant improvement
  • The combination of multiple scores could be improved.
  • Allocation problem
  • Different training sets for tuning the HMM and
    for the discriminative model
  • Extend the method to identify multiple domains
    within large protein sequences.