Using the Fisher kernel method to detect remote protein homologies


1
Using the Fisher kernel method to detect remote
protein homologies

Tommi Jaakkola, Mark Diekhans, David Haussler
ISMB 99
Summarized by O, Jangmin (2001/09/22)

2
Abstract

Detecting remote protein homologies
Fisher kernel method
Variant of Support Vector Machines using new
kernel function
Derived from Hidden Markov Models

3
Introduction (1)

Detecting protein homologies (sequence-based
algorithms)
BLAST, Fasta, PROBE, templates, profiles,
position-specific weight matrices, HMMs
Compared by Brenner (1996) and Park et al. (1998)
using the SCOP classification of protein structures
Remote protein homologies exist between
protein domains in the same structural
superfamily.
Statistical models like PSI-BLAST and HMMs are
better than simple pairwise comparison methods,
but they still miss many important remote
homologies.

4
Introduction (2)

Generative statistical models (HMMs)
Extracting features from protein sequences
Mapping all protein sequences to points in a
Euclidean feature space of fixed dimension.
A general discriminative statistical method then
classifies the points.
This gives an improvement over HMMs alone.

5
Methods

How generative models (HMMs) work
Training examples (sequences known to be members
of the protein family) are positive only.
Parameters are tuned using a priori knowledge.
The model assigns a probability to any given
protein sequence.
Sequences from the family yield a higher
probability than sequences from outside the family.
The log-likelihood ratio against a null model is
used as the score.

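The log-likelihood ratio score can be illustrated with a minimal sketch. It assumes, purely for illustration, a position-independent residue model standing in for a real HMM; the function name, the toy four-letter alphabet, and the probabilities are all hypothetical.

```python
import math

def log_odds_score(seq, family_probs, null_probs):
    """Log-likelihood ratio: log P(seq | family model) - log P(seq | null model)."""
    return sum(math.log(family_probs[r]) - math.log(null_probs[r]) for r in seq)

# Toy alphabet and invented probabilities (assumptions, not real protein data).
family = {"A": 0.4, "C": 0.4, "G": 0.1, "T": 0.1}
null = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}

member_score = log_odds_score("ACAC", family, null)    # fits the family model
outsider_score = log_odds_score("GTGT", family, null)  # fits the null model better
```

A positive score means the family model explains the sequence better than the null model; a negative score means the opposite.
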
6
Discriminative approaches

Use both positive and negative examples.
Parameters are tuned so that the model
optimally discriminates members of the family from
nonmembers.
The likelihood ratio is optimal if the generative
model fits the data perfectly, but when training
examples are few, discriminative methods often
perform better.

7
Kernel methods

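Discriminant function:
$L(X) = \sum_{i: X_i \in H_1} \lambda_i K(X_i, X) - \sum_{i: X_i \in H_2} \lambda_i K(X_i, X)$
where $X_i$, $i = 1, \ldots, n$ are the training
examples and $H_1$, $H_2$ are the hypothesis classes
(+ terms: sequences of the family; - terms:
sequences outside of the family).
Contribution of the kernel
$\lambda_i$: overall importance of the example $X_i$.
$K(X_i, X)$: measure of pairwise similarity.
The user supplies the type of kernel for the
application area.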
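As a rough sketch of how such a discriminant function is evaluated, the snippet below sums weighted kernel similarities with a + sign for family examples and a - sign for non-family examples. The Gaussian kernel, the two-dimensional points, and the unit weights are assumptions chosen for illustration; in practice the weights would be learned.

```python
import math

def gaussian_kernel(a, b, sigma=1.0):
    """One possible user-supplied kernel: similarity decays with squared distance."""
    sq_dist = sum((p - q) ** 2 for p, q in zip(a, b))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))

def discriminant(x, examples, labels, weights, kernel):
    """L(x) = sum_i y_i * lambda_i * K(x_i, x), with y_i = +1 (family) or -1 (outside)."""
    return sum(y * lam * kernel(xi, x)
               for xi, y, lam in zip(examples, labels, weights))

# Hypothetical training points: two family members and one non-member.
examples = [(0.0, 0.0), (0.2, 0.1), (3.0, 3.0)]
labels = [+1, +1, -1]
weights = [1.0, 1.0, 1.0]  # assumed values; normally learned from data

near_family = discriminant((0.1, 0.0), examples, labels, weights, gaussian_kernel)
near_outside = discriminant((2.9, 3.1), examples, labels, weights, gaussian_kernel)
```

The sign of the result assigns the query to a class: positive for the family, negative for outside the family.
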

8
The Fisher kernel (1)

Deriving kernel function from generative models
Advantage 1: handles variable-length protein
sequences.
Advantage 2: encodes prior knowledge about
protein sequences.
Difference from HMMs
The kernel function specifies a similarity score for
any pair of sequences.
The likelihood score from an HMM only measures the
closeness of a sequence to the model itself.

9
The Fisher kernel (2)

Sufficient statistics
For each parameter in the HMM: posterior frequencies
of a particular transition, or
of generating one of the residues of the query
sequence.
They reflect the process of generating the query
sequence from the HMM.
Alternative to sufficient statistics: the Fisher
score $U_X = \nabla_\theta \log P(X \mid H_1, \theta)$
Magnitude of the components: how strongly each
parameter contributes to generating the query
sequence.
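To make the idea of a Fisher score concrete, here is a minimal sketch that takes the gradient of the log-likelihood with respect to the model parameters. It assumes a single-state model with independent residue probabilities instead of a full HMM, and it ignores the sum-to-one constraint on the parameters; the alphabet and parameter values are hypothetical.

```python
import math
from collections import Counter

ALPHABET = "ACGT"  # toy alphabet, assumed for illustration

def log_likelihood(seq, theta):
    """log P(seq | theta) under an independent-residue model."""
    return sum(math.log(theta[r]) for r in seq)

def fisher_score(seq, theta):
    """Gradient of log P(seq | theta) w.r.t. each theta[r]: count(r) / theta[r]."""
    counts = Counter(seq)
    return [counts[r] / theta[r] for r in ALPHABET]

theta = {"A": 0.4, "C": 0.3, "G": 0.2, "T": 0.1}  # invented parameters
short_vec = fisher_score("ACA", theta)
long_vec = fisher_score("ACGTACGTAA", theta)
```

Both sequences map to vectors with one component per parameter, regardless of sequence length, which is what lets a kernel compare sequences of different lengths.
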

10
The Fisher kernel (3)

Kernel function used in this paper.
Note that the Fisher score is a fixed-length vector.
Summary
Train an HMM with positive examples.
Map each new protein sequence X into a
fixed-length vector, the Fisher score.
Calculate the kernel function from the Fisher
scores.
Obtain the resulting discriminant function
(SVM-Fisher).
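The kernel step of this summary might look like the sketch below. It assumes a Gaussian kernel on the difference between two Fisher score vectors, with the width sigma treated as a free parameter; the score vectors are invented values standing in for the output of a trained HMM.

```python
import math

def fisher_kernel(u_x, u_y, sigma):
    """Gaussian kernel on two fixed-length Fisher score vectors (assumed form)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(u_x, u_y))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))

def kernel_matrix(score_vectors, sigma):
    """Pairwise kernel values for a list of Fisher score vectors."""
    return [[fisher_kernel(u, v, sigma) for v in score_vectors]
            for u in score_vectors]

# Hypothetical Fisher scores for three sequences (not real data).
scores = [
    [5.0, 3.3, 0.0, 0.0],
    [5.0, 0.0, 0.0, 10.0],
    [4.8, 3.0, 0.5, 0.0],
]
K = kernel_matrix(scores, sigma=2.0)
```

A matrix like `K` is the kind of input a kernel-based classifier would consume when learning the example weights.
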

11
The Fisher kernel (4)

Combination of scores
There might be more than one HMM model for the
family or superfamily of interest.
Average score
Maximum score
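The two combination rules can be sketched in a few lines. The function name and the example scores are hypothetical; the sketch assumes each model has already produced one score for the query sequence.

```python
from statistics import mean

def combine_scores(scores_per_model, method="max"):
    """Combine the scores that several models assign to one query sequence."""
    if not scores_per_model:
        raise ValueError("need at least one score")
    if method == "max":
        return max(scores_per_model)
    if method == "average":
        return mean(scores_per_model)
    raise ValueError(f"unknown method: {method}")

model_scores = [0.2, 1.5, -0.3]  # invented scores from three models
best = combine_scores(model_scores, "max")
avg = combine_scores(model_scores, "average")
```
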

12
Experimental Methods

Methods
SVM-Fisher (this paper)
BLAST (Altschul et al. 1990; Gish & States 1993)
HMMs using SAM-T98 methodology (Park et al. 1998;
Karplus, Barrett & Hughey 1998; Hughey & Krogh
1995, 1996)
Measurement of recognition rate for members of
superfamilies of the SCOP protein structure
classification (Hubbard et al. 1997)
Withhold all members of a SCOP family.
Train with the remaining members of the SCOP
superfamily.
Test with the withheld data.
Question: Could the method discover a new family
of a known superfamily?
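The hold-out protocol for positive examples can be sketched as follows. The record layout `(domain_id, superfamily, family)` and all labels in the toy data are assumptions made for illustration, not the SCOP file format.

```python
def split_positive_examples(domains, test_family):
    """Withhold one family; train on the rest of its superfamily.

    domains: list of (domain_id, superfamily, family) tuples (assumed layout).
    Returns (positive_train_ids, positive_test_ids).
    """
    superfamilies = {sf for _, sf, fam in domains if fam == test_family}
    if len(superfamilies) != 1:
        raise ValueError("test family must belong to exactly one superfamily")
    (target_sf,) = superfamilies
    train = [d for d, sf, fam in domains if sf == target_sf and fam != test_family]
    test = [d for d, sf, fam in domains if fam == test_family]
    return train, test

# Toy records with invented identifiers.
domains = [
    ("d1", "sfA", "fam1"),
    ("d2", "sfA", "fam1"),
    ("d3", "sfA", "fam2"),
    ("d4", "sfA", "fam3"),
    ("d5", "sfB", "fam4"),
]
train_ids, test_ids = split_positive_examples(domains, "fam1")
```

Because no member of the withheld family is seen in training, good test scores indicate that the method can recognize a family it has never seen within a known superfamily.
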

13
Overview of experiments

Database
SCOP version 1.37 PDB90, consisting of protein
domains, no two of which have 90% or more residue
identity
PDB90 eliminates redundant sequences.
Generative models
SAM-T98 HMMs (alignment of the domain sequence
and final set of homologs)
Data selection
Get 33 test families from 16 superfamilies.
Evaluation strategy
Assess to what extent each method gave better
scores to the positive test examples than to the
negative test examples.

14
SCOP: a Structural Classification of Proteins
database

Hierarchical levels
Family: proteins clustered by common evolutionary
origin; residue identities of 30% and above, or
lower sequence identities but very similar
functions and structures
Superfamily: low sequence identities but probably
common evolutionary origin
Fold: same major secondary structures in the same
arrangement and with the same topological
connections

15
Figure 1: Separation of the SCOP PDB90 database
into training and test sequences, shown for the G
proteins test family

16
Multiple models used

Modeling superfamily
SAM-T98 starts with a single sequence (the
guide sequence for the domain) and builds a model.
A subset of PDB90 is used for building models of
the other families (there are too many sequences).
Train the SVM-Fisher method using each of the
models in turn.

17
Details on the training and test sets

All PDB90 sequences outside the fold of the test
family were used as either negative training or
negative test examples.
The test/training allocation of negative examples
was then reversed and the experiments repeated.
Negative examples were split on a fold-by-fold
basis.
For positive examples:
PDB90 sequences in the superfamily of the test
family are used.
Homologs found by each individual SAM-T98 model
are used.

18
BLAST methods

WU-BLAST version 2.0a16 (Altschul & Gish 1996)
The PDB90 database was queried with each positive
training example, and E-values were recorded.
BLAST:SCOP-only
BLAST:SCOP+SAM-T98-homologs
Scores were combined by the maximum method.

19
Generative HMM models

SAM-T98 method
Null model: reverse sequence model
Same data and same set of models as in
SVM-Fisher
Scores combined with the maximum method

20
Results

Metric: the rate of false positives (RFP)
RFP for a positive test sequence: the fraction
of negative test sequences that score as well as
or better than the positive sequence.
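A minimal sketch of the RFP computation follows. It assumes that a higher score is better, which has to be flipped for scores such as E-values where lower is better; the function name and the example scores are hypothetical.

```python
from statistics import median

def rate_of_false_positives(positive_score, negative_scores):
    """Fraction of negative test sequences scoring as well as or better
    than the given positive test sequence (higher score = better, assumed)."""
    if not negative_scores:
        raise ValueError("need at least one negative test score")
    hits = sum(1 for s in negative_scores if s >= positive_score)
    return hits / len(negative_scores)

negatives = [0.1, 0.5, 0.9, 1.2]  # invented negative test scores
positives = [2.0, 0.9, 0.3]       # invented positive test scores

rfps = [rate_of_false_positives(p, negatives) for p in positives]
median_rfp = median(rfps)
max_rfp = max(rfps)
```

An RFP of 0 means the positive sequence outscored every negative test sequence.
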

21
G-proteins

Results for the G proteins family of the nucleotide
triphosphate hydrolases SCOP superfamily
Test the ability to distinguish 8 PDB90 G
proteins from 2439 sequences in other SCOP folds.
Table 1
With SVM-Fisher,
5 of the 8 G proteins score better than all 2439
negative test sequences.
Maximum RFP
Median RFP
Figure 2
RFP curve

22
Figure 1: Separation of the SCOP PDB90 database
into training and test sequences, shown for the G
proteins test family

23
Table 1: Rate of false positives for the G proteins
family. BLAST = BLAST:SCOP-only, B-Hom =
BLAST:SCOP+SAM-T98-homologs, S-T98 = SAM-T98, and
SVM-F = SVM-Fisher method

24
Figure 2: The four methods on the 33 test families.
Curve of median RFP

25
Discussion

New approach
to the recognition of remote protein homologies:
a discriminative method built on top of a
generative model (HMM)
Significant improvement over HMM methods alone
The combination of multiple scores could be
improved.
Allocation problem
Use different training sets for tuning the HMM
and for training the discriminative model
Extend the method to identify multiple domains
within large protein sequences