Title: Using the Fisher kernel method to detect remote protein homologies
1. Using the Fisher kernel method to detect remote protein homologies
- Tommi Jaakkola, Mark Diekhans, David Haussler
- ISMB 99
- Summarized by O, Jangmin (2001/09/22)
2. Abstract
- Detecting remote protein homologies
- Fisher kernel method
  - Variant of Support Vector Machines using a new kernel function
  - Derived from Hidden Markov Models
3. Introduction (1)
- Detecting protein homologies (sequence-based algorithms)
  - BLAST, Fasta, PROBE, templates, profiles, position-specific weight matrices, HMMs
  - Comparison by (Brenner 1996; Park et al. 1998)
- SCOP classification of protein structures
  - Remote protein homologies exist between protein domains in the same structural superfamily.
- Statistical models like PSI-BLAST and HMMs are better than simple pairwise comparison methods,
  but they still miss many important remote homologies.
4. Introduction (2)
- Generative statistical models (HMMs)
  - Extracting features from protein sequences
  - Mapping all protein sequences to points in a Euclidean feature space of fixed dimension
- General discriminative statistical method to classify the points
- Improvements acquired
  - Over HMMs alone
5. Methods
- How generative models (HMMs) work
  - Training examples (sequences known to be members of the protein family): positive examples
  - Tuning parameters with a priori knowledge
  - The model assigns a probability to any given protein sequence.
  - Sequences from the family yield a higher probability than sequences from outside the family.
- Log-likelihood ratio as score
  - Measured against a null model
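The log-likelihood ratio score can be illustrated with a minimal sketch. A position-independent residue model stands in for the trained HMM here (an assumption made for brevity), and the names `family_probs`, `null_probs`, and `log_odds_score` are hypothetical.

```python
import math

def log_likelihood(seq, residue_probs):
    """Log-probability of seq under an i.i.d. residue model (toy stand-in for an HMM)."""
    return sum(math.log(residue_probs[r]) for r in seq)

def log_odds_score(seq, family_probs, null_probs):
    """Score = log P(seq | family model) - log P(seq | null model)."""
    return log_likelihood(seq, family_probs) - log_likelihood(seq, null_probs)

# Toy two-letter alphabet; a real model would cover all 20 amino acids.
family_probs = {"A": 0.8, "G": 0.2}  # hypothetical family model
null_probs = {"A": 0.5, "G": 0.5}    # hypothetical null (background) model

member_like = log_odds_score("AAAG", family_probs, null_probs)
outsider_like = log_odds_score("GGGA", family_probs, null_probs)
```

A positive score means the family model explains the sequence better than the null model; a negative score means the opposite.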
6. Discriminative approaches
- Using both positive and negative examples
- Parameters are tuned so that the model can optimally discriminate members of the family from nonmembers.
- When training examples are few
  - The likelihood ratio is optimal if the generative model fits the data perfectly, but
  - discriminative methods often perform better.
7. Kernel methods
- Discriminant function L(X)
  - Defined over training examples X_i, i = 1, ..., n, and hypothesis classes H1, H2
  - H1: sequences of the family; H2: sequences outside the family
- Contribution of the kernel
  - λ_i: overall importance of the example X_i
  - K(X_i, X): measure of pairwise similarity
- The user supplies the type of kernel for the application area!
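A kernel discriminant of this kind can be sketched as a signed, weighted sum of similarities to the training examples. The sketch assumes the usual kernel-expansion form, L(X) = sum over family examples of λ_i K(X_i, X) minus the same sum over non-family examples. The weights are set by hand rather than trained, and the dot-product kernel is only a placeholder for whatever kernel the user supplies.

```python
def dot_kernel(u, v):
    """Placeholder similarity measure: plain dot product of two feature vectors."""
    return sum(a * b for a, b in zip(u, v))

def discriminant(x, examples, labels, weights, kernel):
    """Signed kernel expansion: labels are +1 (family, H1) or -1 (outside, H2).

    The sign of the result picks the hypothesis class.
    """
    return sum(y * lam * kernel(xi, x)
               for xi, y, lam in zip(examples, labels, weights))

# Hypothetical 2-D feature vectors: two family members, one non-member.
examples = [(1.0, 0.2), (0.8, -0.1), (-1.0, 0.1)]
labels = [+1, +1, -1]
weights = [1.0, 1.0, 1.0]  # lambda_i, fixed by hand for illustration

score_member = discriminant((0.9, 0.0), examples, labels, weights, dot_kernel)
score_outsider = discriminant((-0.7, 0.3), examples, labels, weights, dot_kernel)
```

In an SVM the weights λ_i come out of the training procedure; most end up zero, and the remaining examples are the support vectors.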
8. The Fisher kernel (1)
- Deriving the kernel function from generative models
  - Advantage 1: handles variable-length protein sequences
  - Advantage 2: encodes prior knowledge about protein sequences
- Difference from HMMs
  - The kernel function specifies a similarity score for any pair of sequences.
  - The likelihood score from an HMM only measures the closeness of the sequence to the model itself.
9. The Fisher kernel (2)
- Sufficient statistics
  - For each parameter in the HMM: posterior frequencies
    - of a particular transition
    - of generating one of the residues of the query sequence
  - Reflect the process of generating the query sequence from the HMM
- Alternative to sufficient statistics: the Fisher score
  - Magnitude of the components: how each parameter contributes to generating the query sequence
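The link between posterior frequencies and the Fisher score can be made concrete with a small sketch. It runs forward-backward on a toy two-state HMM and builds the emission components of the score, assuming the form ξ(x,s)/θ(x|s) − ξ(s), where ξ denotes posterior frequencies. Only emission parameters are covered, the toy model and all function names are hypothetical, and no numerical scaling is applied, so it suits short sequences only.

```python
def posteriors(seq, init, trans, emit):
    """Forward-backward: gamma[t][s] = P(state s at position t | seq)."""
    n, T = len(init), len(seq)
    alpha = [[0.0] * n for _ in range(T)]
    beta = [[1.0] * n for _ in range(T)]
    for s in range(n):
        alpha[0][s] = init[s] * emit[s][seq[0]]
    for t in range(1, T):
        for s in range(n):
            alpha[t][s] = emit[s][seq[t]] * sum(
                alpha[t - 1][r] * trans[r][s] for r in range(n))
    for t in range(T - 2, -1, -1):
        for s in range(n):
            beta[t][s] = sum(
                trans[s][r] * emit[r][seq[t + 1]] * beta[t + 1][r] for r in range(n))
    total = sum(alpha[T - 1])  # P(seq | model)
    return [[alpha[t][s] * beta[t][s] / total for s in range(n)] for t in range(T)]

def fisher_score(seq, init, trans, emit, alphabet):
    """Emission part of the Fisher score: one component per (state, residue) pair."""
    gamma = posteriors(seq, init, trans, emit)
    score = []
    for s in range(len(init)):
        xi_s = sum(g[s] for g in gamma)  # expected visits to state s
        for x in alphabet:
            # expected number of times state s emits residue x
            xi_xs = sum(g[s] for g, r in zip(gamma, seq) if r == x)
            score.append(xi_xs / emit[s][x] - xi_s)
    return score

# Hypothetical two-state model over a two-letter alphabet.
init = [0.6, 0.4]
trans = [[0.7, 0.3], [0.4, 0.6]]
emit = [{"A": 0.9, "G": 0.1}, {"A": 0.2, "G": 0.8}]

u_short = fisher_score("AG", init, trans, emit, "AG")
u_long = fisher_score("AAGGAG", init, trans, emit, "AG")
```

Both sequences map to vectors of the same length (states × residues) even though the sequences differ in length, which is what lets a standard kernel be applied afterwards.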
10. The Fisher kernel (3)
- Kernel function used in this paper
  - Note that the Fisher score is a fixed-length vector.
- Summary
  - Train an HMM with positive examples.
  - Map each new protein sequence X into a fixed-length vector, the Fisher score.
  - Calculate the kernel function.
  - Get the resulting discriminant function (SVM-Fisher).
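The kernel step of the summary can be sketched as follows. The slide does not reproduce the formula, so this sketch assumes a Gaussian kernel on Fisher scores, K(X, X') = exp(-||U_X - U_X'||^2 / (2σ^2)), with σ taken as the median distance from a positive training example to its nearest negative example. Both the form and the choice of σ are assumptions, and the toy vectors stand in for real Fisher scores.

```python
import math
from statistics import median

def euclidean(u, v):
    """Euclidean distance between two equal-length score vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def median_nearest_negative(pos, neg):
    """Median over positives of the distance to the closest negative (assumed sigma rule)."""
    return median([min(euclidean(p, q) for q in neg) for p in pos])

def fisher_gaussian_kernel(u, v, sigma):
    """Gaussian similarity between two Fisher score vectors."""
    return math.exp(-euclidean(u, v) ** 2 / (2 * sigma ** 2))

# Hypothetical Fisher scores for two positive and two negative training sequences.
pos = [[0.0, 0.0], [0.0, 2.0]]
neg = [[3.0, 0.0], [0.0, 7.0]]

sigma = median_nearest_negative(pos, neg)
k_pos_pos = fisher_gaussian_kernel(pos[0], pos[1], sigma)
k_pos_neg = fisher_gaussian_kernel(pos[0], neg[1], sigma)
```

Tying σ to the data keeps the kernel width on the same scale as the gap between the two classes, so no separate width needs to be tuned by hand.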
11. The Fisher kernel (4)
- Combination of scores
  - There might be more than one HMM for the family or superfamily of interest.
  - Average score
  - Maximum score
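The two combination rules amount to very little code. In this sketch each model contributes one discriminant score for the same query sequence; the function names and the example scores are hypothetical.

```python
def average_score(model_scores):
    """Mean of the per-model scores for one query sequence."""
    return sum(model_scores) / len(model_scores)

def maximum_score(model_scores):
    """Best per-model score for one query sequence."""
    return max(model_scores)

# Hypothetical scores for one query under three models of the same superfamily.
scores = [0.5, -1.0, 2.0]
combined_avg = average_score(scores)
combined_max = maximum_score(scores)
```

The maximum rule lets a single well-matching model decide, while the average rule rewards sequences that score reasonably under many models.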
12. Experimental Methods
- Methods
  - SVM-Fisher (this paper)
  - BLAST (Altschul et al. 1990; Gish & States 1993)
  - HMMs using SAM-T98 methodology (Park et al. 1998; Karplus, Barrett & Hughey 1998; Hughey & Krogh 1995, 1996)
- Measurement of recognition rate for members of superfamilies of the SCOP protein structure classification (Hubbard et al. 1997)
  - Withhold all members of a SCOP family.
  - Train with the remaining members of the SCOP superfamily.
  - Test with the withheld data.
- Question: could the method discover a new family of a known superfamily?
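The withholding protocol can be sketched as a simple partition of the positive examples. The record layout (`id`, `superfamily`, `family`), the identifiers, and the second family name are all hypothetical.

```python
def hold_out_family(domains, test_family):
    """Split positives: the withheld family is the test set, the rest of its superfamily trains."""
    superfamily = next(d["superfamily"] for d in domains if d["family"] == test_family)
    pos_test = [d["id"] for d in domains if d["family"] == test_family]
    pos_train = [d["id"] for d in domains
                 if d["superfamily"] == superfamily and d["family"] != test_family]
    return pos_train, pos_test

# Hypothetical domain records.
domains = [
    {"id": "d1", "superfamily": "NTP hydrolases", "family": "G proteins"},
    {"id": "d2", "superfamily": "NTP hydrolases", "family": "G proteins"},
    {"id": "d3", "superfamily": "NTP hydrolases", "family": "other family"},
    {"id": "d4", "superfamily": "unrelated", "family": "unrelated family"},
]

pos_train, pos_test = hold_out_family(domains, "G proteins")
```

Because no member of the test family is seen in training, a correct prediction corresponds to recognizing a new family of a known superfamily.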
13. Overview of experiments
- Database
  - SCOP version 1.37 PDB90, consisting of protein domains, no two of which have 90% or more residue identity
  - PDB90 eliminates redundant sequences.
- Generative models
  - SAM-T98 HMMs (alignment of the domain sequence and final set of homologs)
- Data selection
  - 33 test families from 16 superfamilies
- Evaluation strategy
  - Assess to what extent each method gave better scores to the positive test examples than to the negative test examples.
14. SCOP: a Structural Classification of Proteins database
- Hierarchical levels
  - Family: proteins clustered by common evolutionary origin; residue identities above 30%, or lower sequence identities but very similar functions and structures
  - Superfamily: low sequence identities but probably a common evolutionary origin
  - Fold: same major secondary structures in the same arrangement and with the same topological connections
15. Figure 1: Separation of the SCOP PDB90 database into training and test sequences, shown for the G proteins test family
16. Multiple models used
- Modeling the superfamily
  - SAM-T98 starts with a single sequence (the guide sequence for the domain) and builds a model.
  - Using a subset of PDB90 for building models of other families (too many sequences)
  - The SVM-Fisher method is trained using each of the models in turn.
17. Details on the training and test sets
- All PDB90 sequences outside the fold of the test family were used as either negative training or negative test examples.
  - Reverse the test/training allocation of negative examples, and repeat the experiments.
  - Negative examples are split on a fold-by-fold basis.
- For positive examples
  - PDB90 sequences in the superfamily of the test family are used.
  - Homologs found by each individual SAM-T98 model are used.
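A fold-by-fold split of the negative examples might look like the sketch below. Each fold goes entirely to training or entirely to testing, and swapping the two lists gives the reversed allocation. Assigning alternate folds in sorted order is an assumption made for illustration (the slides do not say how folds were assigned), and the record layout and fold names are hypothetical.

```python
def split_negatives_by_fold(domains, test_fold):
    """Assign whole folds (never single sequences) to negative training or negative test."""
    folds = sorted({d["fold"] for d in domains if d["fold"] != test_fold})
    train_folds = set(folds[0::2])  # assumed rule: every other fold goes to training
    neg_train = [d["id"] for d in domains if d["fold"] in train_folds]
    neg_test = [d["id"] for d in domains
                if d["fold"] != test_fold and d["fold"] not in train_folds]
    return neg_train, neg_test

# Hypothetical records; "fold_T" is the fold of the test family.
domains = [
    {"id": "n1", "fold": "fold_A"},
    {"id": "n2", "fold": "fold_A"},
    {"id": "n3", "fold": "fold_B"},
    {"id": "n4", "fold": "fold_C"},
    {"id": "p1", "fold": "fold_T"},
]

neg_train, neg_test = split_negatives_by_fold(domains, "fold_T")
# Reversed allocation for the repeated experiment.
neg_train_rev, neg_test_rev = neg_test, neg_train
```

Splitting by fold keeps structurally similar negatives from appearing on both sides of the train/test boundary.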
18. BLAST methods
- WU-BLAST version 2.0a16 (Altschul & Gish 1996)
- The PDB90 database was queried with each positive training example, and E-values were recorded.
  - BLAST:SCOP-only
  - BLAST:SCOP+SAM-T98-homologs
- Scores were combined by the maximum method.
19. Generative HMM models
- SAM-T98 method
  - Null model: reverse-sequence model
  - Same data and same set of models as in SVM-Fisher
  - Combined with the maximum method
20. Results
- Metric: the rate of false positives (RFP)
  - RFP for a positive test sequence: the fraction of negative test sequences that score as good as or better than the positive sequence.
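The RFP definition translates directly into code. The sketch assumes scores are oriented so that higher is better (E-values, where lower is better, would need to be negated first); the function name and the example scores are hypothetical.

```python
def rate_of_false_positives(pos_score, neg_scores):
    """Fraction of negative test sequences scoring as good as or better than pos_score."""
    return sum(1 for s in neg_scores if s >= pos_score) / len(neg_scores)

# Hypothetical scores for four negative test sequences.
neg_scores = [0.1, 0.5, 0.9, 1.2]

rfp_mid = rate_of_false_positives(0.9, neg_scores)   # ties count against the positive
rfp_best = rate_of_false_positives(2.0, neg_scores)  # beats every negative
```

An RFP of 0 means the positive sequence outscored every negative test sequence.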
21. G-proteins
- Results for the G proteins family of the nucleotide triphosphate hydrolases SCOP superfamily
  - Tests the ability to distinguish 8 PDB90 G proteins from 2439 sequences in other SCOP folds.
- Table 1
  - In SVM-Fisher, 5 of the 8 G proteins score better than all 2439 negative test sequences.
  - Maximum RFP
  - Median RFP
- Figure 2
  - RFP curve
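The per-family summaries (maximum and median RFP) can be sketched as below. The scores are made up for illustration and are not the values from Table 1; higher scores are assumed to be better.

```python
from statistics import median

def rfp(pos_score, neg_scores):
    """Fraction of negatives scoring as good as or better than the positive."""
    return sum(1 for s in neg_scores if s >= pos_score) / len(neg_scores)

def family_summary(pos_scores, neg_scores):
    """Return (maximum RFP, median RFP) over the positive test sequences of one family."""
    rfps = [rfp(p, neg_scores) for p in pos_scores]
    return max(rfps), median(rfps)

# Hypothetical scores: three positive test sequences, four negative test sequences.
pos_scores = [3.5, 2.5, 0.5]
neg_scores = [0.0, 1.0, 2.0, 3.0]

max_rfp, median_rfp = family_summary(pos_scores, neg_scores)
```

The maximum RFP reflects the hardest positive sequence in the family, while the median RFP is less sensitive to a single badly scored sequence.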
22. Figure 1: Separation of the SCOP PDB90 database into training and test sequences, shown for the G proteins test family
23. Table 1: Rate of false positives for the G proteins family. BLAST = BLAST:SCOP-only, B-Hom = BLAST:SCOP+SAM-T98-homologs, S-T98 = SAM-T98, and SVM-F = SVM-Fisher method.
24. Figure 2: The 4 methods on the 33 test families. Curve of median RFP.
25. Discussion
- New approach to the recognition of remote protein homologies
  - A discriminative method built on top of a generative model (HMMs)
- Discriminative method on top of HMM methods
  - Significant improvement
- The combination of multiple scores could be improved.
- Allocation problem
  - One training set for tuning the HMM and a different training set for the discriminative model
- Extend the method to identify multiple domains within large protein sequences.