Mismatch String Kernels for SVM Protein Classification - PowerPoint PPT Presentation

Transcript and Presenter's Notes



1
Mismatch String Kernels for SVM Protein
Classification
  • By Leslie et al.
  • Presented by Huzefa Rangwala
  • (rangwala@cs.umn.edu)

2
Outline
  • Problem Definition
  • Support Vector Machines
  • Feature Extraction Scheme
  • Evaluations
  • Comparison with other Schemes
  • Conclusion

3
Problem Definition
  • Remote Homology
  • Motivation: Classify proteins based on sequence
    data into homologous (evolutionarily similar)
    groups to understand the structure and
    functions of proteins.
  • Previously known approaches: pairwise sequence
    alignment, profiles, HMMs
  • New approach: discriminative models

4
Problem Definition
  • Protein sequences are seen as a set of labeled
    sequences, i.e., positive if they belong to the
    same superfamily or family and negative
    otherwise
  • Our classifiers will detect as positives those
    test sequences that are remotely related to the
    positive training instances.

5
Classification Definition
Taken from "Using the Fisher kernel method to
detect remote protein homologies" by Jaakkola,
Diekhans, et al.
6
SVM Linear Separators
  • Which of the linear separators is optimal?

7
Classification Margin
  • Distance from an example x to the separator is
    r = y (wTx + b) / ||w||
  • Examples closest to the hyperplane are support
    vectors.
  • Margin ρ of the separator is the width of
    separation between classes.
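The distance and margin formulas on this slide can be written out in full. This is the standard SVM formulation (symbols w, b, and label y as in the usual setup; the transcript does not spell them out): for a canonical hyperplane, where the support vectors satisfy y (wTx + b) = 1, the margin follows directly from the distance formula.

```latex
r = \frac{y\,(\mathbf{w}^{T}\mathbf{x} + b)}{\lVert \mathbf{w} \rVert},
\qquad
\rho = 2\,r_{\text{support vector}} = \frac{2}{\lVert \mathbf{w} \rVert}
```

Maximizing the margin ρ is therefore equivalent to minimizing ||w||, which is the SVM training objective.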

8
The Kernel Trick
  • The linear classifier relies on the inner product
    between vectors: K(xi, xj) = xiT xj
  • If every data point is mapped into a
    high-dimensional space via some transformation
    Φ: x → φ(x), the inner product becomes
  • K(xi, xj) = φ(xi)T φ(xj)
  • A kernel function is a function that
    corresponds to an inner product in some feature
    space.
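The kernel trick can be checked numerically. A minimal sketch (not from the paper; it uses the classic degree-2 polynomial kernel as the illustration, with the explicit feature map φ written out for 2-D inputs):

```python
import math

def phi(x):
    # Explicit degree-2 polynomial feature map for a 2-D input:
    # phi(x) = (1, sqrt(2)x1, sqrt(2)x2, x1^2, x2^2, sqrt(2)x1x2)
    x1, x2 = x
    r2 = math.sqrt(2)
    return [1.0, r2 * x1, r2 * x2, x1 * x1, x2 * x2, r2 * x1 * x2]

def poly_kernel(x, y):
    # Same inner product computed directly in input space:
    # K(x, y) = (1 + x . y)^2  -- no explicit mapping needed
    return (1.0 + x[0] * y[0] + x[1] * y[1]) ** 2

x, y = (1.0, 2.0), (3.0, 4.0)
explicit = sum(a * b for a, b in zip(phi(x), phi(y)))
print(explicit, poly_kernel(x, y))  # both 144.0
```

The point is that `poly_kernel` never materializes the 6-dimensional vectors, yet returns exactly the inner product `phi(x) . phi(y)`.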

9
Feature representation
  • We essentially apply a transformation function
    to convert sequences (or profiles) into feature
    vectors, which are fed into an SVM black box for
    training and classification.
  • The paper under discussion introduced a new
    feature extraction scheme that is very simple
    yet performs very well.
  • Much of the research in this domain aims to
    extract the best possible features to capture
    the right signals for fold prediction.

10
Feature Extraction
  • Spectrum Kernel (simpler): This kernel simply
    counts the occurrences of k-length subsequences
    (k-mers) in each sequence under consideration.
  • This can be done by sliding a window of length k
    across a sequence and updating a hash table with
    20^k entries (20 amino acids, k positions).
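The sliding-window counting just described can be sketched in a few lines of Python. This is an illustration, not the authors' code; it uses a sparse `Counter` in place of the dense 20^k hash table, so only k-mers that actually occur are stored:

```python
from collections import Counter

def spectrum_features(seq, k):
    # Slide a window of length k across the sequence and
    # count how many times each k-mer occurs.
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def spectrum_kernel(s, t, k):
    # Kernel value = dot product of the two (sparse) count vectors.
    fs, ft = spectrum_features(s, k), spectrum_features(t, k)
    return sum(fs[kmer] * ft[kmer] for kmer in fs)

print(spectrum_features("ACDAAA", 4))        # ACDA, CDAA, DAAA each once
print(spectrum_kernel("ACDAAA", "CDAAAC", 4))  # shares CDAA and DAAA -> 2
```

The sparse representation is why the method is practical despite the 20^k-dimensional feature space: a sequence of length n contributes at most n - k + 1 nonzero entries.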

11
Spectrum Kernel
  • Spectrum k-mer kernel
  • each dimension corresponds to a distinct k-mer
  • each sequence is represented as a frequency
    vector of the various k-mers that it contains.
  • k = 4
  • E.g., ACDAAA... =>
  • hash(ACDA)
  • hash(CDAA)
  • hash(DAAA)
  • ...
  • We have our vector of _______ features.

12
Mismatch Kernel
  • Slight modification to the Spectrum kernel
  • Define a parameter m which allows up to m
    mismatches in the counting of occurrences
  • This means that each k-mer instance also
    increments the counts of all k-mers within
    Hamming distance m of it

13
Mismatch Kernel
  • Adding more information to the k-mer kernel
  • extends the k-mer kernel to allow for up to m
    mismatches
  • E.g., (k, m) = (4, 1)
  • ACDAAA... =>
  • hash(?CDA), hash(A?DA), hash(AC?A), hash(ACD?)
  • hash(?DAA), ...
  • hash(?AAA), ...
  • ... (? stands for any single amino-acid
    substitution)
  • We have our vector of _______ features.
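The neighborhood expansion on this slide can be sketched as a brute-force enumeration (an illustration only; the paper's actual implementation traverses a mismatch tree rather than enumerating variants like this):

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20-letter amino-acid alphabet

def mismatch_neighborhood(kmer, m):
    # All k-mers within Hamming distance m of the given k-mer.
    if m == 0:
        return {kmer}
    hood = {kmer}
    for i in range(len(kmer)):
        for a in AMINO_ACIDS:
            if a != kmer[i]:
                hood |= mismatch_neighborhood(kmer[:i] + a + kmer[i + 1:], m - 1)
    return hood

def mismatch_features(seq, k, m):
    # Each k-mer instance increments every entry in its mismatch
    # neighborhood, per the slide's "hash(?CDA), hash(A?DA), ..." example.
    counts = Counter()
    for i in range(len(seq) - k + 1):
        counts.update(mismatch_neighborhood(seq[i:i + k], m))
    return counts

print(len(mismatch_neighborhood("ACDA", 1)))  # 1 + 4 * 19 = 77 k-mers
```

For (k, m) = (4, 1), each window contributes 77 features instead of 1, which is how the kernel gains tolerance to point substitutions.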

14
Putting it all together
  • SVM can be thought of as a magic box
  • Internally, however, it computes the dot product
    between two vectors to find the similarity
    between two sequences.

[Diagram: N training sequences → N vectors, each of
dimension 20^k → SVM Learner → model learnt → SVM
Classifier; unknown test sequence → test vector →
classification result]
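The pipeline in the diagram can be sketched end to end with a toy example (made-up sequences; the helper names `features`, `kernel`, and `query` are introduced here for illustration, and a real run would feed the Gram matrix `K` to an SVM learner):

```python
from collections import Counter

def features(seq, k=4):
    # Spectrum feature vector: counts of every k-mer in the sequence.
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def kernel(s, t, k=4):
    # Similarity of two sequences = dot product of their feature vectors.
    fs, ft = features(s, k), features(t, k)
    return sum(fs[w] * ft[w] for w in fs)

train = ["ACDAAACDA", "GHIKLMGHI", "ACDAACDAA"]  # N training sequences
query = "ACDAAAA"                                # unknown test sequence

# N x N kernel (Gram) matrix the SVM learner would consume
K = [[kernel(s, t) for t in train] for s in train]
# Kernel row between the test sequence and each training sequence,
# which the trained classifier would combine with learned weights
row = [kernel(query, s) for s in train]
print(row)  # high overlap with sequences 1 and 3, none with sequence 2
```

Note that the SVM never sees raw sequences, only these pairwise dot products, which is exactly why swapping spectrum features for mismatch features changes nothing else in the pipeline.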
15
Evaluation: ROC (Explain?)
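The slide leaves ROC unexplained. As a reminder: an ROC curve plots true-positive rate against false-positive rate as the classification threshold varies, and the area under it (the ROC score) equals the probability that a randomly chosen positive is ranked above a randomly chosen negative. A minimal sketch of that pairwise formulation (not the paper's evaluation code):

```python
def roc_auc(scores, labels):
    # Area under the ROC curve, computed as the fraction of
    # (positive, negative) pairs ranked correctly; ties count half.
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

print(roc_auc([0.9, 0.8, 0.3, 0.1], [1, 1, 0, 0]))  # 1.0, perfect ranking
```

A score of 1.0 means every remote homolog outranks every non-homolog; 0.5 is random ordering.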
16
Conclusion
  • A simple use of just the protein sequence gives
    better performance than BLAST
  • Current state-of-the-art methods use profiles as
    an addition to the kernels; they seem to do
    better at this prediction problem than the ones
    that use structure directly.
  • My research revolves around this idea
  • https://wwws.cs.umn.edu/tech_reports/index.cgi?selectedyear=2005&mode=printreport&report_id=05-007

17
Questions?
  • Huzefa Rangwala
  • PhD in Computer Science
  • Email: rangwala@cs.umn.edu