Mismatch String Kernels for SVM Protein Classification - PowerPoint PPT Presentation

Transcript and Presenter's Notes



1
Mismatch String Kernels for SVM Protein
Classification
  • By Leslie et al.
  • Presented by Huzefa Rangwala
  • (rangwala@cs.umn.edu)

2
Outline
  • Problem Definition
  • Support Vector Machines
  • Feature Extraction Scheme
  • Evaluations
  • Comparison with other Schemes
  • Conclusion

3
Problem Definition
  • Remote Homology
  • Motivation: Classify proteins based on sequence
    data into homologous (evolutionarily similar)
    groups to understand the structure and
    functions of proteins.
  • Previously known approaches: pairwise sequence
    alignment, profiles, HMMs
  • New approach: discriminative models

4
Problem Definition
  • Protein sequences are seen as a set of labeled
    sequences, i.e., positive if they belong to the
    same superfamily or family and negative
    otherwise
  • Our classifiers will detect as positives those
    test sequences that are remotely related to the
    positive training instances.

5
Classification Definition
Taken from "Using the Fisher kernel method to
detect remote protein homologies" by Jaakkola,
Diekhans, et al.
6
SVM Linear Separators
  • Which of the linear separators is optimal?

7
Classification Margin
  • Distance from an example x to the separator is
    r = y (wTx + b) / ||w||
  • Examples closest to the hyperplane are support
    vectors.
  • Margin ρ of the separator is the width of
    separation between classes.
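The distance and margin formulas on this slide can be written out in full. This is the standard SVM formulation (symbols w, b, and label y as in the usual setup; the transcript does not spell them out): for a canonical hyperplane, where the support vectors satisfy y (wTx + b) = 1, the margin follows directly from the distance formula.

```latex
r = \frac{y\,(\mathbf{w}^{T}\mathbf{x} + b)}{\lVert \mathbf{w} \rVert},
\qquad
\rho = 2\,r_{\text{support vector}} = \frac{2}{\lVert \mathbf{w} \rVert}
```

Maximizing the margin ρ is therefore equivalent to minimizing ||w||, which is the SVM training objective.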

8
The Kernel Trick
  • The linear classifier relies on the inner product
    between vectors: K(xi, xj) = xiT xj
  • If every data point is mapped into a
    high-dimensional space via some transformation
    Φ: x → φ(x), the inner product becomes
  • K(xi, xj) = φ(xi)T φ(xj)
  • A kernel function is a function that
    corresponds to an inner product in some feature
    space.
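The kernel trick can be checked numerically. A minimal sketch (not from the paper; it uses the classic degree-2 polynomial kernel as the illustration, with the explicit feature map φ written out for 2-D inputs):

```python
import math

def phi(x):
    # Explicit degree-2 polynomial feature map for a 2-D input:
    # phi(x) = (1, sqrt(2)x1, sqrt(2)x2, x1^2, x2^2, sqrt(2)x1x2)
    x1, x2 = x
    r2 = math.sqrt(2)
    return [1.0, r2 * x1, r2 * x2, x1 * x1, x2 * x2, r2 * x1 * x2]

def poly_kernel(x, y):
    # Same inner product computed directly in input space:
    # K(x, y) = (1 + x . y)^2  -- no explicit mapping needed
    return (1.0 + x[0] * y[0] + x[1] * y[1]) ** 2

x, y = (1.0, 2.0), (3.0, 4.0)
explicit = sum(a * b for a, b in zip(phi(x), phi(y)))
print(explicit, poly_kernel(x, y))  # both 144.0
```

The point is that `poly_kernel` never materializes the 6-dimensional vectors, yet returns exactly the inner product `phi(x) . phi(y)`.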

9
Feature representation
  • We essentially apply a transformation function
    to convert sequences (or profiles) into feature
    vectors, which are fed into an SVM black box for
    training and classification.
  • The paper under discussion introduced a new
    feature extraction scheme that is very simple
    yet performs very well.
  • Much of the research in this domain aims to
    extract the best possible features to capture
    the right signals for fold prediction.

10
Feature Extraction
  • Spectrum Kernel (simpler): This kernel simply
    counts the occurrences of k-length subsequences
    (k-mers) in each sequence under consideration.
  • This can be done by sliding a window of length k
    across a sequence and updating a hash table with
    20^k entries (20 amino acids, k positions).
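The sliding-window counting just described can be sketched in a few lines of Python. This is an illustration, not the authors' code; it uses a sparse `Counter` in place of the dense 20^k hash table, so only k-mers that actually occur are stored:

```python
from collections import Counter

def spectrum_features(seq, k):
    # Slide a window of length k across the sequence and
    # count how many times each k-mer occurs.
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def spectrum_kernel(s, t, k):
    # Kernel value = dot product of the two (sparse) count vectors.
    fs, ft = spectrum_features(s, k), spectrum_features(t, k)
    return sum(fs[kmer] * ft[kmer] for kmer in fs)

print(spectrum_features("ACDAAA", 4))        # ACDA, CDAA, DAAA each once
print(spectrum_kernel("ACDAAA", "CDAAAC", 4))  # shares CDAA and DAAA -> 2
```

The sparse representation is why the method is practical despite the 20^k-dimensional feature space: a sequence of length n contributes at most n - k + 1 nonzero entries.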

11
Spectrum Kernel
  • Spectrum k-mer kernel
  • each dimension corresponds to a distinct k-mer
  • each sequence is represented as a frequency
    vector of the various k-mers that it contains.
  • k = 4
  • E.g., ACDAAA... =>
  • hash(ACDA)
  • hash(CDAA)
  • hash(DAAA)
  • ...
  • We have our vector of _______ features.

12
Mismatch Kernel
  • Slight modification to the Spectrum kernel
  • Define a parameter m which allows up to m
    mismatches in the counting of occurrences
  • This means that each k-mer instance also
    increments the counts of all k-mers within
    Hamming distance m of it

13
Mismatch Kernel
  • Adding more information to the k-mer kernel
  • extends the k-mer kernel to allow for up to m
    mismatches
  • E.g., (k, m) = (4, 1)
  • ACDAAA... =>
  • hash(?CDA), hash(A?DA), hash(AC?A), hash(ACD?)
  • hash(?DAA), ...
  • hash(?AAA), ...
  • ... (? stands for any single amino-acid
    substitution)
  • We have our vector of _______ features.
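The neighborhood expansion on this slide can be sketched as a brute-force enumeration (an illustration only; the paper's actual implementation traverses a mismatch tree rather than enumerating variants like this):

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20-letter amino-acid alphabet

def mismatch_neighborhood(kmer, m):
    # All k-mers within Hamming distance m of the given k-mer.
    if m == 0:
        return {kmer}
    hood = {kmer}
    for i in range(len(kmer)):
        for a in AMINO_ACIDS:
            if a != kmer[i]:
                hood |= mismatch_neighborhood(kmer[:i] + a + kmer[i + 1:], m - 1)
    return hood

def mismatch_features(seq, k, m):
    # Each k-mer instance increments every entry in its mismatch
    # neighborhood, per the slide's "hash(?CDA), hash(A?DA), ..." example.
    counts = Counter()
    for i in range(len(seq) - k + 1):
        counts.update(mismatch_neighborhood(seq[i:i + k], m))
    return counts

print(len(mismatch_neighborhood("ACDA", 1)))  # 1 + 4 * 19 = 77 k-mers
```

For (k, m) = (4, 1), each window contributes 77 features instead of 1, which is how the kernel gains tolerance to point substitutions.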

14
Putting it all together
  • SVM can be thought of as a magic box
  • Internally, however, it computes the dot product
    between two vectors to find the similarity
    between two sequences.

[Diagram: N training sequences → N vectors, each of
dimension 20^k → SVM Learner → model learnt → SVM
Classifier; unknown test sequence → test vector →
classification result]
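The pipeline in the diagram can be sketched end to end with a toy example (made-up sequences; the helper names `features`, `kernel`, and `query` are introduced here for illustration, and a real run would feed the Gram matrix `K` to an SVM learner):

```python
from collections import Counter

def features(seq, k=4):
    # Spectrum feature vector: counts of every k-mer in the sequence.
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def kernel(s, t, k=4):
    # Similarity of two sequences = dot product of their feature vectors.
    fs, ft = features(s, k), features(t, k)
    return sum(fs[w] * ft[w] for w in fs)

train = ["ACDAAACDA", "GHIKLMGHI", "ACDAACDAA"]  # N training sequences
query = "ACDAAAA"                                # unknown test sequence

# N x N kernel (Gram) matrix the SVM learner would consume
K = [[kernel(s, t) for t in train] for s in train]
# Kernel row between the test sequence and each training sequence,
# which the trained classifier would combine with learned weights
row = [kernel(query, s) for s in train]
print(row)  # high overlap with sequences 1 and 3, none with sequence 2
```

Note that the SVM never sees raw sequences, only these pairwise dot products, which is exactly why swapping spectrum features for mismatch features changes nothing else in the pipeline.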
15
Evaluation: ROC (Explain?)
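The slide leaves ROC unexplained. As a reminder: an ROC curve plots true-positive rate against false-positive rate as the classification threshold varies, and the area under it (the ROC score) equals the probability that a randomly chosen positive is ranked above a randomly chosen negative. A minimal sketch of that pairwise formulation (not the paper's evaluation code):

```python
def roc_auc(scores, labels):
    # Area under the ROC curve, computed as the fraction of
    # (positive, negative) pairs ranked correctly; ties count half.
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

print(roc_auc([0.9, 0.8, 0.3, 0.1], [1, 1, 0, 0]))  # 1.0, perfect ranking
```

A score of 1.0 means every remote homolog outranks every non-homolog; 0.5 is random ordering.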
16
Conclusion
  • A simple use of just the protein sequence gives
    better performance than BLAST
  • Current state-of-the-art methods use profiles as
    an addition to the kernels; they seem to do
    better at this prediction problem than the ones
    that use structure directly.
  • My research revolves around this idea
  • https://wwws.cs.umn.edu/tech_reports/index.cgi?selectedyear=2005&mode=printreport&report_id=05-007

17
Questions?
  • Huzefa Rangwala
  • PhD in Computer Science
  • Email: rangwala@cs.umn.edu