Title: Machine Learning for Protein Classification: Kernel Methods


1
Machine Learning for Protein Classification
Kernel Methods
  • CS 374
  • Rajesh Ranganath
  • 4/10/2008

2
Outline
  • Biological Motivation and Background
  • Algorithmic Concepts
  • Mismatch Kernels
  • Semi-supervised methods

3
Proteins
4
The Protein Problem
  • Primary Structure can be easily determined
  • 3D structure determines function
  • Grouping proteins into structural and
    evolutionary families is difficult
  • Use machine learning to group proteins

5
How to look at amino acid chains
  • Smith-Waterman Idea
  • Mismatch Idea

6
Families
  • Proteins whose evolutionary relationship is
    readily recognizable from the sequence
  • (>25% sequence identity)
  • Families are further subdivided into Proteins
  • Proteins are divided into Species
  • The same protein may be found in several species

Figure: classification hierarchy, from top to bottom: Fold, Superfamily, Family, Proteins (credit: Morten Nielsen, CBS, BioCentrum, DTU)

7
Superfamilies
  • Proteins which are (remotely) evolutionarily
    related
  • Sequence similarity low
  • Share function
  • Share special structural features
  • Relationships between members of a superfamily
    may not be readily recognizable from the sequence
    alone

8
Folds
  • Proteins which have >50% of their secondary structure
    elements arranged in the same order in the
    protein chain and in three dimensions are
    classified as having the same fold
  • No evolutionary relationship between the proteins is implied

9
Protein Classification
  • Given a new protein, can we place it in its
    correct position within an existing protein
    hierarchy?
  • Methods:
    • BLAST / PSI-BLAST
    • Profile HMMs
    • Supervised machine learning methods

Figure: the hierarchy Fold, Superfamily, Family, Proteins, with a new protein whose position in it is unknown ("?")

10
Machine Learning Concepts
  • Supervised Methods
  • Discriminative Vs. Generative Models
  • Transductive Learning
  • Support Vector Machines
  • Kernel Methods
  • Semi-supervised Methods

11
Discriminative and Generative Models
  • Discriminative
  • Generative

12
Transductive Learning
  • Most learning is inductive
  • Given $(x_1, y_1), \ldots, (x_m, y_m)$, for any test input $x^*$
    predict the label $y^*$
  • Transductive learning
  • Given $(x_1, y_1), \ldots, (x_m, y_m)$ and all the test inputs
    $x^*_1, \ldots, x^*_p$, predict the labels $y^*_1, \ldots, y^*_p$

13
Support Vector Machines
  • Popular Discriminative Learning algorithm
  • Optimal geometric margin classifier
  • Can be solved efficiently using the Sequential
    Minimal Optimization algorithm
  • If $x_1, \ldots, x_n$ are the training examples,
    $\mathrm{sign}(\sum_i \lambda_i x_i^T x)$ decides where $x$ falls
  • Train the $\lambda_i$ to achieve the best margin

14
Support Vector Machines (2)
  • Kernelizable: the SVM solution can be written
    entirely in terms of dot products of the inputs
  • $\mathrm{sign}(\sum_i \lambda_i K(x_i, x))$ determines the class of $x$
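
The kernelized decision rule above can be sketched in a few lines of Python. This is a minimal illustration rather than a trained SVM: the coefficients `lambdas` are assumed to be given already, with the training labels folded into their signs, and no bias term is used because the slide's formula has none. The names `linear_kernel` and `decide`, and the toy data, are hypothetical and introduced only for this sketch.

```python
def linear_kernel(x, z):
    """Plain dot product, the simplest possible kernel."""
    return sum(a * b for a, b in zip(x, z))

def decide(x, train_points, lambdas, kernel=linear_kernel):
    """Return +1 or -1 from sign(sum_i lambda_i * K(x_i, x))."""
    score = sum(lam * kernel(xi, x) for xi, lam in zip(train_points, lambdas))
    return 1 if score >= 0 else -1

# Toy, hand-picked values (assumptions, not the output of any training run).
train_points = [(1.0, 1.0), (-1.0, -1.0)]
lambdas = [0.5, -0.5]

label_a = decide((2.0, 0.5), train_points, lambdas)
label_b = decide((-0.3, -2.0), train_points, lambdas)
```

Swapping `linear_kernel` for any other kernel function changes the classifier without touching `decide`, which is the point of writing the rule in terms of $K$ alone.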

15
Kernel Methods
  • $K(x, z) = \phi(x)^T \phi(z)$
  • $\phi$ is the feature mapping
  • $x$ and $z$ are input vectors
  • High-dimensional features do not need to be
    explicitly calculated
  • Think of the kernel function as a similarity measure
    between $x$ and $z$
  • Example
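
The slide's own example is not preserved in this transcript. As a stand-in, and purely as an assumption, the sketch below uses a degree-2 polynomial kernel to show why high-dimensional features need not be computed explicitly: the kernel value obtained from one dot product matches the dot product of the explicit feature vectors. The function names are hypothetical.

```python
def poly2_kernel(x, z):
    """Degree-2 polynomial kernel: (x . z) squared."""
    return sum(a * b for a, b in zip(x, z)) ** 2

def poly2_features(x):
    """Explicit feature map for the same kernel: all products x_i * x_j."""
    return [a * b for a in x for b in x]

x = [1.0, 2.0, 3.0]
z = [4.0, 5.0, 6.0]

# Implicit route: one dot product in the 3-dimensional input space.
implicit = poly2_kernel(x, z)

# Explicit route: dot product in the 9-dimensional feature space.
explicit = sum(p * q for p, q in zip(poly2_features(x), poly2_features(z)))
```

For inputs of dimension d the explicit map has d squared entries, while the kernel still costs only one d-dimensional dot product.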

16
Mismatch Kernel
  • Regions of similar amino acid sequences yield a
    similar tertiary structure of proteins
  • Used as a kernel for an SVM to identify protein
    homologies

17
k-mer based SVMs
  • For a given word size $k$ and mismatch tolerance $l$,
    define
  • $K(X, Y)$ = number of distinct $k$-long word occurrences
    with at most $l$ mismatches
  • Define the normalized mismatch kernel
    $K'(X, Y) = K(X, Y) / \sqrt{K(X, X)\, K(Y, Y)}$
  • An SVM can be learned by supplying this kernel
    function

Example: let $k = 3$, $l = 1$
X = A B A C A R D I
Y = A B R A D A B I
$K(X, Y) = 4$; normalized: $K'(X, Y) = 4/\sqrt{7 \cdot 7} = 4/7$

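
The slide's numbers can be reproduced with a small Python sketch. It rests on an assumption about the counting convention: k-mer occurrences are compared pairwise, a pair counts when it differs in at most l positions, and within a single sequence each unordered pair (including a k-mer with itself) is counted once. That reading was chosen because it yields the slide's 4 and 7. The mismatch kernel of Leslie et al. is defined through a feature map over all k-mers and may give different values. All function names are hypothetical.

```python
from math import sqrt

def kmers(seq, k):
    """All k-long word occurrences of seq, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def mismatches(u, v):
    """Number of positions where two equal-length words differ."""
    return sum(a != b for a, b in zip(u, v))

def pair_count(x, y, k, l):
    """Count pairs of k-mer occurrences with at most l mismatches."""
    a, b = kmers(x, k), kmers(y, k)
    if x == y:
        # Same sequence: count each unordered pair of occurrences once,
        # including every occurrence paired with itself (assumed convention).
        return sum(mismatches(a[i], a[j]) <= l
                   for i in range(len(a)) for j in range(i, len(a)))
    return sum(mismatches(u, v) <= l for u in a for v in b)

def normalized(x, y, k, l):
    """K(X, Y) / sqrt(K(X, X) * K(Y, Y))."""
    return pair_count(x, y, k, l) / sqrt(
        pair_count(x, x, k, l) * pair_count(y, y, k, l))

X, Y = "ABACARDI", "ABRADABI"
k_xy = pair_count(X, Y, 3, 1)
k_norm = normalized(X, Y, 3, 1)
```

The four matching pairs between X and Y are ABA/ABR, ABA/ABI, ABA/ADA, and ACA/ADA.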
18
Disadvantages
  • Determining the 3D structure of proteins is
    difficult and expensive, so labeled examples are scarce
  • Primary sequences are cheap to determine
  • How do we use all this unlabeled data?
  • Use semi-supervised learning based on the cluster
    assumption

19
Semi-Supervised Methods
  • Some examples are labeled
  • Assume labels vary smoothly among all examples

20
Semi-Supervised Methods
  • Some examples are labeled
  • Assume labels vary smoothly among all examples
  • SVMs and other discriminative methods may make
    significant mistakes due to lack of data

24
Semi-Supervised Methods
  • Some examples are labeled
  • Assume labels vary smoothly among all examples

Attempt to contract the distances within each
cluster while keeping intercluster distances
larger

26
Cluster Kernels
  • Semi-supervised methods
  • Neighborhood
  • For each $X$, run PSI-BLAST to get similar sequences →
    $Nbd(X)$
  • Define $\Phi_{nbd}(X) = \frac{1}{|Nbd(X)|} \sum_{X' \in Nbd(X)} \Phi_{original}(X')$
  • Counts of all k-mers matching with at most 1
    difference, averaged over all sequences that are similar to $X$
  • $K_{nbd}(X, Y) = \frac{1}{|Nbd(X)|\,|Nbd(Y)|} \sum_{X' \in Nbd(X)} \sum_{Y' \in Nbd(Y)} K(X', Y')$
  • Next: bagged mismatch
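
The neighborhood kernel above is an average of base-kernel values over two neighborhoods, which can be sketched as follows. Several things here are assumptions for illustration: the neighborhoods are hard-coded rather than produced by PSI-BLAST, each neighborhood is taken to include the sequence itself, and `shared_letters` is a toy stand-in for the base kernel $K$. All names are hypothetical.

```python
def neighborhood_kernel(x, y, nbd, base_kernel):
    """Average base_kernel over all pairs drawn from Nbd(x) and Nbd(y)."""
    nx, ny = nbd[x], nbd[y]
    total = sum(base_kernel(xp, yp) for xp in nx for yp in ny)
    return total / (len(nx) * len(ny))

def shared_letters(a, b):
    """Toy base kernel: number of distinct letters the two strings share."""
    return len(set(a) & set(b))

# Hand-written neighborhoods (in the slides these come from PSI-BLAST).
nbd = {
    "ABAC": ["ABAC", "ABAD"],
    "RDIA": ["RDIA", "RDIB", "ADIA"],
}

k_nbd = neighborhood_kernel("ABAC", "RDIA", nbd, shared_letters)
```

Because only the base kernel is evaluated, the averaged feature vector $\Phi_{nbd}$ never has to be built explicitly.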

27
Bagged Mismatch Kernel
  • Final method
  • Bagged mismatch
  • Run k-means clustering $n$ times, giving assignments
    $c_p(X)$ for $p = 1, \ldots, n$
  • For every $X$ and $Y$, count up the fraction of times
    they are bagged together
  • $K_{bag}(X, Y) = \frac{1}{n} \sum_{p=1}^{n} \mathbf{1}(c_p(X) = c_p(Y))$
  • Combine the bag fraction with the original
    comparison $K(\cdot, \cdot)$
  • $K_{new}(X, Y) = K_{bag}(X, Y) \cdot K(X, Y)$
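
A minimal sketch of the bagged kernel follows, assuming the n clustering runs have already been carried out and their assignments are available as dictionaries. The k-means step itself is not shown. The cluster labels, the protein identifiers, and the constant base kernel are made-up values for illustration, and the function names are hypothetical.

```python
def bagged_kernel(x, y, runs):
    """Fraction of clustering runs that put x and y in the same cluster."""
    return sum(run[x] == run[y] for run in runs) / len(runs)

def combined_kernel(x, y, runs, base_kernel):
    """Product of the bag fraction and the original kernel value."""
    return bagged_kernel(x, y, runs) * base_kernel(x, y)

# Four hypothetical clustering runs; each maps a protein id to a cluster label.
runs = [
    {"P1": 0, "P2": 0, "P3": 1},
    {"P1": 1, "P2": 1, "P3": 0},
    {"P1": 0, "P2": 1, "P3": 1},
    {"P1": 2, "P2": 2, "P3": 2},
]

def constant_base(a, b):
    """Stand-in for the original kernel K; always returns 2.0."""
    return 2.0

k_bag_12 = bagged_kernel("P1", "P2", runs)
k_bag_13 = bagged_kernel("P1", "P3", runs)
k_new_12 = combined_kernel("P1", "P2", runs, constant_base)
```

Cluster labels are only compared within a run, so it does not matter that label numbering differs from one run to the next.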

28
O. Jangmin
29
What works best?
Transductive Setting
30
References
  • C. Leslie et al. Mismatch string kernels for
    discriminative protein classification.
    Bioinformatics, Advance Access, January 22, 2004.
  • J. Weston et al. Semi-supervised protein
    classification using cluster kernels. 2003.
  • Images taken from Wikimedia Commons