Title: Machine Learning for Protein Classification: Kernel Methods
1. Machine Learning for Protein Classification: Kernel Methods
- CS 374
- Rajesh Ranganath
- 4/10/2008
2. Outline
- Biological Motivation and Background
- Algorithmic Concepts
- Mismatch Kernels
- Semi-supervised methods
3. Proteins
4. The Protein Problem
- Primary structure can be easily determined
- 3D structure determines function
- Grouping proteins into structural and evolutionary families is difficult
- Use machine learning to group proteins
5. How to look at amino acid chains
- Smith-Waterman Idea
- Mismatch Idea
6. Families
- Proteins whose evolutionary relationship is readily recognizable from the sequence (>25% sequence identity)
- Families are further subdivided into Proteins
- Proteins are divided into Species
- The same protein may be found in several species
[Figure: classification hierarchy: Fold > Superfamily > Family > Proteins]
Morten Nielsen, CBS, BioCentrum, DTU
7. Superfamilies
- Proteins which are (remotely) evolutionarily related
- Sequence similarity is low
- Share function
- Share special structural features
- Relationships between members of a superfamily may not be readily recognizable from the sequence alone
[Figure: classification hierarchy: Fold > Superfamily > Family > Proteins]
8. Folds
- Proteins which have >50% of their secondary structure elements arranged in the same order in the protein chain and in three dimensions are classified as having the same fold
- No evolutionary relation between the proteins is implied
[Figure: classification hierarchy: Fold > Superfamily > Family > Proteins]
9. Protein Classification
- Given a new protein, can we place it in its correct position within an existing protein hierarchy?
- Methods
- BLAST / PSI-BLAST
- Profile HMMs
- Supervised machine learning methods
[Figure: classification hierarchy with the new protein marked "?": Fold > Superfamily > Family > Proteins]
10. Machine Learning Concepts
- Supervised Methods
- Discriminative vs. Generative Models
- Transductive Learning
- Support Vector Machines
- Kernel Methods
- Semi-supervised Methods
11. Discriminative and Generative Models
12. Transductive Learning
- Most learning is inductive
- Given (x1, y1), ..., (xm, ym), for any test input x, predict the label y
- Transductive learning
- Given (x1, y1), ..., (xm, ym) and all the test inputs x1, ..., xp, predict the labels y1, ..., yp
13. Support Vector Machines
- Popular discriminative learning algorithm
- Optimal geometric margin classifier
- Can be solved efficiently using the Sequential Minimal Optimization (SMO) algorithm
- If x1, ..., xn are the training examples, sign(Σ_i α_i x_i^T x) decides where x falls
- Train the α_i to achieve the best margin
14. Support Vector Machines (2)
- Kernelizable: the SVM solution can be written entirely in terms of dot products of the input; sign(Σ_i α_i K(x_i, x)) determines the class of x
15. Kernel Methods
- K(x, z) = φ(x)^T φ(z)
- φ is the feature mapping
- x and z are input vectors
- High-dimensional features do not need to be explicitly calculated
- Think of the kernel function as a similarity measure between x and z
- Example
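A minimal sketch of this idea, using the quadratic kernel as a stand-in example (the original slide's example image is not preserved here): the kernel value computed implicitly in the input space equals the dot product of explicit degree-2 feature maps.

```python
# Hypothetical illustration: the quadratic kernel K(x, z) = (x . z)^2
# equals the dot product of the explicit feature maps
# phi(x) = (x_i * x_j for all i, j), so the d^2-dimensional features
# never need to be computed.

def kernel(x, z):
    """Quadratic kernel: square of the ordinary dot product (O(d) work)."""
    return sum(xi * zi for xi, zi in zip(x, z)) ** 2

def phi(x):
    """Explicit feature map: all pairwise products x_i * x_j (O(d^2) features)."""
    return [xi * xj for xi in x for xj in x]

x, z = [1.0, 2.0, 3.0], [4.0, 5.0, 6.0]
implicit = kernel(x, z)
explicit = sum(a * b for a, b in zip(phi(x), phi(z)))
print(implicit, explicit)  # both 1024.0
```

The two computations agree term by term, which is why an SVM only ever needs K(x, z) and not φ itself.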
16. Mismatch Kernel
- Regions of similar amino acid sequences yield a similar tertiary structure of proteins
- Used as a kernel for an SVM to identify protein homologies
17. k-mer based SVMs
- For a given word size k and mismatch tolerance l, define
- K(X, Y) = number of distinct k-long word pairs matching with at most l mismatches
- Define the normalized mismatch kernel K'(X, Y) = K(X, Y) / sqrt(K(X, X) K(Y, Y))
- An SVM can be learned by supplying this kernel function
Example (k = 3, l = 1):
X = A B A C A R D I
Y = A B R A D A B I
K(X, Y) = 4, K'(X, Y) = 4 / sqrt(7 * 7) = 4/7
18. Disadvantages
- Determining the 3D structure of proteins is practically impossible
- Primary sequences are cheap to determine
- How do we use all this unlabeled data?
- Use semi-supervised learning based on the cluster assumption
19-25. Semi-Supervised Methods
- Some examples are labeled
- Assume labels vary smoothly among all examples
- SVMs and other discriminative methods may make significant mistakes due to lack of data
- Attempt to contract the distances within each cluster while keeping inter-cluster distances larger
26. Cluster Kernels
- Semi-supervised method: the neighborhood kernel
- For each X, run PSI-BLAST to get a set of similar sequences Nbd(X)
- Define φ_nbd(X) = (1/|Nbd(X)|) Σ_{X' ∈ Nbd(X)} φ_original(X')
- i.e., counts of all k-mers matching with at most 1 difference, averaged over all sequences that are similar to X
- K_nbd(X, Y) = (1/(|Nbd(X)| |Nbd(Y)|)) Σ_{X' ∈ Nbd(X)} Σ_{Y' ∈ Nbd(Y)} K(X', Y')
- Next: the bagged mismatch kernel
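A sketch of the neighborhood kernel, assuming the neighborhoods are already computed (in the method they come from PSI-BLAST; here they are toy inputs) and using a deliberately simple stand-in base kernel:

```python
def base_kernel(x, y, k=3):
    """Toy stand-in base kernel: number of shared distinct k-mers."""
    a = {x[i:i + k] for i in range(len(x) - k + 1)}
    b = {y[i:i + k] for i in range(len(y) - k + 1)}
    return len(a & b)

def knbd(x, y, nbd, K):
    """K_nbd(X, Y): average of K over the two sequences' neighborhoods."""
    nx, ny = nbd.get(x, [x]), nbd.get(y, [y])
    return sum(K(xp, yp) for xp in nx for yp in ny) / (len(nx) * len(ny))

# Toy neighborhoods standing in for PSI-BLAST hits (hypothetical data);
# each sequence's neighborhood includes the sequence itself.
nbd = {
    "ABCDEF": ["ABCDEF", "ABCDXY"],
    "XBCDEF": ["XBCDEF"],
}
print(knbd("ABCDEF", "XBCDEF", nbd, base_kernel))  # averages 3 and 1 -> 2.0
```

Averaging over neighbors smooths the kernel: two sequences compare as similar if their PSI-BLAST neighborhoods overlap, even when the sequences themselves are remote homologs.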
27. Bagged Mismatch Kernel
- Final method: bagged mismatch
- Run k-means clustering n times, giving assignments c_p(X) for p = 1, ..., n
- For every X and Y, count the fraction of runs in which they are bagged together: K_bag(X, Y) = (1/n) Σ_p 1(c_p(X) = c_p(Y))
- Combine the bag fraction with the original comparison K(.,.): K_new(X, Y) = K_bag(X, Y) * K(X, Y)
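A sketch of the bagged kernel, assuming the n clustering runs have already produced assignment maps c_p (in the method these come from k-means over the unlabeled data; here they are toy assignments):

```python
def k_bag(x, y, assignments):
    """Fraction of clustering runs in which x and y share a cluster."""
    return sum(c[x] == c[y] for c in assignments) / len(assignments)

def k_new(x, y, assignments, K):
    """K_new(X, Y) = K_bag(X, Y) * K(X, Y): scale the base kernel by the bag fraction."""
    return k_bag(x, y, assignments) * K(x, y)

# Three toy k-means runs over four sequences (hypothetical assignments).
runs = [
    {"s1": 0, "s2": 0, "s3": 1, "s4": 1},
    {"s1": 0, "s2": 0, "s3": 0, "s4": 1},
    {"s1": 1, "s2": 0, "s3": 1, "s4": 0},
]
print(k_bag("s1", "s2", runs))  # 2/3: together in two of three runs
print(k_bag("s1", "s4", runs))  # 0.0: never clustered together
```

Multiplying by K_bag implements the cluster assumption directly: base-kernel similarity only survives between points that the unlabeled data repeatedly places in the same cluster.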
28. O. Jangmin
29. What works best?
- The transductive setting
30. References
- C. Leslie et al. Mismatch string kernels for discriminative protein classification. Bioinformatics Advance Access, January 22, 2004.
- J. Weston et al. Semi-supervised protein classification using cluster kernels. 2003.
- Images from Wikimedia Commons.