Title: Machine Learning for Protein Classification: Kernel Methods
1. Machine Learning for Protein Classification: Kernel Methods
- CS 374
- Rajesh Ranganath
- 4/10/2008
2. Outline
- Biological Motivation and Background
- Algorithmic Concepts
- Mismatch Kernels
- Semi-supervised methods
3. Proteins
4. The Protein Problem
- Primary structure can be easily determined
- 3D structure determines function
- Grouping proteins into structural and evolutionary families is difficult
- Use machine learning to group proteins
5. How to look at amino acid chains
- Smith-Waterman Idea
- Mismatch Idea
6. Families
- Proteins whose evolutionary relationship is readily recognizable from the sequence (>25% sequence identity)
- Families are further subdivided into Proteins
- Proteins are divided into Species
- The same protein may be found in several species
[Figure: classification hierarchy: Fold > Superfamily > Family > Proteins]
Morten Nielsen, CBS, BioCentrum, DTU
7. Superfamilies
- Proteins which are (remotely) evolutionarily related
- Sequence similarity is low
- Share function
- Share special structural features
- Relationships between members of a superfamily may not be readily recognizable from the sequence alone
[Figure: classification hierarchy: Fold > Superfamily > Family > Proteins]
8. Folds
- Proteins which have >50% of their secondary structure elements arranged in the same order in the protein chain and in three dimensions are classified as having the same fold
- No evolutionary relation between the proteins is implied
[Figure: classification hierarchy: Fold > Superfamily > Family > Proteins]
9. Protein Classification
- Given a new protein, can we place it in its correct position within an existing protein hierarchy?
- Methods
- BLAST / PSI-BLAST
- Profile HMMs
- Supervised machine learning methods
[Figure: classification hierarchy with the new protein marked "?": Fold > Superfamily > Family > Proteins]
10. Machine Learning Concepts
- Supervised Methods
- Discriminative vs. Generative Models
- Transductive Learning
- Support Vector Machines
- Kernel Methods
- Semi-supervised Methods
11. Discriminative and Generative Models
12. Transductive Learning
- Most learning is inductive
- Given (x1, y1), ..., (xm, ym), for any test input x, predict the label y
- Transductive learning
- Given (x1, y1), ..., (xm, ym) and all the test inputs x1, ..., xp, predict the labels y1, ..., yp
13. Support Vector Machines
- Popular discriminative learning algorithm
- Optimal geometric margin classifier
- Can be solved efficiently using the Sequential Minimal Optimization (SMO) algorithm
- If x1, ..., xn are the training examples, sign(Σ_i α_i x_i^T x) decides where x falls
- Train the α_i to achieve the best margin
14. Support Vector Machines (2)
- Kernelizable: the SVM solution can be written entirely in terms of dot products of the input; sign(Σ_i α_i K(x_i, x)) determines the class of x
15. Kernel Methods
- K(x, z) = φ(x)^T φ(z)
- φ is the feature mapping
- x and z are input vectors
- High-dimensional features do not need to be explicitly calculated
- Think of the kernel function as a similarity measure between x and z
- Example
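A minimal sketch of this idea, using the quadratic kernel as a stand-in example (the original slide's example image is not preserved here): the kernel value computed implicitly in the input space equals the dot product of explicit degree-2 feature maps.

```python
# Hypothetical illustration: the quadratic kernel K(x, z) = (x . z)^2
# equals the dot product of the explicit feature maps
# phi(x) = (x_i * x_j for all i, j), so the d^2-dimensional features
# never need to be computed.

def kernel(x, z):
    """Quadratic kernel: square of the ordinary dot product (O(d) work)."""
    return sum(xi * zi for xi, zi in zip(x, z)) ** 2

def phi(x):
    """Explicit feature map: all pairwise products x_i * x_j (O(d^2) features)."""
    return [xi * xj for xi in x for xj in x]

x, z = [1.0, 2.0, 3.0], [4.0, 5.0, 6.0]
implicit = kernel(x, z)
explicit = sum(a * b for a, b in zip(phi(x), phi(z)))
print(implicit, explicit)  # both 1024.0
```

The two computations agree term by term, which is why an SVM only ever needs K(x, z) and not φ itself.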
16. Mismatch Kernel
- Regions of similar amino acid sequences yield a similar tertiary structure of proteins
- Used as a kernel for an SVM to identify protein homologies
17. k-mer based SVMs
- For a given word size k and mismatch tolerance l, define
- K(X, Y) = number of distinct k-long word pairs matching with at most l mismatches
- Define the normalized mismatch kernel K'(X, Y) = K(X, Y) / sqrt(K(X, X) K(Y, Y))
- An SVM can be learned by supplying this kernel function
Example (k = 3, l = 1):
X = A B A C A R D I
Y = A B R A D A B I
K(X, Y) = 4, K'(X, Y) = 4 / sqrt(7 * 7) = 4/7
18. Disadvantages
- Determining the 3D structure of proteins is practically impossible
- Primary sequences are cheap to determine
- How do we use all this unlabeled data?
- Use semi-supervised learning based on the cluster assumption
19-25. Semi-Supervised Methods
- Some examples are labeled
- Assume labels vary smoothly among all examples
- SVMs and other discriminative methods may make significant mistakes due to lack of data
- Attempt to contract the distances within each cluster while keeping inter-cluster distances larger
26. Cluster Kernels
- Semi-supervised method: the neighborhood kernel
- For each X, run PSI-BLAST to get a set of similar sequences Nbd(X)
- Define φ_nbd(X) = (1/|Nbd(X)|) Σ_{X' ∈ Nbd(X)} φ_original(X')
- i.e., counts of all k-mers matching with at most 1 difference, averaged over all sequences that are similar to X
- K_nbd(X, Y) = (1/(|Nbd(X)| |Nbd(Y)|)) Σ_{X' ∈ Nbd(X)} Σ_{Y' ∈ Nbd(Y)} K(X', Y')
- Next: the bagged mismatch kernel
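A sketch of the neighborhood kernel, assuming the neighborhoods are already computed (in the method they come from PSI-BLAST; here they are toy inputs) and using a deliberately simple stand-in base kernel:

```python
def base_kernel(x, y, k=3):
    """Toy stand-in base kernel: number of shared distinct k-mers."""
    a = {x[i:i + k] for i in range(len(x) - k + 1)}
    b = {y[i:i + k] for i in range(len(y) - k + 1)}
    return len(a & b)

def knbd(x, y, nbd, K):
    """K_nbd(X, Y): average of K over the two sequences' neighborhoods."""
    nx, ny = nbd.get(x, [x]), nbd.get(y, [y])
    return sum(K(xp, yp) for xp in nx for yp in ny) / (len(nx) * len(ny))

# Toy neighborhoods standing in for PSI-BLAST hits (hypothetical data);
# each sequence's neighborhood includes the sequence itself.
nbd = {
    "ABCDEF": ["ABCDEF", "ABCDXY"],
    "XBCDEF": ["XBCDEF"],
}
print(knbd("ABCDEF", "XBCDEF", nbd, base_kernel))  # averages 3 and 1 -> 2.0
```

Averaging over neighbors smooths the kernel: two sequences compare as similar if their PSI-BLAST neighborhoods overlap, even when the sequences themselves are remote homologs.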
27. Bagged Mismatch Kernel
- Final method: bagged mismatch
- Run k-means clustering n times, giving assignments c_p(X) for p = 1, ..., n
- For every X and Y, count the fraction of runs in which they are bagged together: K_bag(X, Y) = (1/n) Σ_p 1(c_p(X) = c_p(Y))
- Combine the bag fraction with the original comparison K(.,.): K_new(X, Y) = K_bag(X, Y) * K(X, Y)
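A sketch of the bagged kernel, assuming the n clustering runs have already produced assignment maps c_p (in the method these come from k-means over the unlabeled data; here they are toy assignments):

```python
def k_bag(x, y, assignments):
    """Fraction of clustering runs in which x and y share a cluster."""
    return sum(c[x] == c[y] for c in assignments) / len(assignments)

def k_new(x, y, assignments, K):
    """K_new(X, Y) = K_bag(X, Y) * K(X, Y): scale the base kernel by the bag fraction."""
    return k_bag(x, y, assignments) * K(x, y)

# Three toy k-means runs over four sequences (hypothetical assignments).
runs = [
    {"s1": 0, "s2": 0, "s3": 1, "s4": 1},
    {"s1": 0, "s2": 0, "s3": 0, "s4": 1},
    {"s1": 1, "s2": 0, "s3": 1, "s4": 0},
]
print(k_bag("s1", "s2", runs))  # 2/3: together in two of three runs
print(k_bag("s1", "s4", runs))  # 0.0: never clustered together
```

Multiplying by K_bag implements the cluster assumption directly: base-kernel similarity only survives between points that the unlabeled data repeatedly places in the same cluster.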
28. O. Jangmin
29. What works best?
- The transductive setting
30. References
- C. Leslie et al. Mismatch string kernels for discriminative protein classification. Bioinformatics Advance Access, January 22, 2004.
- J. Weston et al. Semi-supervised protein classification using cluster kernels. 2003.
- Images from Wikimedia Commons.