Title: Protein Classification
1-2. Protein Classification
- Given a new protein, can we place it in its correct position within an existing protein hierarchy?
- Methods:
  - BLAST / PSI-BLAST
  - Profile HMMs
  - Supervised machine learning methods
[Figure: hierarchy of Fold, Superfamily, Family, Proteins, with a new protein marked "?"]
3. PSI-BLAST
- Given a sequence query x and a database D:
  1. Find all pairwise alignments of x to sequences in D.
  2. Collect all matches of x to sequences y with some minimum significance.
  3. Construct a position-specific matrix M. Each sequence y is given a weight so that many similar sequences cannot have much influence on a position (Henikoff & Henikoff 1994).
  4. Using the matrix M, search D for more matches.
  5. Iterate steps 1-4 until convergence.
[Diagram: PSI-BLAST builds the profile matrix M; a schematic sketch follows]
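A minimal sketch of the profile-construction step in one PSI-BLAST round, assuming gap-free matches already aligned to the query. Uniform background frequencies, simple pseudocounts, and caller-supplied sequence weights are simplifications; real PSI-BLAST uses position-based Henikoff weights and E-value filtering.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def position_specific_matrix(matches, weights, pseudocount=1.0):
    """Build a position-specific log-odds matrix M from aligned matches.

    matches : equal-length sequences aligned to the query
    weights : one weight per sequence, downweighting redundant sequences
              (in the spirit of Henikoff & Henikoff 1994)
    Returns M as a list of dicts: M[i][aa] = log-odds score at position i.
    """
    background = 1.0 / len(AMINO_ACIDS)   # uniform background (simplification)
    M = []
    for i in range(len(matches[0])):
        counts = Counter()
        for seq, w in zip(matches, weights):
            counts[seq[i]] += w
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        M.append({aa: math.log(((counts[aa] + pseudocount) / total) / background)
                  for aa in AMINO_ACIDS})
    return M

# One round, schematically: score D against M, keep significant hits,
# rebuild M from the hits, and repeat until the hit set stabilizes.
```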
4. Classification with Profile HMMs
[Figure: the Fold, Superfamily, Family hierarchy again, with the new protein marked "?"]
5. The Fisher Kernel
- Fisher score: $U_X = \nabla_\theta \log P(X \mid H_1, \theta)$
  - Quantifies how each parameter contributes to generating X.
- For two different sequences X and Y, we can compare $U_X$ and $U_Y$:
  $D_F^2(X, Y) = \frac{1}{2\sigma^2} \|U_X - U_Y\|^2$
- Given this distance function, K(X, Y) is defined as a similarity measure:
  $K(X, Y) = \exp(-D_F^2(X, Y))$
- Set $\sigma$ so that the average distance of training sequences $X_i \in H_1$ to sequences $X_j \in H_0$ is 1.
6. The Fisher Kernel
- To train a classifier for a given family $H_1$:
  - Build a profile HMM, $H_1$.
  - $U_X = \nabla_\theta \log P(X \mid H_1, \theta)$ (Fisher score)
  - $D_F^2(X, Y) = \frac{1}{2\sigma^2} \|U_X - U_Y\|^2$ (distance)
  - $K(X, Y) = \exp(-D_F^2(X, Y))$ (akin to a dot product)
  - $L(X) = \sum_{X_i \in H_1} \lambda_i K(X, X_i) - \sum_{X_j \in H_0} \lambda_j K(X, X_j)$
  - Iteratively adjust $\lambda$ to optimize
    $J(\lambda) = \sum_{X_i \in H_1} \lambda_i (2 - L(X_i)) + \sum_{X_j \in H_0} \lambda_j (2 + L(X_j))$
- To classify a query X:
  - Compute $U_X$.
  - Compute $K(X, X_i)$ for all training examples $X_i$ with $\lambda_i \neq 0$ (few).
  - Decide based on whether $L(X) > 0$ (see the sketch below).
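A minimal sketch of the classification step, assuming the Fisher scores $U_X$ have already been computed from a trained profile HMM (the gradient computation itself is model-specific and omitted here). Names and data layout are illustrative, not from the original papers.

```python
import numpy as np

def fisher_kernel(u_x, u_y, sigma):
    """K(X, Y) = exp(-||U_X - U_Y||^2 / (2 sigma^2)) on Fisher scores."""
    d2 = np.sum((u_x - u_y) ** 2) / (2.0 * sigma ** 2)
    return np.exp(-d2)

def discriminant(u_query, support, sigma):
    """L(X) = sum_{H1} lam_i K(X, X_i) - sum_{H0} lam_j K(X, X_j).

    support: list of (fisher_score, lam, label) with label +1 for H1 and
    -1 for H0; only examples with lam != 0 (the support vectors) matter.
    """
    return sum(label * lam * fisher_kernel(u_query, u_i, sigma)
               for u_i, lam, label in support)

def classify(u_query, support, sigma):
    # Positive L(X) assigns the query to family H1.
    return discriminant(u_query, support, sigma) > 0
```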
9. Question
- What is the running time of the Fisher kernel SVM on a query X?
10. k-mer based SVMs
- Leslie, Eskin, Weston, Noble, NIPS 2002
- Highlights:
  - The Fisher kernel $K(X, Y) = \exp(-\frac{1}{2\sigma^2} \|U_X - U_Y\|^2)$ requires an expensive profile alignment: computing $U_X = \nabla_\theta \log P(X \mid H_1, \theta)$ costs $O(|X| \cdot |H_1|)$.
  - Instead, the new kernel K(X, Y) just counts the k-mers (with mismatches) that X and Y have in common: $O(|X|)$ in practice.
  - Off-the-shelf SVM software can be used.
11. k-mer based SVMs
- For a given word size k and mismatch tolerance l, define:
  - K(X, Y) = number of distinct k-long word occurrences, with up to l mismatches, in common between X and Y.
- Define the normalized kernel $K'(X, Y) = K(X, Y) / \sqrt{K(X, X) \, K(Y, Y)}$.
- An SVM can be learned by supplying this kernel function.
Example, with k = 3 and l = 1 (sketched below):
  X = ABACARDI, Y = ABRADABI
  K(X, Y) = 4, so K'(X, Y) = 4 / sqrt(7 · 7) = 4/7.
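A small sketch of the (k, l)-mismatch kernel for short sequences. It expands each k-mer's full mismatch neighborhood explicitly, which is only feasible for tiny alphabets and k; the paper computes the same inner product efficiently with a trie. Note this is the standard mismatch feature map, while the slide's worked example counts distinct shared words, a closely related variant, so the exact numbers may differ.

```python
import math
from collections import Counter
from itertools import combinations, product

def neighborhood(kmer, l, alphabet):
    """All words within Hamming distance <= l of kmer."""
    words = set()
    for m in range(l + 1):
        for positions in combinations(range(len(kmer)), m):
            for subs in product(alphabet, repeat=m):
                w = list(kmer)
                for pos, ch in zip(positions, subs):
                    w[pos] = ch
                words.add("".join(w))
    return words

def feature_map(seq, k, l, alphabet):
    """Phi(X)(w) = number of k-mers of X whose l-mismatch neighborhood has w."""
    phi = Counter()
    for i in range(len(seq) - k + 1):
        phi.update(neighborhood(seq[i:i + k], l, alphabet))
    return phi

def mismatch_kernel(x, y, k, l, alphabet):
    px, py = feature_map(x, k, l, alphabet), feature_map(y, k, l, alphabet)
    return sum(c * py[w] for w, c in px.items())

def normalized_kernel(x, y, k, l, alphabet):
    kxy = mismatch_kernel(x, y, k, l, alphabet)
    return kxy / math.sqrt(mismatch_kernel(x, x, k, l, alphabet) *
                           mismatch_kernel(y, y, k, l, alphabet))

# Toy run on the slide's example, alphabet restricted to the letters used:
letters = sorted(set("ABACARDI" + "ABRADABI"))
print(normalized_kernel("ABACARDI", "ABRADABI", k=3, l=1, alphabet=letters))
```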
12. SVMs will find a few support vectors
After training, the SVM has determined a small set of sequences, the support vectors, which need to be compared with the query sequence X.
13. Benchmarks
14. Semi-Supervised Methods
Generative supervised methods
15. Semi-Supervised Methods
Discriminative supervised methods
16-25. Semi-Supervised Methods (animation frames)
Unsupervised methods. Mixture of centers: data generated by a fixed set of centers (how many?)
26-32. Semi-Supervised Methods (animation frames)
- Some examples are labeled.
- Assume labels vary smoothly among all examples.
- SVMs and other discriminative methods may make significant mistakes due to lack of data.
- Attempt to contract the distances within each cluster while keeping inter-cluster distances large.
33. Semi-Supervised Methods
- Kuang, Ie, Wang, Siddiqi, Freund, Leslie 2005: a PSI-BLAST profile-based method
- Weston, Leslie, Elisseeff, Noble, NIPS 2003: cluster kernels
34. (semi) 1. Profile k-mer based SVMs
[Diagram: PSI-BLAST produces the profile M]
- For each sequence X:
  - Obtain the PSI-BLAST profile $Q(X) = \{p_i(\beta)\}$, for $\beta$ an amino acid and $1 \le i \le |X|$.
  - For every k-mer $x_j \ldots x_{j+k-1}$ in X, define the $\sigma$-neighborhood
    $M_{k,\sigma}(Q[x_j \ldots x_{j+k-1}]) = \{b_1 \ldots b_k : -\sum_{i=0}^{k-1} \log p_{j+i}(b_i) < \sigma\}$
- Define K(X, Y): for each word $b_1 \ldots b_k$ matching m times in X and n times in Y, add $mn$ (a sketch follows).
- In practice, each k-mer is allowed at most 2 mismatches, and K(X, Y) can be computed quickly, in $O(k^2 \, 20^2 \, (|X| + |Y|))$.
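A naive sketch of the profile kernel above, assuming each profile is a list of per-position dicts $p_i$ with nonzero probabilities (e.g. pseudocounted). Enumerating all $20^k$ words is only workable for very small k; the quoted $O(k^2 \, 20^2 \, (|X| + |Y|))$ bound comes from the mismatch-restricted computation in the paper, not from this enumeration.

```python
import math
from collections import Counter
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def in_neighborhood(window, word, sigma):
    """b1..bk is in the sigma-neighborhood of the profile window
    Q[x_j .. x_{j+k-1}] iff -sum_i log p_{j+i}(b_i) < sigma."""
    return -sum(math.log(p[b]) for p, b in zip(window, word)) < sigma

def profile_feature_map(profile, k, sigma):
    """Phi(X)(b1..bk) = number of profile k-mer windows of Q(X) whose
    sigma-neighborhood contains b1..bk (naive full enumeration)."""
    phi = Counter()
    for j in range(len(profile) - k + 1):
        window = profile[j:j + k]
        for word in map("".join, product(AMINO_ACIDS, repeat=k)):
            if in_neighborhood(window, word, sigma):
                phi[word] += 1
    return phi

def profile_kernel(profile_x, profile_y, k=3, sigma=4.0):
    px = profile_feature_map(profile_x, k, sigma)
    py = profile_feature_map(profile_y, k, sigma)
    return sum(m * py[w] for w, m in px.items())   # adds m*n per shared word
```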
35. (semi) 1. Discriminative motifs
- Under this kernel K(X, Y), sequence X is mapped to $\Phi_{k,\sigma}(X)$, a vector in $20^k$ dimensions:
  $\Phi_{k,\sigma}(X)(b_1 \ldots b_k)$ = number of k-mers in Q(X) whose neighborhood includes $b_1 \ldots b_k$.
- Then, the SVM learns a discriminating hyperplane with normal vector
  $v = \sum_{i=1}^{N} (\pm)\, \lambda_i \Phi_{k,\sigma}(X^{(i)})$
- Consider a profile k-mer $Q[x_j \ldots x_{j+k-1}]$: its contribution to v is
  $\langle \Phi_{k,\sigma}(Q[x_j \ldots x_{j+k-1}]), v \rangle$
- Consider a position i in X: count up the contributions of all words containing $x_i$:
  $g(x_i) = \sum_{j=1}^{k} \max\{0, \, \langle \Phi_{k,\sigma}(Q[x_{i-k+j} \ldots x_{i+j-1}]), v \rangle\}$
- Sort these contributions across all positions of all sequences to pick important positions, or discriminative motifs (see the sketch below).
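A sketch of the positional scoring step, assuming the per-window feature maps and the SVM normal vector v (stored sparsely as a dict of word weights) are already available; the indexing convention is one plausible reading of the formula above.

```python
def window_contribution(phi_window, v):
    """<Phi(window), v>, restricted to the window's nonzero entries."""
    return sum(count * v.get(word, 0.0) for word, count in phi_window.items())

def positional_scores(window_phis, seq_len, k, v):
    """g(x_i): sum of positive contributions of every k-mer window covering i.

    window_phis[j] is the feature map of the window starting at position j.
    """
    g = [0.0] * seq_len
    for j, phi in enumerate(window_phis):
        c = max(0.0, window_contribution(phi, v))
        for i in range(j, j + k):        # window j covers positions j .. j+k-1
            g[i] += c
    return g                              # sort positions by g to find motifs
```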
37-38. (semi) 2. Cluster Kernels
- Two (more!) methods:
- Neighborhood:
  - For each X, run PSI-BLAST to get similar sequences: Nbd(X).
  - Define $\Phi_{nbd}(X) = \frac{1}{|Nbd(X)|} \sum_{X' \in Nbd(X)} \Phi_{original}(X')$: counts of all k-mers matching, with at most 1 difference, all sequences that are similar to X.
  - $K_{nbd}(X, Y) = \frac{1}{|Nbd(X)|\,|Nbd(Y)|} \sum_{X' \in Nbd(X)} \sum_{Y' \in Nbd(Y)} K(X', Y')$
- Bagged mismatch (see the sketch below):
  - Run k-means clustering n times, giving assignments $c_p(X)$ for $p = 1, \ldots, n$.
  - For every X and Y, count up the fraction of times they are bagged together:
    $K_{bag}(X, Y) = \frac{1}{n} \sum_p \mathbf{1}(c_p(X) = c_p(Y))$
  - Combine the bag fraction with the original comparison K(·, ·):
    $K_{new}(X, Y) = K_{bag}(X, Y) \cdot K(X, Y)$
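A minimal sketch of the bagged-mismatch combination, assuming vector representations of the sequences (for k-means) and a precomputed base kernel matrix are at hand; the number of runs, cluster count, and seeding are illustrative choices.

```python
import numpy as np
from sklearn.cluster import KMeans

def bagged_kernel(features, base_K, n_runs=10, n_clusters=5, seed=0):
    """K_new = K_bag * K, where K_bag(X, Y) is the fraction of k-means runs
    (differing only in random initialization) that cluster X and Y together.

    features : (m, d) array of vector representations of the m sequences
    base_K   : (m, m) precomputed base kernel matrix K(., .)
    """
    m = len(features)
    bag = np.zeros((m, m))
    for p in range(n_runs):
        labels = KMeans(n_clusters=n_clusters, n_init=10,
                        random_state=seed + p).fit_predict(features)
        bag += (labels[:, None] == labels[None, :])   # 1(c_p(X) == c_p(Y))
    return (bag / n_runs) * base_K
```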
39. Some Benchmarks
40. Google-like homology search
- The internet and the network of protein homologies have some similarity: both are scale-free.
- Given a query X, Google ranks webpages by a flow algorithm:
  - From each webpage W, linked neighbors receive flow.
  - At time t+1, W sends to its neighbors the flow it received at time t.
- This defines a finite, ergodic, aperiodic Markov chain:
  - Its stationary distribution can be found efficiently as the left eigenvector with eigenvalue 1.
  - Start with an arbitrary probability distribution, and repeatedly multiply by the transition matrix (sketched below).
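A minimal sketch of the power iteration just described; convergence relies on the chain being ergodic and aperiodic, as the slide assumes.

```python
import numpy as np

def stationary_distribution(P, tol=1e-10, max_iter=10_000):
    """Power iteration for a row-stochastic transition matrix P:
    repeatedly apply y <- y P until y converges to the stationary
    distribution (the left eigenvector of P with eigenvalue 1)."""
    n = P.shape[0]
    y = np.full(n, 1.0 / n)          # arbitrary starting distribution
    for _ in range(max_iter):
        y_next = y @ P
        if np.abs(y_next - y).sum() < tol:
            break
        y = y_next
    return y_next
```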
41. Google-like homology search
- Weston, Elisseeff, Zhu, Leslie, Noble, PNAS 2004
- RANKPROP algorithm for protein homology (see the sketch below):
  - First, compute a matrix $K_{ij}$ of PSI-BLAST homology between proteins i and j, normalized so that $\sum_j K_{ji} = 1$.
  - Initialization: $y_1(0) = 1$; $y_i(0) = 0$ for $i > 1$ (sequence 1 is the query).
  - For t = 0, 1, ...:
    - For i = 2 to m: $y_i(t+1) = K_{1i} + \alpha \sum_j K_{ji} \, y_j(t)$
  - In the end, let $y_i$ be the ranking score for the similarity of sequence i to sequence 1.
  - ($\alpha = 0.95$ is good.)
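A sketch of the RANKPROP iteration, assuming the normalized homology matrix is given; keeping the query's own score pinned at 1 is one plausible reading of the update, which the slide states only for i = 2, ..., m.

```python
import numpy as np

def rankprop(K, alpha=0.95, n_iter=20):
    """RANKPROP ranking sketch; sequence 1 (index 0) is the query.

    K : (m, m) PSI-BLAST homology matrix, normalized so that
        sum_j K[j, i] == 1 for every column i.
    Returns ranking scores y_i for similarity of each sequence to the query.
    """
    m = K.shape[0]
    y = np.zeros(m)
    y[0] = 1.0                           # y_1(0) = 1, all others 0
    for _ in range(n_iter):
        y = K[0, :] + alpha * (K.T @ y)  # y_i <- K_1i + alpha * sum_j K_ji y_j
        y[0] = 1.0                       # keep the query's activation fixed
    return y
```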
42. Google-like homology search
For a given protein family, what fraction of the family's true members are ranked higher than the first 50 non-members?
43. Protein Structure Prediction
44. Protein Structure Determination
- Experimental:
  - X-ray crystallography
  - NMR spectroscopy
- Computational structure prediction:
  - (The Holy Grail)
  - Sequence implies structure; therefore, in principle, we can predict the structure from the sequence alone.
45. Protein Structure Prediction
- Ab initio: use just first principles (energy, geometry, and kinematics).
- Homology: find the best match to a database of sequences with known 3-D structure.
- Threading
- Meta-servers and other methods
46. Ab initio Prediction
- Sampling the global conformation space:
  - Lattice models / discrete-state models
  - Molecular dynamics
- Picking native conformations with an energy function:
  - Solvation model: how the protein interacts with water
  - Pair interactions between amino acids
- Predicting secondary structure:
  - Local homology
  - Fragment libraries
47. Lattice String Folding
- HP model: the main modeled force is hydrophobic attraction (a sketch of the energy function follows).
- Folding is NP-hard in both the 2-D square and the 3-D cubic lattice.
- Constant-factor approximation algorithms exist.
- Not so relevant biologically.
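A minimal sketch of scoring a fold in the 2-D HP model: the energy is minus the number of adjacent, non-consecutive H-H contacts on the lattice. This evaluates a given conformation under standard HP conventions; it is not a folding algorithm.

```python
def hp_energy(sequence, path):
    """Energy of a 2-D lattice fold in the HP model.

    sequence : string over {'H', 'P'} (hydrophobic / polar)
    path     : list of (x, y) lattice points, one per residue
    Returns -(number of adjacent non-consecutive H-H contacts),
    or None if the fold is not self-avoiding.
    """
    if len(set(path)) != len(path):             # self-avoidance violated
        return None
    contacts = 0
    for i in range(len(sequence)):
        for j in range(i + 2, len(sequence)):   # skip chain neighbors
            if sequence[i] == 'H' == sequence[j]:
                (xi, yi), (xj, yj) = path[i], path[j]
                if abs(xi - xj) + abs(yi - yj) == 1:   # lattice-adjacent
                    contacts += 1
    return -contacts

# Example: a U-shaped fold bringing residues 0 and 3 into contact.
print(hp_energy("HPPH", [(0, 0), (1, 0), (1, 1), (0, 1)]))  # -1
```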
48. Lattice String Folding