Title: Protein Homology Detection Using String Alignment Kernels
1 Protein Homology Detection Using String Alignment Kernels
- Jean-Philippe Vert, Tatsuya Akutsu
2 Learning Sequence-Based Protein Classification
- Problem: classification of protein sequence data into families and superfamilies
- Motivation: many proteins have been sequenced, but their structure/function often remains unknown
- Motivation: infer structure/function from sequence-based classification
3 Sequence Data Versus Structure and Function
Sequences for the four chains of human hemoglobin:
>1A3N:A HEMOGLOBIN
VLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLSHGSAQVKGHGKKVADALTNAVAHVDDMPNALSALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHLPAEFTPAVHASLDKFLASVSTVLTSKYR
>1A3N:B HEMOGLOBIN
VHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPKVKAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFGKEFTPPVQAAYQKVVAGVANALAHKYH
>1A3N:C HEMOGLOBIN
VLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLSHGSAQVKGHGKKVADALTNAVAHVDDMPNALSALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHLPAEFTPAVHASLDKFLASVSTVLTSKYR
>1A3N:D HEMOGLOBIN
VHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPKVKAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFGKEFTPPVQAAYQKVVAGVANALAHKYH
Tertiary structure: (figure of the 1A3N structure)
Function: oxygen transport
4 Structural Hierarchy
- SCOP: Structural Classification of Proteins
- Interested in superfamily-level homology: remote evolutionary relationships
Difficult!
5 Learning Problem
- Reduce to a binary classification problem: positive (+) if the example belongs to a family (e.g. G proteins) or superfamily (e.g. nucleoside triphosphate hydrolases), negative (-) otherwise
- Focus on remote homology detection
- Use a supervised learning approach to train a classifier
Diagram: labeled training sequences -> learning algorithm -> classification rule
6 Two supervised learning approaches to classification
- Generative model approach
  - Build a generative model for a single protein family; classify each candidate sequence based on its fit to the model
  - Uses only positive training sequences
- Discriminative approach
  - The learning algorithm tries to learn a decision boundary between positive and negative examples
  - Uses both positive and negative training sequences
7 Targets of the current methods
8 Discriminative Learning
- Discriminative approach
  - Train on both positive and negative examples to learn a classifier
- Modern computational learning theory
  - Goal: learn a classifier that generalizes well to new examples
  - Do not use the training data to estimate the parameters of a probability distribution (curse of dimensionality)
9 SVM for protein classification
- Want to define a feature map from the space of protein sequences to a vector space
- Goals
  - Computational efficiency
  - Competitive performance with known methods
  - No reliance on a generative model: a general method for sequence-based classification problems
10 Summary of the current kernel methods
- Feature vector from HMM
- Fisher kernel (Jaakkola et al., 2000)
- Marginalized kernel (Tsuda et al., 2002)
- Feature vector from sequence
- Spectrum kernel (Leslie et al., 2002)
- Mismatch kernel (Leslie et al., 2003)
- Feature vector from other score
- SVM pairwise (Liao & Noble, 2002)
11 String Alignment Kernels
- Observation: the Smith-Waterman (SW) alignment score provides a measure of similarity that incorporates biological knowledge of protein evolution
- It cannot be used as a kernel because it lacks positive definiteness
- A family of local alignment (LA) kernels that mimic the SW score is presented
12 LA Kernels
- Other kernels: choose a feature vector representation, then obtain the kernel as the inner product of the vectors
- LA kernel: measure similarity directly, then obtain a valid kernel from it
13 LA Kernels
- Pair score kernel $K_a^\beta$, for $\beta > 0$, where $s$ is a symmetric similarity score:
$$K_a^\beta(x,y) = \begin{cases} 0 & \text{if } |x| \neq 1 \text{ or } |y| \neq 1, \\ \exp\!\big(\beta\, s(x,y)\big) & \text{otherwise.} \end{cases}$$
- Gap kernel $K_g^\beta$ for the affine gap penalty model:
$$K_g^\beta(x,y) = \exp\!\Big(\beta\big(g(|x|) + g(|y|)\big)\Big), \quad \text{with } g(0)=0,\ g(n) = d + e(n-1) \text{ for } n \geq 1,$$
where $d$ is the gap opening cost and $e$ the gap extension cost.
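As a concrete reading of these definitions, here is a minimal Python sketch of the two elementary kernels. The similarity function s and the parameters beta, d, e are stand-ins (e.g. a BLOSUM62 score with typical gap costs), not values taken from the slides; the sign convention for the gap costs is noted in the comments.

```python
import math

def k_a(x, y, s, beta):
    """Pair score kernel K_a^beta: non-zero only when x and y are
    single residues, in which case the similarity s(x, y) is
    exponentiated."""
    if len(x) != 1 or len(y) != 1:
        return 0.0
    return math.exp(beta * s(x, y))

def g(n, d, e):
    """Affine gap penalty: g(0) = 0, g(n) = d + e*(n - 1) for n >= 1."""
    return 0.0 if n == 0 else d + e * (n - 1)

def k_g(x, y, beta, d, e):
    """Gap kernel K_g^beta over two unaligned stretches x and y.
    If d and e are positive costs they should lower the alignment
    score, so pass them with a negative sign (conventions vary)."""
    return math.exp(beta * (g(len(x), d, e) + g(len(y), d, e)))
```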
14 LA Kernels
- Kernel convolution: $(K_1 \star K_2)(x,y) = \sum_{x = x_1 x_2,\; y = y_1 y_2} K_1(x_1,y_1)\, K_2(x_2,y_2)$
- For $n \geq 1$, the string kernel can be expressed as
$$K^{(n)} = K_0 \star \big(K_a^\beta \star K_g^\beta\big)^{\star (n-1)} \star K_a^\beta \star K_0:$$
an initial part $K_0$, a succession of $n$ aligned residues $K_a^\beta$ with $n-1$ possible gaps $K_g^\beta$, and a terminal part $K_0$.
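This composition can be written out literally. The brute-force sketch below is exponential in sequence length and for illustration only (the efficient O(|x||y|) computation appears on slide 22); it assumes ka and kg are the elementary kernels sketched under slide 13, partially applied so they take only (x, y).

```python
def conv(K1, K2):
    """Kernel convolution: (K1 * K2)(x, y) sums K1(x1, y1) * K2(x2, y2)
    over every split x = x1 + x2 and y = y1 + y2."""
    def k(x, y):
        return sum(K1(x[:i], y[:j]) * K2(x[i:], y[j:])
                   for i in range(len(x) + 1)
                   for j in range(len(y) + 1))
    return k

def k_0(x, y):
    return 1.0  # K_0: trivial kernel for the flanking parts

def k_n(n, ka, kg):
    """K^(n) = K_0 * (K_a * K_g)^{*(n-1)} * K_a * K_0, for n >= 1:
    n aligned residues, n-1 possible gaps, flanked by K_0."""
    core = ka
    for _ in range(n - 1):
        core = conv(conv(core, kg), ka)
    return conv(conv(k_0, core), k_0)
```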
15 LA Kernels
$$K_{LA}^\beta = \sum_{n=0}^{\infty} K^{(n)}$$
The sum is convergent for any $x$ and $y$ because only a finite number of its terms are non-null. It is a point-wise limit of Mercer kernels, and is therefore itself a valid kernel.
16 LA with SW score
- $\pi$: a local alignment of $x$ and $y$; $p(x,y,\pi)$: the score of local alignment $\pi$; $\Pi(x,y)$: the set of all possible local alignments of $x$ and $y$
- The LA kernel sums over all alignments, $K_{LA}^\beta(x,y) = \sum_{\pi \in \Pi(x,y)} \exp\!\big(\beta\, p(x,y,\pi)\big)$, whereas the SW score keeps only the best one:
$$SW(x,y) = \max_{\pi \in \Pi(x,y)} p(x,y,\pi) = \lim_{\beta \to \infty} \tfrac{1}{\beta} \ln K_{LA}^\beta(x,y)$$
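The limit can be checked numerically. In this toy snippet the alignment scores are made up; it only illustrates that (1/beta) ln sum exp(beta * p) is a soft-max that tends to the best score as beta grows.

```python
import numpy as np

def log_sum_exp(v):
    """Numerically stable log(sum(exp(v)))."""
    m = np.max(v)
    return m + np.log(np.sum(np.exp(v - m)))

# Hypothetical alignment scores p(x, y, pi) for a few alignments pi.
scores = np.array([12.0, 9.5, 9.0, 4.0])

# (1/beta) * ln K_LA^beta approaches max(scores) = SW(x, y) as beta grows.
for beta in (0.1, 1.0, 10.0, 100.0):
    print(beta, log_sum_exp(beta * scores) / beta)
```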
17 Why SW cannot be a kernel
- 1. SW keeps only the best alignment, instead of summing over all alignments of x and y.
- 2. The logarithm can destroy the property of being positive definite.
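Point 2 is easy to demonstrate: K below is a valid (positive semidefinite) Gram matrix, yet its elementwise logarithm has a negative eigenvalue.

```python
import numpy as np

K = np.array([[1.0, 0.9],
              [0.9, 1.0]])
print(np.linalg.eigvalsh(K))          # [0.1, 1.9]: positive semidefinite
print(np.linalg.eigvalsh(np.log(K)))  # [-0.105, 0.105]: not PSD
```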
18 Example
Figure: example comparing the LA kernel value with the SW score.
19 SVM-pairwise
Diagram: in SVM-pairwise, each sequence is represented by its vector of SW scores against the training sequences, e.g. x -> (0.9, 0.05, 0.3, 0.2) and y -> (0.2, 0.3, 0.1, 0.01), and the kernel value is the inner product of these vectors, here 0.227. The LA kernel side of the diagram (pictured as a pair HMM) compares x and y directly, yielding 0.253 in this example.
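The SVM-pairwise kernel value in the diagram is just the inner product of the two SW-score feature vectors, which can be checked directly:

```python
import numpy as np

# Feature vectors of SW scores against the training sequences,
# taken from the diagram above.
phi_x = np.array([0.9, 0.05, 0.3, 0.2])
phi_y = np.array([0.2, 0.3, 0.1, 0.01])
print(phi_x @ phi_y)  # 0.227, the kernel value shown in the diagram
```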
20 Diagonal Dominance Issue
- K(x,x) is easily orders of magnitude larger than K(x,y) even for very similar sequences x and y, which biases the performance of the SVM.
21 Diagonal Dominance Issue
- (1) The eigen kernel LA-eig
  - a. Obtained by subtracting from the diagonal the smallest negative eigenvalue of the training Gram matrix, if there are negative eigenvalues
  - b. LA-eig is equal to $K_{LA}^\beta$ except possibly on the diagonal
- (2) The empirical kernel map LA-ekm: represent each sequence by its vector of kernel values against the training sequences and take inner products
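A minimal sketch of both fixes, assuming K is the training Gram matrix as a NumPy array; the function names are mine, not the paper's.

```python
import numpy as np

def la_eig(K):
    """LA-eig: if the Gram matrix has negative eigenvalues, subtract
    the smallest one from the diagonal; off-diagonal entries are
    unchanged and the result is positive semidefinite."""
    lam_min = np.linalg.eigvalsh(K).min()
    if lam_min < 0:
        K = K - lam_min * np.eye(K.shape[0])
    return K

def la_ekm(K):
    """LA-ekm: empirical kernel map -- each sequence is represented
    by its row of kernel values against the training set, and the
    new kernel is the inner product of those rows."""
    return K @ K.T
```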
22 Methods
- Implementation
  - The kernel $K_{LA}^\beta$, and therefore LA-eig, can be computed with complexity O(|x| |y|), using dynamic programming via a slight modification of the SW algorithm (a sketch follows this list)
  - Normalization
- Dataset
  - 4352 sequences extracted from the Astral database (www.cs.columbia.edu/compbio/svmpairwise), grouped into families and superfamilies
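A sketch of the O(|x||y|) dynamic programme, with sums of exponentiated scores replacing the maxima of SW. The recursion follows the paper's construction as I understand it; the substitution score and gap parameters are placeholders (a real run would use BLOSUM62 with the usual gap costs), and d, e are treated as positive penalties.

```python
import numpy as np

def la_kernel(x, y, s, beta=0.5, d=11.0, e=1.0):
    """LA kernel K_LA^beta(x, y): sum of exp(beta * score) over all
    local alignments, with affine gap penalty g(n) = d + e*(n-1)."""
    n, m = len(x), len(y)
    go, ge = np.exp(-beta * d), np.exp(-beta * e)  # gap open/extend factors
    # M: alignments ending with x_i aligned to y_j; X, Y: alignments
    # ending in a gap; X2, Y2: alignments that terminated earlier.
    M, X, Y, X2, Y2 = (np.zeros((n + 1, m + 1)) for _ in range(5))
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = np.exp(beta * s(x[i - 1], y[j - 1]))
            M[i, j] = sub * (1 + X[i-1, j-1] + Y[i-1, j-1] + M[i-1, j-1])
            X[i, j] = go * M[i-1, j] + ge * X[i-1, j]
            Y[i, j] = go * (M[i, j-1] + X[i, j-1]) + ge * Y[i, j-1]
            X2[i, j] = M[i-1, j] + X2[i-1, j]
            Y2[i, j] = M[i, j-1] + X2[i, j-1] + Y2[i, j-1]
    return 1 + X2[n, m] + Y2[n, m] + M[n, m]

# Toy usage with a +2/-1 match/mismatch score standing in for BLOSUM62:
score = lambda a, b: 2.0 if a == b else -1.0
print(la_kernel("VLSPADK", "VLSEADK", score, beta=0.5, d=3.0, e=1.0))
```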
23 ROC Curve
24 ROC Curve
25 Summary for the kernels