Title: Protein Homology Detection Using String Alignment Kernels
1 Protein Homology Detection Using String Alignment Kernels
- Jean-Philippe Vert, Tatsuya Akutsu
2 Learning Sequence-Based Protein Classification
- Problem: classification of protein sequence data into families and superfamilies
- Motivation: many proteins have been sequenced, but their structure/function often remains unknown
- Motivation: infer structure/function from sequence-based classification
3 Sequence Data Versus Structure and Function
Sequences for the four chains of human hemoglobin:
>1A3N:A HEMOGLOBIN
VLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLSHGSAQVKGHGKKVADALTNAVAHVDDMPNALSALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHLPAEFTPAVHASLDKFLASVSTVLTSKYR
>1A3N:B HEMOGLOBIN
VHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPKVKAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFGKEFTPPVQAAYQKVVAGVANALAHKYH
>1A3N:C HEMOGLOBIN
VLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLSHGSAQVKGHGKKVADALTNAVAHVDDMPNALSALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHLPAEFTPAVHASLDKFLASVSTVLTSKYR
>1A3N:D HEMOGLOBIN
VHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPKVKAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFGKEFTPPVQAAYQKVVAGVANALAHKYH
Tertiary structure: (figure of the 1A3N structure)
Function: oxygen transport
4 Structural Hierarchy
- SCOP: Structural Classification of Proteins
- Interested in superfamily-level homology: remote evolutionary relationships
Difficult!
5 Learning Problem
- Reduce to a binary classification problem: positive (+) if the example belongs to a family (e.g. G proteins) or superfamily (e.g. nucleoside triphosphate hydrolases), negative (-) otherwise
- Focus on remote homology detection
- Use a supervised learning approach to train a classifier
Diagram: labeled training sequences -> learning algorithm -> classification rule
6 Two supervised learning approaches to classification
- Generative model approach
  - Build a generative model for a single protein family; classify each candidate sequence based on its fit to the model
  - Uses only positive training sequences
- Discriminative approach
  - The learning algorithm tries to learn a decision boundary between positive and negative examples
  - Uses both positive and negative training sequences
7 Targets of the current methods
8 Discriminative Learning
- Discriminative approach
  - Train on both positive and negative examples to learn a classifier
- Modern computational learning theory
  - Goal: learn a classifier that generalizes well to new examples
  - Do not use the training data to estimate the parameters of a probability distribution (curse of dimensionality)
9 SVM for protein classification
- Want to define a feature map from the space of protein sequences to a vector space
- Goals
  - Computational efficiency
  - Competitive performance with known methods
  - No reliance on a generative model: a general method for sequence-based classification problems
10 Summary of the current kernel methods
- Feature vector from HMM
- Fisher kernel (Jaakkola et al., 2000)
- Marginalized kernel (Tsuda et al., 2002)
- Feature vector from sequence
- Spectrum kernel (Leslie et al., 2002)
- Mismatch kernel (Leslie et al., 2003)
- Feature vector from other score
- SVM pairwise (Liao & Noble, 2002)
11 String Alignment Kernels
- Observation: the Smith-Waterman (SW) alignment score provides a measure of similarity that incorporates biological knowledge of protein evolution
- It cannot be used as a kernel because it lacks positive definiteness
- A family of local alignment (LA) kernels that mimic the SW score is presented
12 LA Kernels
- Other kernels: choose a feature vector representation, then obtain the kernel as the inner product of the vectors
- LA kernel: measure similarity directly, then obtain a valid kernel from it
13 LA Kernels
- Pair score kernel $K_a^\beta$, for $\beta > 0$, where $s$ is a symmetric similarity score:
$$K_a^\beta(x,y) = \begin{cases} 0 & \text{if } |x| \neq 1 \text{ or } |y| \neq 1, \\ \exp\!\big(\beta\, s(x,y)\big) & \text{otherwise.} \end{cases}$$
- Gap kernel $K_g^\beta$ for the affine gap penalty model:
$$K_g^\beta(x,y) = \exp\!\Big(\beta\big(g(|x|) + g(|y|)\big)\Big), \quad \text{with } g(0)=0,\ g(n) = d + e(n-1) \text{ for } n \geq 1,$$
where $d$ is the gap opening cost and $e$ the gap extension cost.
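As a concrete reading of these definitions, here is a minimal Python sketch of the two elementary kernels. The similarity function s and the parameters beta, d, e are stand-ins (e.g. a BLOSUM62 score with typical gap costs), not values taken from the slides; the sign convention for the gap costs is noted in the comments.

```python
import math

def k_a(x, y, s, beta):
    """Pair score kernel K_a^beta: non-zero only when x and y are
    single residues, in which case the similarity s(x, y) is
    exponentiated."""
    if len(x) != 1 or len(y) != 1:
        return 0.0
    return math.exp(beta * s(x, y))

def g(n, d, e):
    """Affine gap penalty: g(0) = 0, g(n) = d + e*(n - 1) for n >= 1."""
    return 0.0 if n == 0 else d + e * (n - 1)

def k_g(x, y, beta, d, e):
    """Gap kernel K_g^beta over two unaligned stretches x and y.
    If d and e are positive costs they should lower the alignment
    score, so pass them with a negative sign (conventions vary)."""
    return math.exp(beta * (g(len(x), d, e) + g(len(y), d, e)))
```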
14 LA Kernels
- Kernel convolution: $(K_1 \star K_2)(x,y) = \sum_{x = x_1 x_2,\; y = y_1 y_2} K_1(x_1,y_1)\, K_2(x_2,y_2)$
- For $n \geq 1$, the string kernel can be expressed as
$$K^{(n)} = K_0 \star \big(K_a^\beta \star K_g^\beta\big)^{\star (n-1)} \star K_a^\beta \star K_0:$$
an initial part $K_0$, a succession of $n$ aligned residues $K_a^\beta$ with $n-1$ possible gaps $K_g^\beta$, and a terminal part $K_0$.
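This composition can be written out literally. The brute-force sketch below is exponential in sequence length and for illustration only (the efficient O(|x||y|) computation appears on slide 22); it assumes ka and kg are the elementary kernels sketched under slide 13, partially applied so they take only (x, y).

```python
def conv(K1, K2):
    """Kernel convolution: (K1 * K2)(x, y) sums K1(x1, y1) * K2(x2, y2)
    over every split x = x1 + x2 and y = y1 + y2."""
    def k(x, y):
        return sum(K1(x[:i], y[:j]) * K2(x[i:], y[j:])
                   for i in range(len(x) + 1)
                   for j in range(len(y) + 1))
    return k

def k_0(x, y):
    return 1.0  # K_0: trivial kernel for the flanking parts

def k_n(n, ka, kg):
    """K^(n) = K_0 * (K_a * K_g)^{*(n-1)} * K_a * K_0, for n >= 1:
    n aligned residues, n-1 possible gaps, flanked by K_0."""
    core = ka
    for _ in range(n - 1):
        core = conv(conv(core, kg), ka)
    return conv(conv(k_0, core), k_0)
```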
15 LA Kernels
$$K_{LA}^\beta = \sum_{n=0}^{\infty} K^{(n)}$$
The sum is convergent for any $x$ and $y$ because only a finite number of its terms are non-null. It is a point-wise limit of Mercer kernels, and is therefore itself a valid kernel.
16 LA with SW score
- $\pi$: a local alignment of $x$ and $y$; $p(x,y,\pi)$: the score of local alignment $\pi$; $\Pi(x,y)$: the set of all possible local alignments of $x$ and $y$
- The LA kernel sums over all alignments, $K_{LA}^\beta(x,y) = \sum_{\pi \in \Pi(x,y)} \exp\!\big(\beta\, p(x,y,\pi)\big)$, whereas the SW score keeps only the best one:
$$SW(x,y) = \max_{\pi \in \Pi(x,y)} p(x,y,\pi) = \lim_{\beta \to \infty} \tfrac{1}{\beta} \ln K_{LA}^\beta(x,y)$$
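The limit can be checked numerically. In this toy snippet the alignment scores are made up; it only illustrates that (1/beta) ln sum exp(beta * p) is a soft-max that tends to the best score as beta grows.

```python
import numpy as np

def log_sum_exp(v):
    """Numerically stable log(sum(exp(v)))."""
    m = np.max(v)
    return m + np.log(np.sum(np.exp(v - m)))

# Hypothetical alignment scores p(x, y, pi) for a few alignments pi.
scores = np.array([12.0, 9.5, 9.0, 4.0])

# (1/beta) * ln K_LA^beta approaches max(scores) = SW(x, y) as beta grows.
for beta in (0.1, 1.0, 10.0, 100.0):
    print(beta, log_sum_exp(beta * scores) / beta)
```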
17 Why SW cannot be a kernel
- 1. SW keeps only the best alignment, instead of summing over all alignments of x and y.
- 2. The logarithm can destroy the property of being positive definite.
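Point 2 is easy to demonstrate: K below is a valid (positive semidefinite) Gram matrix, yet its elementwise logarithm has a negative eigenvalue.

```python
import numpy as np

K = np.array([[1.0, 0.9],
              [0.9, 1.0]])
print(np.linalg.eigvalsh(K))          # [0.1, 1.9]: positive semidefinite
print(np.linalg.eigvalsh(np.log(K)))  # [-0.105, 0.105]: not PSD
```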
18 Example
Figure: example comparing the LA kernel value with the SW score.
19 SVM-pairwise
Diagram: in SVM-pairwise, each sequence is represented by its vector of SW scores against the training sequences, e.g. x -> (0.9, 0.05, 0.3, 0.2) and y -> (0.2, 0.3, 0.1, 0.01), and the kernel value is the inner product of these vectors, here 0.227. The LA kernel side of the diagram (pictured as a pair HMM) compares x and y directly, yielding 0.253 in this example.
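The SVM-pairwise kernel value in the diagram is just the inner product of the two SW-score feature vectors, which can be checked directly:

```python
import numpy as np

# Feature vectors of SW scores against the training sequences,
# taken from the diagram above.
phi_x = np.array([0.9, 0.05, 0.3, 0.2])
phi_y = np.array([0.2, 0.3, 0.1, 0.01])
print(phi_x @ phi_y)  # 0.227, the kernel value shown in the diagram
```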
20 Diagonal Dominance Issue
- K(x,x) is easily orders of magnitude larger than K(x,y) even for very similar sequences x and y, which biases the performance of the SVM.
21 Diagonal Dominance Issue
- (1) The eigen kernel LA-eig
  - a. Obtained by subtracting from the diagonal the smallest negative eigenvalue of the training Gram matrix, if there are negative eigenvalues
  - b. LA-eig is equal to $K_{LA}^\beta$ except possibly on the diagonal
- (2) The empirical kernel map LA-ekm: represent each sequence by its vector of kernel values against the training sequences and take inner products
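A minimal sketch of both fixes, assuming K is the training Gram matrix as a NumPy array; the function names are mine, not the paper's.

```python
import numpy as np

def la_eig(K):
    """LA-eig: if the Gram matrix has negative eigenvalues, subtract
    the smallest one from the diagonal; off-diagonal entries are
    unchanged and the result is positive semidefinite."""
    lam_min = np.linalg.eigvalsh(K).min()
    if lam_min < 0:
        K = K - lam_min * np.eye(K.shape[0])
    return K

def la_ekm(K):
    """LA-ekm: empirical kernel map -- each sequence is represented
    by its row of kernel values against the training set, and the
    new kernel is the inner product of those rows."""
    return K @ K.T
```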
22 Methods
- Implementation
  - The kernel $K_{LA}^\beta$, and therefore LA-eig, can be computed with complexity O(|x| |y|), using dynamic programming via a slight modification of the SW algorithm (a sketch follows this list)
  - Normalization
- Dataset
  - 4352 sequences extracted from the Astral database (www.cs.columbia.edu/compbio/svmpairwise), grouped into families and superfamilies
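A sketch of the O(|x||y|) dynamic programme, with sums of exponentiated scores replacing the maxima of SW. The recursion follows the paper's construction as I understand it; the substitution score and gap parameters are placeholders (a real run would use BLOSUM62 with the usual gap costs), and d, e are treated as positive penalties.

```python
import numpy as np

def la_kernel(x, y, s, beta=0.5, d=11.0, e=1.0):
    """LA kernel K_LA^beta(x, y): sum of exp(beta * score) over all
    local alignments, with affine gap penalty g(n) = d + e*(n-1)."""
    n, m = len(x), len(y)
    go, ge = np.exp(-beta * d), np.exp(-beta * e)  # gap open/extend factors
    # M: alignments ending with x_i aligned to y_j; X, Y: alignments
    # ending in a gap; X2, Y2: alignments that terminated earlier.
    M, X, Y, X2, Y2 = (np.zeros((n + 1, m + 1)) for _ in range(5))
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = np.exp(beta * s(x[i - 1], y[j - 1]))
            M[i, j] = sub * (1 + X[i-1, j-1] + Y[i-1, j-1] + M[i-1, j-1])
            X[i, j] = go * M[i-1, j] + ge * X[i-1, j]
            Y[i, j] = go * (M[i, j-1] + X[i, j-1]) + ge * Y[i, j-1]
            X2[i, j] = M[i-1, j] + X2[i-1, j]
            Y2[i, j] = M[i, j-1] + X2[i, j-1] + Y2[i, j-1]
    return 1 + X2[n, m] + Y2[n, m] + M[n, m]

# Toy usage with a +2/-1 match/mismatch score standing in for BLOSUM62:
score = lambda a, b: 2.0 if a == b else -1.0
print(la_kernel("VLSPADK", "VLSEADK", score, beta=0.5, d=3.0, e=1.0))
```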
23 ROC Curve
24 ROC Curve
25 Summary for the kernels