Title: Protein Classification Using Averaged Perceptron SVM
1. Protein Classification Using Averaged Perceptron SVM
CS6772 Project Presentation 12/03/2003
2. Protein Sequence Classification
- Protein: a sequence over an alphabet of 20 amino acids
- Easy to sequence proteins, difficult to obtain structure
[Figure: 3D structure]
Sequence:
VLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLSH
GSAQVKGHGKKVADALTNAVAHVDDMPNALSALSDLHAHKLRVDPVNFKL
LSHCLLVTLAAHLPAEFTPAVHASLDKFLASVSTVLTSKYR
Class: Globin family, Globin-like superfamily
Function: Oxygen transport
3. Sequence Alignment vs. Classification
- Sequence similarity through alignment
[Figure: aligned pair SGFIEEDELKLFL / SGFIEEEELKFVL on a scale from distant homology to close homology]
- Sequence classification for remote homology
[Figure: classifier]
4. Structural Hierarchy of Proteins
SCOP hierarchy: Fold, Superfamily, Family
[Figure labels: Negative Test Set, Negative Training Set, Positive Test Set, Positive Training Set]
- Remote homologs: structure and function conserved, sequence similarity low
5. Remote Homology Detection
- Discriminative supervised learning approach to protein classification
Approach: Support Vector Machines with String Kernels
C. Leslie, E. Eskin, J. Weston, and W. Noble, Mismatch String Kernels for SVM Protein Classification.
C. Leslie and R. Kuang, Fast Kernels for Inexact String Matching.
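To make the string-kernel idea concrete, here is a minimal brute-force sketch of a (k, m)-mismatch kernel in Python. It assumes the usual definition in which each k-mer of a sequence contributes a count to every k-mer within m mismatches of it, and the kernel is the inner product of the resulting count vectors. The function names and the default k=3, m=1 are illustrative assumptions, and this is not the trie-based or fast method of the cited papers.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids


def mismatch_neighborhood(kmer, m, alphabet=AMINO_ACIDS):
    """Return every k-mer within Hamming distance m of `kmer`."""
    neighbors = {kmer}
    for _ in range(m):
        step = set()
        for s in neighbors:
            for i in range(len(s)):
                for a in alphabet:
                    # Substitute one position (possibly with the same letter).
                    step.add(s[:i] + a + s[i + 1:])
        neighbors |= step
    return neighbors


def mismatch_features(seq, k, m):
    """Sparse feature vector: counts over all k-mers reachable with <= m mismatches."""
    feats = Counter()
    for i in range(len(seq) - k + 1):
        for nb in mismatch_neighborhood(seq[i:i + k], m):
            feats[nb] += 1
    return feats


def mismatch_kernel(x, y, k=3, m=1):
    """Inner product of the two sparse mismatch feature vectors."""
    fx, fy = mismatch_features(x, k, m), mismatch_features(y, k, m)
    if len(fx) > len(fy):
        fx, fy = fy, fx  # iterate over the smaller table
    return sum(count * fy[kmer] for kmer, count in fx.items())
```

The brute-force neighborhood grows quickly with k and m, which is why practical implementations avoid enumerating it for every k-mer.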
6. QP SVM Training
Sequence Training Data
>VLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLSH
GSAQVKGHGKKVADALTNAVAHVDDMPNALSALSDLHAHKLRVDPVNFKL
LSHCLLVTLAAHLPAEFTPAVHASLDKFLASVSTVLTSKYR
>TYFPHFDLSHGSAQVKGHGKKVADALTNAVAHVDDMPNALSALSDLHAHKLRVD
PVNFKLLSHCLLVTLAAHLPAEFTPAVHASLDKFLASVSTVLTSKYR
Total: n sequences, n labels
QP Solver (slow)
Learned Weights and Bias (from KKT)
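For orientation, the textbook soft-margin SVM dual that a QP solver handles is written out below, together with how the weights and bias follow from its solution. This is the standard formulation, not the slide's own formula, which did not survive extraction; the symbols \alpha_i, C, K, \phi, and the index s are assumptions introduced here.

```latex
\begin{aligned}
\max_{\alpha}\quad & \sum_{i=1}^{n} \alpha_i
  - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j K(x_i, x_j) \\
\text{s.t.}\quad & 0 \le \alpha_i \le C, \qquad \sum_{i=1}^{n} \alpha_i y_i = 0 \\[4pt]
& w = \sum_{i=1}^{n} \alpha_i y_i \, \phi(x_i), \qquad
  b = y_s - \sum_{i=1}^{n} \alpha_i y_i K(x_i, x_s)
  \quad \text{for any } s \text{ with } 0 < \alpha_s < C
\end{aligned}
```

The expression for b comes from the KKT complementary-slackness condition, which forces y_s (w \cdot \phi(x_s) + b) = 1 for any support vector whose multiplier lies strictly between 0 and C.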
7. Averaged Perceptron SVM Training
Training Algorithm
Y. Freund and R. Schapire, Large Margin Classification Using the Perceptron Algorithm.
8. Averaged Perceptron SVM Training
Sequence Training Data: n sequences, n labels (same data as in QP SVM training)
Run Perceptron Algorithm, iterating t epochs
Final Weight Vector, Voting Weights
Generalized Bound for k
s: no. of dimensions in feature space
k: no. of mistakes made during perceptron run
SCOP experiments show: for an average n of 1000, the average k is 50-60
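The training loop can be sketched as follows, as a minimal kernel-form version of the voted perceptron described by Freund and Schapire. It is an illustration under stated assumptions, not the presentation's code: the function names, the tie rule (a score of exactly zero counts as a mistake), and the linear stand-in kernel are all hypothetical choices made here.

```python
def linear_kernel(a, b):
    """Stand-in kernel for the demo; a string kernel would be used on sequences."""
    return sum(p * q for p, q in zip(a, b))


def train_voted_perceptron(X, y, kernel, epochs=1):
    """Run the kernel perceptron and record mistakes and voting weights.

    Returns (mistakes, counts):
      mistakes[j - 1] is the training index that triggered the j-th update;
      counts[j] is the voting weight of prediction vector j. It starts at 1
      when the vector is created and grows by one for each example the vector
      then classifies correctly. counts[0] belongs to the initial zero vector.
    """
    mistakes = []
    counts = [0]
    for _ in range(epochs):
        for i, (x, label) in enumerate(zip(X, y)):
            # Current prediction vector is the signed sum of all mistake examples.
            score = sum(y[j] * kernel(X[j], x) for j in mistakes)
            if label * score <= 0:
                mistakes.append(i)   # update: add y_i * phi(x_i)
                counts.append(1)
            else:
                counts[-1] += 1      # current vector survives this example
    return mistakes, counts
```

The number of recorded mistakes is the quantity k from the slide, so the learned model stores k training indices rather than one coefficient per training sequence.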
9. Averaged Perceptron SVM Classification
Testing Algorithm
Note: Only k kernel products with the unknown sequence x need to be computed.
Recurrence relation
M is the set of mistake indices
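The slide's recurrence did not survive extraction, so the sketch below assumes the standard one for the voted perceptron: the score of prediction vector j on x equals the score of vector j-1 plus one signed kernel value for the j-th mistake. Under that assumption, testing needs one kernel product per entry of M. The function names, the `counts` layout (counts[0] for the initial zero vector), and the linear stand-in kernel are hypothetical.

```python
def linear_kernel(a, b):
    """Stand-in kernel for the demo; a string kernel would be used on sequences."""
    return sum(p * q for p, q in zip(a, b))


def averaged_score(x, X, y, mistakes, counts, kernel):
    """Weighted sum of the scores of all intermediate prediction vectors on x.

    mistakes: training indices M, in the order the updates happened.
    counts:   voting weights; counts[j] belongs to the vector after j updates.
    """
    partial = 0.0  # score of the initial zero vector
    total = 0.0
    for j, idx in enumerate(mistakes, start=1):
        # Recurrence: one new kernel product per mistake index.
        partial += y[idx] * kernel(X[idx], x)
        total += counts[j] * partial
    return total


def classify(x, X, y, mistakes, counts, kernel):
    """Predict +1 or -1 from the sign of the averaged score."""
    return 1 if averaged_score(x, X, y, mistakes, counts, kernel) > 0 else -1
```

For example, with mistakes = [0, 1] and counts = [0, 2, 3], classifying one sequence costs two kernel evaluations regardless of the training set size.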
10. Implementation Details
- Built on top of the protclass (Protein Classification) platform
- Java Platform
- Classification Task
- Hash table scan instead of Mismatch Trie
- Generate mismatch mappings once using shifts
- Dynamic kernel matrix storage
- Still needs debugging
- Speed/Space Performance
- 80% reduction in space requirement
- 50% reduction in training time
- 50% reduction in testing time
- Mainly from simple online algorithm
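One way the hash-table approach could look is sketched below. The slide does not say what "shifts" refers to, so this assumes bit shifts over a 5-bit code per amino acid: each k-mer becomes an integer, its mismatch neighborhood is generated once and cached, and feature counts live in a plain hash table instead of a mismatch trie. All names here are hypothetical, and the sketch is in Python for brevity rather than the Java of the protclass platform.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
BITS = 5  # 20 letters fit in 5 bits
CODE = {a: i for i, a in enumerate(AMINO_ACIDS)}
_neighborhood_cache = {}  # (code, k, m) -> set of neighbor codes


def encode(kmer):
    """Pack a k-mer into an integer, 5 bits per residue."""
    value = 0
    for ch in kmer:
        value = (value << BITS) | CODE[ch]
    return value


def neighborhood(code, k, m):
    """Integer codes of all k-mers within Hamming distance m of `code`."""
    letter_mask = (1 << BITS) - 1
    result = {code}
    for _ in range(m):
        step = set()
        for c in result:
            for pos in range(k):
                shift = pos * BITS
                cleared = c & ~(letter_mask << shift)  # blank out one residue
                for letter in range(len(AMINO_ACIDS)):
                    step.add(cleared | (letter << shift))
        result |= step
    return result


def cached_neighborhood(code, k, m):
    """Generate each mismatch mapping once and reuse it afterwards."""
    key = (code, k, m)
    if key not in _neighborhood_cache:
        _neighborhood_cache[key] = neighborhood(code, k, m)
    return _neighborhood_cache[key]


def feature_table(seq, k, m):
    """Hash table of mismatch feature counts, built in one scan of the sequence."""
    table = {}
    window_mask = (1 << (BITS * k)) - 1
    code = 0
    for i, ch in enumerate(seq):
        # Rolling update: shift in the new residue, drop the oldest one.
        code = ((code << BITS) | CODE[ch]) & window_mask
        if i >= k - 1:
            for nb in cached_neighborhood(code, k, m):
                table[nb] = table.get(nb, 0) + 1
    return table
```

Because the cache is keyed by k-mer code, repeated k-mers across the training set pay for neighborhood generation only once.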