Title: Frequent Subsequence Protein Localization
1Frequent Sub-sequence Protein Localization
University of Alberta
O. Zaïane (1), Y. Wang (1), R. Goebel (1) and G. J. Taylor (2)
(1) Department of Computing Science, (2) Department of Biological Sciences, University of Alberta, Canada
Workshop on Data Mining for Biomedical Applications (BioDM'06), in conjunction with PAKDD, Singapore, April 2006
2The Context: The EppDB Project
[Diagram: EppDB (Extracytosolic Plant Protein DB) with Data Entry, Gel Image, Tool Collection and Protein Localization components; organism: Brassica napus (canola)]
3Introduction
- Protein: linear sequence of amino acids
- Protein subcellular localization
  - Plant: nuclear, cytoplasmic, mitochondria, extracellular, ...
- Extracellular plant proteins: nutrient acquisition, communication with soil organisms, protection from pathogens, resistance to disease and toxic metals, etc.
- Localization by biological experimental tests is too costly and time consuming → data mining
- Intracellular vs. extracellular
- Sequence information alone (no additional data)
- Class imbalance
- Model transparency desired for domain knowledge injection
4Related Work
- N-terminal sorting signals
  - Neural networks
  - TargetP [Emanuelsson et al. 2000]: accuracy 85% plants, 90% non-plants
  - SignalP [Nielsen et al. 1997]: 68% human, 83.7% E. coli, 79.3% Gram-negative, 67.9% Gram-positive, 70.2% eukaryote
- Amino acid composition
  - Neural network and support vector machines
  - NNPSL [Reinhardt et al. 1998]: 66% eukaryote (non-plants), 81% prokaryotes
  - SubLoc [Hua et al. 2001]: 91.4% prokaryotes, 79.4% eukaryote
- Lexical analysis
  - LOCKey [Nair et al. 2002]: 87% on test data
  - PA-sub [Lu 2002]: 98% overall
- Integrative approach
- Subsequence methods
  - She et al. 2002
5Predicting Extracellular Proteins: An Outline
Features
Methods
Experiments
Conclusion
Introduction
- Feature Extraction
- Methods for Classification
- Support Vector Machine
- Boosting
- Frequent Pattern Method
- Experimental Results
- Future Directions
6Feature Extraction
- Frequent subsequences (FSS): subsequences that occur in more than a certain percentage of extracellular proteins
- Strong discriminative power for FSS that are more frequent in extracellular or in intracellular proteins
- Same subsequences → may perform similar functions via biochemical mechanisms
- Capture local similarity → may relate to functional or structural information
7Generalized Suffix Tree
- There exist linear-time algorithms to construct a generalized suffix tree (GST) from sequences.
- Traversing the GST provides the frequent subsequences
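To make the notion of a frequent subsequence concrete, here is a minimal Python sketch. It is not a generalized suffix tree: it swaps in a naive enumeration of all substrings within a length window, which is far slower but returns the same kind of result. The function name, the `min_len`/`max_len` window and the toy sequences are assumptions made for illustration.

```python
from collections import defaultdict

def frequent_subsequences(proteins, min_support, min_len=3, max_len=5):
    """Return {substring: support} for contiguous substrings that occur in
    at least `min_support` (a fraction) of the given protein sequences.

    Naive stand-in for a generalized suffix tree traversal (assumption)."""
    counts = defaultdict(int)
    for seq in proteins:
        seen = set()  # count each substring at most once per protein
        for k in range(min_len, max_len + 1):
            for i in range(len(seq) - k + 1):
                seen.add(seq[i:i + k])
        for sub in seen:
            counts[sub] += 1
    n = len(proteins)
    return {s: c / n for s, c in counts.items() if c / n >= min_support}

# Toy sequences, not real proteins.
toy = ["MKTAYI", "AKTAYQ", "GGKTAL"]
result = frequent_subsequences(toy, min_support=0.6, min_len=3, max_len=4)
```

With these toy inputs, "KTA" occurs in all three sequences, while "TAY" and "KTAY" occur in two of the three.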
8Support Vector Machine
- Input data represented as feature vectors
- Find a linear separator that separates the data and maximizes the margin
- Kernel function → nonlinear separator
9SVM for extracellular protein prediction
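- Data transformation (sequence → vector)
  - Frequent subsequences as features
  - Transform protein sequences into binary vectors $X = (x_1, x_2, x_3, \ldots, x_n)$, $x_i \in \{0, 1\}$
- Kernel functions (map vectors to a higher-dimensional space)
  - Linear kernel: $K(\mathbf{x}_i, \mathbf{x}) = \mathbf{x}_i \cdot \mathbf{x}$
  - Polynomial kernel: $K(\mathbf{x}_i, \mathbf{x}) = (\mathbf{x}_i \cdot \mathbf{x} + 1)^d$
  - Radial basis function kernel: $K(\mathbf{x}_i, \mathbf{x}) = \exp(-\gamma \|\mathbf{x}_i - \mathbf{x}\|^2)$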
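The transformation and the three kernels can be sketched in plain Python as follows. The feature list, the toy sequences and the default values of `d` and `gamma` are assumptions chosen only for illustration; a real experiment would use an SVM library rather than hand-written kernels.

```python
import math

def to_binary_vector(sequence, features):
    """x_i = 1 if the i-th frequent subsequence occurs in the protein, else 0."""
    return [1 if f in sequence else 0 for f in features]

def linear_kernel(u, v):
    """K(u, v) = u . v"""
    return sum(a * b for a, b in zip(u, v))

def polynomial_kernel(u, v, d=2):
    """K(u, v) = (u . v + 1)^d"""
    return (linear_kernel(u, v) + 1) ** d

def rbf_kernel(u, v, gamma=0.5):
    """K(u, v) = exp(-gamma * ||u - v||^2)"""
    squared_distance = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-gamma * squared_distance)

# Hypothetical frequent subsequences and toy sequences.
features = ["KTA", "TAY", "GGK"]
u = to_binary_vector("MKTAYI", features)  # [1, 1, 0]
v = to_binary_vector("GGKTAL", features)  # [1, 0, 1]
```

Here the two vectors share one feature, so the linear kernel is 1, the degree-2 polynomial kernel is 4, and the squared distance used by the RBF kernel is 2.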
10Boosting
- Iterative algorithms to improve a weak classifier
- Different weighted distribution of examples in each iteration
- Increase the weights of incorrectly classified examples, and decrease the weights of correctly classified ones
- We use AdaBoost [Schapire et al., ML 99]
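The reweighting idea can be illustrated with a small discrete AdaBoost sketch. The weak learner here is assumed to be a one-feature decision stump over binary vectors; the slides do not say which weak learner was used, so the stump, the function names and the toy data are all illustrative assumptions.

```python
import math

def train_adaboost(X, y, rounds=10):
    """X: binary feature vectors; y: labels in {-1, +1}.
    Weak learner (assumption): a stump on a single binary feature."""
    n, m = len(X), len(X[0])
    w = [1.0 / n] * n  # start from a uniform distribution over examples
    ensemble = []
    for _ in range(rounds):
        best = None
        for j in range(m):
            for sign in (1, -1):
                preds = [sign if x[j] else -sign for x in X]
                err = sum(wi for wi, p, yi in zip(w, preds, y) if p != yi)
                if best is None or err < best[0]:
                    best = (err, j, sign, preds)
        err, j, sign, preds = best
        if err >= 0.5:  # no stump better than chance: stop
            break
        err = max(err, 1e-10)  # avoid division by zero for a perfect stump
        alpha = 0.5 * math.log((1 - err) / err)
        ensemble.append((alpha, j, sign))
        # Misclassified examples (y * p = -1) gain weight, correct ones lose it.
        w = [wi * math.exp(-alpha * yi * p) for wi, yi, p in zip(w, y, preds)]
        total = sum(w)
        w = [wi / total for wi in w]
    return ensemble

def predict(ensemble, x):
    score = sum(a * (s if x[j] else -s) for a, j, s in ensemble)
    return 1 if score >= 0 else -1

# Toy data: feature 0 determines the label.
X = [[1, 0], [1, 1], [0, 1], [0, 0]]
y = [1, 1, -1, -1]
model = train_adaboost(X, y, rounds=3)
```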
11Frequent Pattern Method
- Frequent pattern: X1*X2*...*Xn → extracellular
- X1, X2, ..., Xn are frequent subsequences
- * can be substituted by zero or up to MaxGap amino acids when matching a protein sequence
- She et al. had a similar idea for outer-membrane proteins but do not use MaxGap.
- Admittedly, there is no consideration of folding in the 3D structure of the protein when using MaxGap.
12MaxGap
- Pattern X1*X2 in S1 and S2
  - If X1 and X2 are close in S1 and far apart in S2
  - The match in S1 is more likely to be biologically significant (when 3D structure is not considered)
- We do not consider a match when the gap (*), in terms of amino acids between two adjacent subsequences, is greater than MaxGap
- ABC*DEF does not match ABCMNOPQDEF but matches ABCMNODEF when MaxGap is 3.
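A possible way to check a gapped pattern against a sequence is sketched below. The pattern is assumed to be given as a non-empty list of subsequences (the parts between the `*` symbols); the function name and this representation are illustrative, not taken from the original work.

```python
def matches(pattern_parts, sequence, max_gap):
    """True if the parts occur in order in `sequence`, with at most
    `max_gap` residues between each pair of consecutive parts."""
    def search(part_idx, start, limit):
        part = pattern_parts[part_idx]
        pos = sequence.find(part, start)
        # `limit` is the latest start position allowed for this part.
        while pos != -1 and (limit is None or pos <= limit):
            end = pos + len(part)
            if part_idx == len(pattern_parts) - 1:
                return True
            if search(part_idx + 1, end, end + max_gap):
                return True
            pos = sequence.find(part, pos + 1)  # try a later occurrence
        return False
    return search(0, 0, None)

no_match = matches(["ABC", "DEF"], "ABCMNOPQDEF", max_gap=3)  # gap of 5
match = matches(["ABC", "DEF"], "ABCMNODEF", max_gap=3)       # gap of 3
```

The first part may start anywhere in the sequence; only the distance between adjacent parts is constrained, which mirrors the ABC/DEF example on the slide.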
13Greedy Algorithm
- She et al. use an exhaustive search to build frequent patterns by concatenating patterns
- We use a greedy algorithm: we search for the current best rule and reduce the weights of the positive examples covered by the rule, until the total weight is less than a threshold.
- The best rule is found using the Z-number, which indicates how well a rule discriminates examples of class C
- Z-number, where $S_R$ is the support of rule $R$ and $a_C = S_C / S$ is the mean of class $C$
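The greedy covering loop can be sketched as follows. The rule-quality function is passed in as a parameter; the `weighted_coverage` function used in the example is only a stand-in and is not the Z-number. The names `decay` and `threshold`, the rule identifiers and the toy coverage sets are assumptions for illustration.

```python
def greedy_rule_selection(coverage, positives, score, decay=0.5, threshold=2.0):
    """coverage: {rule: set of positive example ids the rule covers}.
    score(covered, weights) -> float, higher is better (stand-in for Z-number).
    Repeatedly pick the best rule and down-weight the positives it covers,
    until the total weight drops below `threshold`."""
    weights = {p: 1.0 for p in positives}
    remaining = dict(coverage)
    selected = []
    while remaining and sum(weights.values()) >= threshold:
        best = max(remaining, key=lambda r: score(remaining[r], weights))
        selected.append(best)
        for p in remaining.pop(best):
            if p in weights:
                weights[p] *= decay  # covered positives matter less next round
    return selected

def weighted_coverage(covered, weights):
    """Illustrative quality measure: total current weight of covered positives."""
    return sum(weights.get(p, 0.0) for p in covered)

# Hypothetical rules r1..r3 over four positive examples.
coverage = {"r1": {1, 2, 3}, "r2": {3, 4}, "r3": {4}}
rules = greedy_rule_selection(coverage, {1, 2, 3, 4}, weighted_coverage,
                              decay=0.5, threshold=2.0)
```

In this toy run, r1 is chosen first (total weight falls from 4 to 2.5), then r2 (total weight 1.75), and the loop stops before r3 is considered.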
14Experiments
- Hypothesis: frequent subsequences of amino acids have better discriminant power than amino acid composition for plant proteins
- Dataset (Proteome Analysis Project at UofA)
  - Plant: 3293 proteins, 171 extracellular
- Five-fold cross-validation
15Evaluation Matrix
- Overall accuracy is not a sufficient measure when classes are imbalanced
- F-measure
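A short sketch shows why accuracy can mislead on this dataset. It assumes the balanced F-measure (beta = 1), since the slide does not state which weighting was used, and it evaluates a hypothetical classifier that labels every one of the 3293 proteins as intracellular.

```python
def accuracy(tp, fp, fn, tn):
    return (tp + tn) / (tp + fp + fn + tn)

def f_measure(tp, fp, fn, beta=1.0):
    """F-measure for the positive (extracellular) class.
    beta = 1 is an assumption; it gives the harmonic mean of precision and recall."""
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# Hypothetical trivial classifier: predict "intracellular" for all proteins.
# 171 extracellular proteins are missed, 3293 - 171 = 3122 are correct.
trivial_accuracy = accuracy(tp=0, fp=0, fn=171, tn=3122)
trivial_f = f_measure(tp=0, fp=0, fn=171)
```

The trivial classifier reaches an accuracy of about 0.948 while its F-measure on the extracellular class is 0, which is why the F-measure is the more informative metric here.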
16Result(SVM with subsequence)
17Result(Boosting with subsequence)
18Result(Frequent Pattern)
Parameters:
- MinLen = 3: smallest size of a pattern
- Min_gain = 0.1: minimum Z-number
- Threshold on total weight
- Rate of weight decrease
- MinSup = 5%: minimum support of a rule
- MinConf = 80%: minimum confidence of a rule
- MaxGap = 300: maximum gap between consecutive subsequences
19Result(SVM with composition)
20Result(Boosting with composition)
21Cross Comparison
22SVM with combined features
Boosting with combined features
23Effects of MinLen on SVM
Effects of MinLen on boosting
24Conclusion
- Presented three methods for identifying extracellular proteins based on frequent subsequences of amino acids
- SVM achieves the best result
- FSP method provides easily interpretable rules
25Future Work
- Use more information about proteins (e.g., structure, function, ...)
- Integrate amino acid composition into the FSP method
- Use of an associative classifier
- Incorporate more biological knowledge
- Use the spatial location of the subsequence in the protein: beginning, end or middle of the protein, etc. (divide the protein into percentiles)
- Contrast set mining to discriminate classes
26AdaBoost
27FOIL algorithm