Frequent Subsequence Protein Localization

1
Frequent Sub-sequence Protein Localization
University of Alberta
  • O. Zaïane (1), Y. Wang (1), R. Goebel (1) and G. J. Taylor (2)

(1) Department of Computing Science, (2) Department of Biological Sciences
University of Alberta, Canada
Workshop on Data Mining for Biomedical Applications (BioDM'06), in conjunction with PAKDD
Singapore, April 2006
2
The Context: The EppDB Project
Data Entry
EppDB: Extracytosolic Plant Protein DB
Gel Image
Tool Collection Protein Localization
Brassica napus (Canola)
3
Introduction
  • Protein: linear sequence of amino acids
  • Protein subcellular localization
  • Plant: nuclear, cytoplasmic, mitochondrial,
    extracellular, etc.
  • Extracellular plant proteins: nutrient
    acquisition, communication with soil organisms,
    protection from pathogens, resistance to disease
    and toxic metals, etc.
  • Localization by biological experimental tests is
    too costly and time consuming → data mining
  • Intracellular vs. extracellular
  • Sequence information alone (no additional data)
  • Class imbalance
  • Model transparency desired for domain knowledge
    injection

4
Related Work
  • N-terminal sorting signals
  • Neural networks
  • TargetP [Emanuelsson et al. 2000]: accuracy 85%
    plants, 90% non-plants
  • SignalP [Nielsen et al. 1997]: 68% human, 83.7%
    E. coli, 79.3% Gram-negative, 67.9% Gram-positive,
    70.2% eukaryote
  • Amino acid composition
  • Neural networks and support vector machines
  • NNPSL [Reinhardt et al. 1998]: 66% eukaryote
    (non-plant), 81% prokaryote
  • SubLoc [Hua et al. 2001]: 91.4% prokaryote,
    79.4% eukaryote
  • Lexical analysis
  • LOCKey [Nair et al. 2002]: 87% on test data
  • PA-sub [Lu 2002]: 98% overall
  • Integrative approach
  • Subsequence methods
  • She et al. 2002


5
Predicting Extracellular Proteins: An Outline
Features
Methods
Experiments
Conclusion
Introduction
  • Feature Extraction
  • Methods for Classification
  • Support Vector Machine
  • Boosting
  • Frequent Pattern Method
  • Experimental Results
  • Future Directions

6
Feature Extraction
  • Frequent subsequences (FSS): subsequences that occur in
    more than a certain percentage of extracellular
    proteins
  • Strong discriminative power for FSS that are more
    frequent in either extracellular or intracellular proteins
  • Same subsequences → may perform similar functions
    via biochemical mechanisms
  • Capture local similarity → may relate to
    functional or structural information

7
Generalized Suffix Tree
  • There exist linear-time algorithms to construct a
    generalized suffix tree (GST) from a set of sequences.
  • Traversing the GST provides the frequent
    subsequences
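
As a rough illustration of what the GST traversal yields, the sketch below enumerates frequent subsequences by brute force rather than by building a suffix tree. The function name, the `min_len`/`max_len` bounds and the toy sequences are assumptions made for illustration, not the authors' implementation; a GST would avoid the repeated substring enumeration done here.

```python
from collections import Counter

def frequent_subsequences(sequences, min_support, min_len=3, max_len=6):
    """Return contiguous substrings that occur in at least `min_support`
    (a fraction between 0 and 1) of the sequences. Brute force, toy scale only."""
    counts = Counter()
    for seq in sequences:
        seen = set()
        for length in range(min_len, max_len + 1):
            for start in range(len(seq) - length + 1):
                seen.add(seq[start:start + length])
        counts.update(seen)  # a substring counts once per sequence
    needed = min_support * len(sequences)
    return {s for s, c in counts.items() if c >= needed}

# Hypothetical toy sequences, not taken from the dataset.
toy = ["MKTLLA", "AKTLLG", "MKTAAA"]
frequent = sorted(frequent_subsequences(toy, min_support=0.6, max_len=4))
print(frequent)
```

Support is counted per sequence rather than per occurrence, which matches the slide's definition of "occurring in more than a certain percentage of proteins".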

8
Support Vector Machine
  • Input data represented as feature vectors
  • Find a linear separator that separates the data
    and maximizes the margin
  • Kernel function: nonlinear separator

9
SVM for extracellular protein prediction
  • Data transformation (sequence → vector)
  • Frequent subsequences as features
  • Transform protein sequences into binary vectors
    $X = (x_1, x_2, x_3, \ldots, x_n)$, $x_i \in \{0, 1\}$
  • Kernel functions (map vectors to a higher-dimensional space)
  • Linear kernel: $K(x_i, x) = x_i \cdot x$
  • Polynomial kernel: $K(x_i, x) = (x_i \cdot x + 1)^d$
  • Radial basis function kernel: $K(x_i, x) = \exp(-\gamma \|x_i - x\|^2)$
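
The transformation and the three kernels can be sketched in a few lines of plain Python. The helper names, the default values of `d` and `gamma`, and the toy feature list are assumptions for illustration; the kernel formulas are the standard textbook forms.

```python
import math

def to_binary_vector(sequence, features):
    """x_i = 1 if the i-th frequent subsequence occurs in the protein, else 0."""
    return [1 if f in sequence else 0 for f in features]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def linear_kernel(u, v):
    return dot(u, v)

def polynomial_kernel(u, v, d=2):
    return (dot(u, v) + 1) ** d

def rbf_kernel(u, v, gamma=0.5):
    squared_distance = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-gamma * squared_distance)

# Hypothetical feature set and sequences.
features = ["KTL", "MKT", "AAA"]
x1 = to_binary_vector("MKTLLA", features)  # contains KTL and MKT
x2 = to_binary_vector("MKTAAA", features)  # contains MKT and AAA
print(linear_kernel(x1, x2), polynomial_kernel(x1, x2), rbf_kernel(x1, x2))
```

Because the vectors are binary, the linear kernel simply counts the frequent subsequences that two proteins share.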

10
Boosting
  • Iterative algorithm to improve a weak classifier
  • Different weighted distribution of examples in
    each iteration
  • Increase the weights of incorrectly classified
    examples, and decrease the weights of correctly
    classified ones
  • We use AdaBoost [Schapire et al., ML 99]
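
The reweighting step can be sketched as follows. This is the textbook discrete AdaBoost update for labels in {-1, +1}, and it may differ in detail from the variant used in the experiments; the function name and the toy data are assumptions.

```python
import math

def adaboost_round(weights, labels, predictions):
    """One reweighting step. `weights` are assumed to sum to 1;
    labels and predictions take values in {-1, +1}."""
    error = sum(w for w, y, p in zip(weights, labels, predictions) if y != p)
    error = min(max(error, 1e-10), 1 - 1e-10)    # guard against log(0)
    alpha = 0.5 * math.log((1 - error) / error)  # vote of this weak classifier
    updated = [w * math.exp(-alpha * y * p)      # shrink if correct, grow if wrong
               for w, y, p in zip(weights, labels, predictions)]
    total = sum(updated)
    return alpha, [w / total for w in updated]

# Four examples, the last one misclassified by the weak classifier.
alpha, new_weights = adaboost_round([0.25] * 4, [1, 1, -1, -1], [1, 1, -1, 1])
print(alpha, new_weights)
```

After normalization the misclassified examples together carry half of the total weight, which forces the next weak classifier to concentrate on them.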

11
Frequent Pattern Method
  • Frequent pattern: X1*X2*...*Xn → extracellular
  • X1, X2, ..., Xn are frequent subsequences
  • * can be substituted by zero or up to MaxGap
    amino acids when matching a protein sequence
  • She et al. had a similar idea for outer-membrane
    proteins but did not use MaxGap.
  • Admittedly, no consideration is given to folding in
    the 3D structure of the protein when using MaxGap.

12
MaxGap
  • Pattern X1*X2 in S1 and S2
  • If X1 and X2 are close in S1 and far apart in S2,
    the match in S1 is more likely to be
    biologically significant (when 3D structure is not
    considered)
  • We do not consider a match when the gap, in terms of
    amino acids between two adjacent subsequences, is
    greater than MaxGap
  • ABC*DEF does not match ABCMNOPQDEF but
    matches ABCMNODEF when MaxGap is 3.
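
A minimal sketch of MaxGap matching, using the slide's own example (the parts ABC and DEF against ABCMNOPQDEF and ABCMNODEF with MaxGap 3). The function name and the representation of a pattern as a list of parts are assumptions for illustration.

```python
def matches_with_gap(parts, sequence, max_gap):
    """True if the parts occur in order in `sequence`, with at most
    `max_gap` amino acids between each pair of adjacent parts."""
    def extend(index, start_min, start_max):
        if index == len(parts):
            return True
        part = parts[index]
        last_start = min(start_max, len(sequence) - len(part))
        for start in range(start_min, last_start + 1):
            if sequence.startswith(part, start):
                end = start + len(part)
                # The next part must begin within max_gap residues of `end`.
                if extend(index + 1, end, end + max_gap):
                    return True
        return False
    # The first part may begin anywhere in the sequence.
    return extend(0, 0, len(sequence))

print(matches_with_gap(["ABC", "DEF"], "ABCMNODEF", max_gap=3))    # gap of 3
print(matches_with_gap(["ABC", "DEF"], "ABCMNOPQDEF", max_gap=3))  # gap of 5
```

The search backtracks over every occurrence of each part, so a later occurrence of X1 can still produce a match when an earlier one leaves too large a gap.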

13
Greedy Algorithm
  • She et al. use an exhaustive search to build
    frequent patterns by concatenating patterns
  • We use a greedy algorithm: we search for the current
    best rule and reduce the weights of the positive examples
    covered by the rule, until the total weight is less
    than a threshold.
  • The best rule is found using the Z-Number, which
    indicates how well a rule discriminates examples
    of class C
  • Z-Number, where S_R is the support of rule R and
    a_C = S_C / S is the mean of class C
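
The greedy loop can be sketched as below. The Z-Number formula itself is not preserved in this transcript, so the sketch assumes the form $z = \sqrt{S_R}\,(a_R - a_C) / \sqrt{a_C (1 - a_C)}$, with $a_R$ the confidence of rule R; this form, the use of example weights inside the score, the parameter names and the toy rules are all assumptions rather than the authors' definitions.

```python
import math

def z_number(covered, positives, weights):
    """Assumed weighted z-number of a rule covering the indices in `covered`."""
    total = sum(weights)
    s_r = sum(weights[i] for i in covered)                # weighted support S_R
    if s_r == 0 or total == 0:
        return 0.0
    a_r = sum(weights[i] for i in covered if i in positives) / s_r
    a_c = sum(weights[i] for i in positives) / total      # class mean S_C / S
    if a_c <= 0 or a_c >= 1:
        return 0.0
    return math.sqrt(s_r) * (a_r - a_c) / math.sqrt(a_c * (1 - a_c))

def greedy_rules(rules, positives, n_examples, decay=0.5, threshold=0.5, min_z=0.1):
    """Pick the best rule repeatedly, down-weighting the positives it covers."""
    weights = [1.0] * n_examples
    remaining = dict(rules)  # rule name -> set of covered example indices
    chosen = []
    while remaining and sum(weights[i] for i in positives) > threshold:
        best = max(remaining, key=lambda r: z_number(remaining[r], positives, weights))
        if z_number(remaining[best], positives, weights) < min_z:
            break  # no remaining rule discriminates well enough
        chosen.append(best)
        for i in remaining.pop(best) & positives:
            weights[i] *= decay
    return chosen

# Hypothetical rules over 10 examples, of which 0..3 are extracellular.
rules = {"A": {0, 1}, "B": {2, 3, 4}, "C": {5, 6, 7}}
selected = greedy_rules(rules, positives={0, 1, 2, 3}, n_examples=10)
print(selected)
```

Each selected rule is removed from the candidate set, so the loop stops after at most one pass per rule even if the weight threshold is never reached.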
14
Experiments
  • Hypothesis: frequent subsequences of amino acids
    have better discriminant power than amino acid
    composition for plant proteins
  • Dataset (Proteome Analysis Project at UofA)
  • Plant: 3293 proteins, 171 extracellular
  • Five-fold cross-validation

15
Evaluation Matrix
  • Overall accuracy is not a good enough measure under class imbalance
  • F-measure
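
A short worked example shows why accuracy misleads on this dataset. It assumes the usual F1 definition (the harmonic mean of precision and recall), since the slide does not say which variant of the F-measure was used; the function name is also an assumption.

```python
def f_measure(tp, fp, fn):
    """F1 score from true positives, false positives and false negatives."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

total, extracellular = 3293, 171  # dataset sizes given on the Experiments slide

# A trivial classifier that labels every protein as intracellular.
accuracy = (total - extracellular) / total
f_score = f_measure(tp=0, fp=0, fn=extracellular)
print(f"accuracy = {accuracy:.3f}, F-measure = {f_score:.3f}")
```

The trivial classifier reaches roughly 94.8% accuracy without finding a single extracellular protein, while its F-measure is 0.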

16
Result (SVM with subsequence)
17
Result (Boosting with subsequence)
18
Result (Frequent Pattern)
  • MinLen = 3: smallest size of a pattern
  • Min_gain = 0.1: minimum Z-Number
  • Threshold: total weight
  • Rate of weight decrease
  • MinSup = 5%: minimum support of a rule
  • MinConf = 80%: minimum confidence of a rule
  • MaxGap = 300: maximum gap between consecutive subsequences
19
Result (SVM with composition)
20
Result (Boosting with composition)
21
Cross Comparison
22
SVM with combined features
Boosting with combined features
23
Effects of MinLen on SVM
Effects of MinLen on boosting
24
Conclusion
  • Presented three methods for identifying
    extracellular proteins based on frequent
    subsequences of amino acids
  • SVM achieves the best result
  • FSP method provides easily interpretable rules

25
Future Work
  • Use more information about proteins (e.g.,
    structure, function, ...)
  • Integrate amino acid composition into the FSP
    method
  • Use of an associative classifier
  • Incorporate more biological knowledge
  • Use the spatial location of the subsequence in the
    protein (beginning, middle or end of the protein,
    etc.): divide the protein into percentiles
  • Contrast set mining to discriminate classes

26
AdaBoost
27
FOIL algorithm