Frequent Subsequence Protein Localization - PowerPoint PPT Presentation

Transcript and Presenter's Notes

Title: Frequent Subsequence Protein Localization

1
Frequent Sub-sequence Protein Localization
O. Zaïane (1), Y. Wang (1), R. Goebel (1) and G. J. Taylor (2)
(1) Department of Computing Science, (2) Department of Biological Sciences
University of Alberta, Canada
Workshop on Data Mining for Biomedical Applications (BioDM'06), in conjunction with PAKDD
Singapore, April 2006
2
The Context: The EppDB Project
Data Entry
EppDB: Extracytosolic Plant Protein DB
Gel Image
Tool Collection Protein Localization
Brassica napus (canola)
3
Introduction

Protein: linear sequence of amino acids
Protein subcellular localization
Plant: nuclear, cytoplasmic, mitochondrial, extracellular, ...
Extracellular plant proteins: nutrient acquisition, communication with soil organisms, protection from pathogens, resistance to disease and toxic metals, etc.
Localization by biological experimental tests is too costly and time-consuming → data mining
Intracellular vs. extracellular
Sequence information alone (no additional data)
Class imbalance
Model transparency desired for domain knowledge injection

4
Related Work

N-terminal sorting signals
Neural networks
TargetP [Emanuelsson et al. 2000]: accuracy 85% (plants), 90% (non-plants)
SignalP [Nielsen et al. 1997]: 68% (human), 83.7% (E. coli), 79.3% (Gram-), 67.9% (Gram+), 70.2% (eukaryote)
Amino acid composition
Neural networks and support vector machines
NNPSL [Reinhardt et al. 1998]: 66% (eukaryotes, non-plant), 81% (prokaryotes)
SubLoc [Hua et al. 2001]: 91.4% (prokaryotes), 79.4% (eukaryotes)
Lexical analysis
LOCKey [Nair et al. 2002]: 87% on test data
PA-sub [Lu 2002]: 98% overall
Integrative approach
Subsequence methods
She et al. 2002

5
Predicting Extracellular Proteins: An Outline
Features
Methods
Experiments
Conclusion
Introduction

Feature Extraction
Methods for Classification
Support Vector Machine
Boosting
Frequent Pattern Method
Experimental Results
Future Directions

6
Feature Extraction

Frequent subsequences (FSS): subsequences that occur in more than a certain percentage of extracellular proteins
Strong discriminative power for FSS that are more frequent in either extracellular or intracellular proteins
Same subsequences → may perform similar functions via biochemical mechanisms
Capture local similarity → may relate to functional or structural information
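The definition above can be illustrated with a minimal brute-force sketch. It assumes that a "subsequence" is a contiguous run of amino acids (consistent with the suffix-tree slide) and that support is the fraction of sequences containing it. The function name, the `max_len` cap, and the use of "at least" rather than "more than" for the threshold are illustrative assumptions, not details from the slides.

```python
from collections import Counter


def frequent_subsequences(sequences, min_support, min_len=3, max_len=6):
    """Return {substring: support} for contiguous substrings whose support
    (fraction of sequences containing them) is at least min_support.

    max_len is a hypothetical cap that keeps this brute-force sketch small.
    """
    counts = Counter()
    for seq in sequences:
        seen = set()  # count each substring at most once per sequence
        for length in range(min_len, max_len + 1):
            for start in range(len(seq) - length + 1):
                seen.add(seq[start:start + length])
        counts.update(seen)
    n = len(sequences)
    return {s: c / n for s, c in counts.items() if c / n >= min_support}


# Toy sequences (made up for illustration, not real proteins).
toy = ["MKTLLA", "AKTLLG", "MKTAAA"]
frequent = frequent_subsequences(toy, min_support=0.6, min_len=3, max_len=4)
```

Running the same routine separately on extracellular and intracellular proteins and comparing supports would be one way to look for discriminative subsequences.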

7
Generalized Suffix Tree

Linear-time algorithms exist to construct a generalized suffix tree (GST) from a set of sequences.
Traversing the GST yields the frequent subsequences.
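As a rough illustration of the traversal idea, the sketch below builds a naive generalized suffix trie rather than a true linear-time suffix tree: construction here is quadratic in sequence length, so it is only suitable for small examples. The node layout (a dict with `children` and `ids`) and the function names are assumptions made for this sketch.

```python
def build_suffix_trie(sequences):
    """Naive generalized suffix trie: every node records the ids of the
    sequences that contain the substring spelled out from the root."""
    root = {"children": {}, "ids": set()}
    for seq_id, seq in enumerate(sequences):
        for start in range(len(seq)):
            node = root
            for ch in seq[start:]:
                node = node["children"].setdefault(
                    ch, {"children": {}, "ids": set()})
                node["ids"].add(seq_id)
    return root


def frequent_from_trie(root, n_sequences, min_support, min_len=1):
    """Depth-first traversal collecting substrings with enough support."""
    result = {}
    stack = [(root, "")]
    while stack:
        node, prefix = stack.pop()
        for ch, child in node["children"].items():
            support = len(child["ids"]) / n_sequences
            # Extending a substring can only lower its support,
            # so infrequent branches are pruned here.
            if support >= min_support:
                word = prefix + ch
                if len(word) >= min_len:
                    result[word] = support
                stack.append((child, word))
    return result


trie = build_suffix_trie(["ABCD", "XBCY"])
shared = frequent_from_trie(trie, n_sequences=2, min_support=1.0, min_len=2)
```

A production implementation would more likely use a compressed suffix tree built with a linear-time algorithm; the trie is used here only because it is short enough to read.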

8
Support Vector Machine

Input data represented as feature vectors
Find a linear separator that separates the data and maximizes the margin
Kernel function: nonlinear separator

9
SVM for extracellular protein prediction

Data transformation (sequence → vector)
Frequent subsequences as features
Transform protein sequences into binary vectors
$X = (x_1, x_2, x_3, \ldots, x_n)$, $x_i \in \{0, 1\}$
Kernel functions (map vectors to a higher-dimensional space)
Linear kernel: $K(x_i, x) = x_i \cdot x$
Polynomial kernel: $K(x_i, x) = (x_i \cdot x + 1)^d$
Radial basis function kernel: $K(x_i, x) = \exp(-\gamma \|x_i - x\|^2)$
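The transformation and the three kernels can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the feature list and sequences are made up, the default values of `d` and `gamma` are arbitrary, and an actual experiment would normally rely on an SVM library rather than hand-written kernels.

```python
import math


def to_binary_vector(sequence, features):
    """x_i = 1 if the i-th frequent subsequence occurs in the sequence."""
    return [1 if f in sequence else 0 for f in features]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def linear_kernel(u, v):
    return dot(u, v)


def polynomial_kernel(u, v, d=2):
    return (dot(u, v) + 1) ** d


def rbf_kernel(u, v, gamma=0.5):
    squared_distance = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-gamma * squared_distance)


# Hypothetical features and sequences, for illustration only.
features = ["KTL", "LLA", "AAA"]
u = to_binary_vector("MKTLLA", features)
v = to_binary_vector("MKTAAA", features)
```

Because the vectors are binary, the linear kernel simply counts the frequent subsequences two proteins share.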

10
Boosting

Iterative algorithm to improve a weak classifier
Different weighted distribution of examples in each iteration
Increase the weights of incorrectly classified examples, and decrease the weights of correctly classified ones
We use AdaBoost [Schapire et al., ML 99]
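The reweighting step described above can be sketched as one round of the classic discrete AdaBoost update. This is an assumption about the variant: the cited work may use a different formulation, and the function name and the clipping constant are illustrative only. Labels and predictions are taken to be in {-1, +1}, and the weights are assumed to sum to 1.

```python
import math


def adaboost_round(weights, labels, predictions):
    """One reweighting round of discrete AdaBoost.

    Returns (alpha, new_weights), where alpha is the vote of the weak
    classifier and new_weights is the renormalized example distribution.
    """
    error = sum(w for w, y, h in zip(weights, labels, predictions) if y != h)
    error = min(max(error, 1e-10), 1 - 1e-10)  # avoid log(0) and division by 0
    alpha = 0.5 * math.log((1 - error) / error)
    # Misclassified examples (y * h == -1) are scaled up, correct ones down.
    updated = [w * math.exp(-alpha * y * h)
               for w, y, h in zip(weights, labels, predictions)]
    total = sum(updated)
    return alpha, [w / total for w in updated]


alpha, new_weights = adaboost_round(
    weights=[0.25, 0.25, 0.25, 0.25],
    labels=[1, 1, -1, -1],
    predictions=[1, 1, -1, 1],  # the last example is misclassified
)
```

After the update, the single misclassified example carries half of the total weight, which forces the next weak classifier to pay attention to it.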

11
Frequent Pattern Method

Frequent pattern: X1 X2 ... Xn → extracellular
X1, X2, ..., Xn are frequent subsequences
The gap between consecutive subsequences can hold zero up to MaxGap amino acids when matching a protein sequence
She et al. had a similar idea for outer-membrane proteins but did not use MaxGap.
Admittedly, using MaxGap takes no account of the folding of the protein's 3D structure.

12
MaxGap

Pattern X1 X2 occurs in S1 and S2
If X1 and X2 are close together in S1 and far apart in S2, the match in S1 is more likely to be biologically significant (when 3D structure is not considered)
We do not consider a match when the gap, in number of amino acids between two adjacent subsequences, is greater than MaxGap
With MaxGap = 3, the pattern ABC DEF does not match ABCMNOPQDEF (gap of 5) but matches ABCMNODEF (gap of 3).
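The MaxGap matching rule can be sketched as a small backtracking search. The function name and the representation of a pattern as a list of subsequences are assumptions for this sketch; the two test sequences are the ones given on the slide.

```python
def matches(parts, sequence, max_gap):
    """True if the subsequences in `parts` occur in order in `sequence`
    with at most `max_gap` amino acids between consecutive parts."""

    def search(index, lo, hi):
        # Part number `index` must start at a position within [lo, hi].
        if index == len(parts):
            return True
        part = parts[index]
        pos = sequence.find(part, lo)
        while pos != -1 and pos <= hi:
            end = pos + len(part)
            if search(index + 1, end, end + max_gap):
                return True
            pos = sequence.find(part, pos + 1)  # try a later occurrence
        return False

    # The first part may start anywhere in the sequence.
    return search(0, 0, len(sequence))


too_far = matches(["ABC", "DEF"], "ABCMNOPQDEF", max_gap=3)
close_enough = matches(["ABC", "DEF"], "ABCMNODEF", max_gap=3)
```

Backtracking over later occurrences matters when a subsequence repeats: the first occurrence of X1 may be too far from X2 while a later one is within MaxGap.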

13
Greedy Algorithm

She et al. use an exhaustive search to build frequent patterns by concatenating patterns
We use a greedy algorithm: we search for the current best rule and reduce the weights of the positive examples covered by that rule, until the total weight is less than a threshold.
The best rule is found using the Z-Number, which indicates how well a rule discriminates examples of class C
Z-Number notation: $S_R$ = support of rule R; $a_C$ = mean of class C, $a_C = S_C / S$
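The greedy weighted-covering loop can be sketched as follows. The Z-Number formula is not available in this transcript, so the sketch takes the scoring function as a parameter and the example uses a simple stand-in (weighted number of covered positives). All names, default values and the toy data are assumptions for illustration.

```python
def greedy_rules(candidates, positives, covers, score,
                 weight_decay=0.5, weight_threshold=0.5, min_score=0.1):
    """Greedy weighted covering.

    candidates: candidate rules (any hashable objects)
    covers(rule, example): True if the rule matches the example
    score(rule, weights): rule quality; a stand-in for the Z-Number
    """
    weights = {p: 1.0 for p in positives}
    remaining = list(candidates)
    rules = []
    while remaining and sum(weights.values()) >= weight_threshold:
        best = max(remaining, key=lambda r: score(r, weights))
        if score(best, weights) < min_score:
            break  # no remaining rule is good enough
        rules.append(best)
        remaining.remove(best)
        # Down-weight the positive examples the chosen rule already covers.
        for p in positives:
            if covers(best, p):
                weights[p] *= weight_decay
    return rules


def substring_covers(rule, example):
    return rule in example


def weighted_coverage(rule, weights):
    # Stand-in score only; NOT the Z-Number used in the presentation.
    return sum(w for p, w in weights.items() if rule in p)


selected = greedy_rules(
    candidates=["AA", "CC", "X"],
    positives=["AAX", "AAY", "CCZ"],
    covers=substring_covers,
    score=weighted_coverage,
    weight_decay=0.25,
    weight_threshold=1.0,
)
```

Down-weighting instead of removing covered examples lets later rules still receive some credit for examples that earlier rules already explain.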
14
Experiments

Hypothesis: frequent subsequences of amino acids have better discriminative power than amino acid composition for plant proteins
Dataset (Proteome Analysis Project at UofA)
Plant: 3293 proteins, 171 extracellular
Five-fold cross-validation

15
Evaluation Matrix

Overall accuracy is not a good enough measure (class imbalance)
F-measure
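A short worked example shows why accuracy is misleading on this dataset. The sketch uses the standard F-measure definition in terms of precision and recall (the exact formula on the slide is not available here, so the `beta` parameter is an assumption) together with the dataset sizes quoted in the Experiments slide.

```python
def f_measure(tp, fp, fn, beta=1.0):
    """Standard F-measure from true positives, false positives and
    false negatives; beta = 1 gives the harmonic mean of P and R."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Dataset sizes from the Experiments slide.
total, extracellular = 3293, 171

# A trivial classifier that labels every protein as intracellular:
accuracy_all_negative = (total - extracellular) / total
f_all_negative = f_measure(tp=0, fp=0, fn=extracellular)
```

The trivial classifier reaches roughly 95% accuracy without finding a single extracellular protein, while its F-measure on the extracellular class is 0.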

16
Result (SVM with subsequences)
17
Result (Boosting with subsequences)
18
Result (Frequent Pattern)
MinLen = 3 (smallest size of a pattern), Min_gain = 0.1
MinSup = 5% (minimum support of a rule), MinConf = 80% (minimum confidence of a rule), MaxGap = 300 (maximum gap between consecutive subsequences)
Other parameters listed: minimum Z-Number, total-weight threshold, rate of weight decrease
19
Result (SVM with composition)
20
Result (Boosting with composition)
21
Cross Comparison
22
SVM with combined features
Boosting with combined features
23
Effects of MinLen on SVM
Effects of MinLen on boosting
24
Conclusion

Presented three methods for identifying extracellular proteins based on frequent subsequences of amino acids
SVM achieves the best result
The FSP method provides easily interpretable rules

25
Future Work

Use more information about proteins (e.g., structure, function, ...)
Integrate amino acid composition into the FSP method
Use of an associative classifier
Incorporate more biological knowledge
Use of the spatial location of the subsequence within the protein: beginning, end or middle of the protein, etc. (divide the protein into percentiles)
Contrast set mining to discriminate classes

26
AdaBoost
27
FOIL algorithm