Title: Classification of GPCRs at Family and Subfamily Levels
1. Classification of GPCRs at Family and Subfamily Levels
- Using Decision Trees and Naïve Bayes Classifiers
Betty Yee Man Cheng, Language Technologies Institute, CMU
Advisors: Judith Klein-Seetharaman, Jaime Carbonell
2. The Problem & Motivation
- Classify GPCRs using n-grams (see the sketch below) at the
  - Family level
  - Level I subfamily level
- Compare the performance of decision trees and a naïve Bayes classifier to the SVM and BLAST results presented in Karchin's paper
- Determine how much accuracy is lost by using much simpler classifiers
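A minimal sketch of the n-gram features used throughout these slides: overlapping substrings of the amino-acid sequence, counted per sequence. The function name and the example fragment are assumptions for illustration.

```python
# Minimal sketch: counting overlapping amino-acid n-grams in a sequence.
from collections import Counter

def ngram_counts(sequence: str, sizes=(1, 2, 3)) -> Counter:
    """Count all overlapping n-grams of the given sizes."""
    counts = Counter()
    for n in sizes:
        for i in range(len(sequence) - n + 1):
            counts[sequence[i:i + n]] += 1
    return counts

# Hypothetical fragment of a GPCR sequence.
print(ngram_counts("MNGTEGPNFY", sizes=(1, 2)))
```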
3. Baseline: Karchin et al.
- Karchin, Karplus and Haussler. Classifying G-protein coupled receptors with support vector machines. Bioinformatics, Vol. 18, No. 1, 2002, pp. 147-159.
- Compares the performance of a 1-NN classifier (BLAST), profile HMMs and SVMs in classifying GPCRs at the Level I and Level II subfamily levels (as well as the superfamily level)
- Concludes that while SVMs are the most computationally expensive, they are necessary to get annotation-quality classification
4. Decision Trees
[Decision tree diagram from: Machine Learning, Tom Mitchell, McGraw Hill, 1997. http://www-2.cs.cmu.edu/~tom/mlbook-chapter-slides.html]
5. Why Use Decision Trees?
- Easy to interpret the biological significance of the results
- Nodes of the tree tell us which features are the most discriminating
- Pruning the decision tree avoids overfitting and improves accuracy on test data
- Used the C4.5 software in this experiment (a stand-in sketch follows)
  - http://www.cse.unsw.edu.au/~quinlan/
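The slides used Quinlan's C4.5; the sketch below substitutes scikit-learn's CART-based DecisionTreeClassifier as a modern stand-in, not the original tool, with cost-complexity pruning playing the role of C4.5's pruning. The toy feature rows and labels are hypothetical.

```python
# Stand-in sketch: scikit-learn's CART tree in place of C4.5.
from sklearn.tree import DecisionTreeClassifier

# X: n-gram count vectors per sequence; y: family labels (both hypothetical).
X = [[3, 0, 1], [0, 2, 4], [1, 1, 0], [0, 3, 5]]
y = ["Class A", "Class B", "Class A", "Class B"]

# ccp_alpha > 0 enables cost-complexity pruning, analogous to C4.5's
# pruning step for reducing overfitting.
tree = DecisionTreeClassifier(ccp_alpha=0.01).fit(X, y)
print(tree.predict([[2, 0, 1]]))
```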
6Naïve Bayes Classifier
Outlk sun Temp cool Humid high Wind
strong
Used Rainbow for this experiment http//www-2.cs.c
mu.edu/mccallum/bow/
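Rainbow is a text-classification toolkit; a minimal modern stand-in for its Laplace-smoothed multinomial model over n-gram counts is scikit-learn's MultinomialNB (the Witten-Bell smoothing reported later has no direct equivalent there). Feature rows and labels are hypothetical.

```python
# Stand-in sketch: multinomial naïve Bayes over n-gram counts.
from sklearn.naive_bayes import MultinomialNB

X = [[3, 0, 1], [0, 2, 4], [1, 1, 0]]  # hypothetical n-gram count rows
y = ["Class A", "Class B", "Class A"]

nb = MultinomialNB(alpha=1.0).fit(X, y)  # alpha=1.0 is Laplace smoothing
print(nb.predict([[2, 1, 0]]))
```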
7. Family Level Classification
Family                       | # of Proteins | % of GPCRs
Class A                      | 1081 | 79.72
Class B                      | 83   | 6.12
Class C                      | 28   | 2.06
Class D                      | 11   | 0.81
Class E                      | 4    | 0.29
Class F                      | 45   | 3.32
Orphan A                     | 35   | 2.58
Orphan B                     | 2    | 0.15
Bacterial Rhodopsin          | 23   | 1.70
Drosophila Odorant Receptors | 31   | 2.29
Nematode Chemoreceptors      | 1    | 0.07
Ocular Albinism Proteins     | 2    | 0.15
Plant Mlo Receptors          | 10   | 0.74
8. Family Level: Classes A-E
- Decision Trees
  - Using counts of n-grams only
  - No sequence length information

N-grams     | Unpruned Tree (%) | Pruned Tree (%)
1-grams     | 95.70 | 95.90
1,2-grams   | 95.30 | 95.60
1,2,3-grams | 96.70 | 96.70
9. More Difficult: All Families
Decision Trees
N-grams     | Unpruned Tree (%) | Pruned Tree (%)
1-grams     | 88.60 | 89.40
1,2-grams   | 88.60 | 89.50
1,2,3-grams | 88.20 | 89.30

Naïve Bayes
N-grams       | Laplace (%) | Witten-Bell (%)
2-grams       | 86.52 | 90.30
2,3-grams     | 95.41 | 95.19
2,3,4-grams   | 90.59 | 94.44
2,3,4,5-grams | 90.15 | 93.56
10. Are Orphan A/B Really Class A/B?
- Train decision trees on proteins in all classes except Orphan A and B
- Test on the Orphan A and B proteins (a sketch of this protocol follows the table)
- Are they classified as Class A and Class B?

N-grams     | Orphan A (35 seqs)                                        | Orphan B (2 seqs)
1-grams     | 31 Class A, 3 Class B, 1 Plant                            | 1 Class B, 1 Plant
1,2-grams   | 30 Class A, 1 Class C, 1 Class D, 2 Class F, 1 Drosophila | 2 Class A
1,2,3-grams | 28 Class A, 1 Class B, 3 Class C, 3 Drosophila            | 2 Class A
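A minimal sketch of that hold-out protocol, assuming a feature matrix X and a parallel list of family labels y (hypothetical names): train on everything except the orphans, then record the family predicted for each orphan sequence.

```python
# Sketch: hold out Orphan A/B, train on the rest, classify the orphans.
from sklearn.tree import DecisionTreeClassifier

ORPHANS = frozenset({"Orphan A", "Orphan B"})

def classify_orphans(X, y):
    train = [(x, f) for x, f in zip(X, y) if f not in ORPHANS]
    test = [x for x, f in zip(X, y) if f in ORPHANS]
    Xt, yt = zip(*train)
    tree = DecisionTreeClassifier().fit(list(Xt), list(yt))
    return tree.predict(test)  # predicted family for each orphan sequence
```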
11. Feature Selection Needed
- A large number of features is problematic for most learning algorithms when training classifiers
- Reducing the number of features tends to reduce overfitting
- Reduce the number of features by feature extraction or feature selection
  - Feature extraction: new features are combinations or transformations of the given features
  - Feature selection: new features are a subset of the original features
- C4.5 cannot handle all 1, 2, 3, 4-grams
12. Some Feature Selection Methods
- Document Frequency
- Information Gain (a.k.a. Expected Mutual Information)
- Chi-Square (best in document classification)
- Correlation Coefficient
- Relevancy Score
13. Chi-Square
- χ² = Σ (Observed − Expected)² / Expected
- Expected: # of sequences in the family expected to have > i occurrences of n-gram j, if occurrence were independent of family
- Observed: # of sequences in the family with > i occurrences of n-gram j
- For each n-gram x, find the threshold i_x with the largest chi-square value (a sketch follows)
- Sort the n-grams by these chi-square values
- Binary features: for each chosen n-gram x, whether x occurs more than i_x times
- Frequency features: the frequency of each chosen n-gram x
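A minimal sketch of that selection score as the slide describes it; the function name and data layout (one count dictionary per sequence, a parallel list of family labels) are assumptions.

```python
# Sketch: chi-square score for one n-gram, maximised over thresholds i.
from collections import defaultdict

def chi_square_score(counts, families, ngram):
    """counts: per-sequence dicts of n-gram -> count; families: labels.
    Returns (best chi-square value, best threshold i_x)."""
    n = len(counts)
    fam_sizes = defaultdict(int)
    for f in families:
        fam_sizes[f] += 1
    best = (0.0, 0)
    max_count = max(c.get(ngram, 0) for c in counts)
    for i in range(max_count):            # candidate thresholds
        per_fam = defaultdict(int)        # observed: # seqs with > i occurrences
        total = 0
        for c, f in zip(counts, families):
            if c.get(ngram, 0) > i:
                per_fam[f] += 1
                total += 1
        chi2 = 0.0
        for f, size in fam_sizes.items():
            expected = total * size / n   # expected if independent of family
            if expected > 0:
                chi2 += (per_fam[f] - expected) ** 2 / expected
        if chi2 > best[0]:
            best = (chi2, i)
    return best
```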
14. Effect of Chi²-Selected Binary 1,2,3-grams in Decision Trees
[chart]

15. Effect of Chi²-Selected Frequency 1,2,3-grams in Decision Trees
[chart]

16. Effect of Chi²-Selected Binary 1,2,3-grams in Naïve Bayes
[chart]

17. Effect of Chi²-Selected Frequency 1,2,3-grams in Naïve Bayes
[chart]
18. 4-grams Not Useful!
- According to chi-square, 4-grams are not useful in GPCR classification at the family level
[Table: top X n-grams by chi-square vs. the percentage of them that are of size N]
- Number of 1, 2, and 3-grams: 21 + 21² + 21³ = 9,723 n-grams over a 21-letter amino-acid alphabet
19. Sequence Length is Not Discriminating at Family Level
- Decision trees using 1, 2, 3-grams and sequence length
- Length was used in only 1 out of 10 folds, at level 11 in the decision tree
- Adding length changed accuracy only from 89.3% to 89.4%
20. Sequence Length is Not Discriminating at Family Level (cont.)
[chart]
21. Family Level: Our Results
- BLAST as a 1-NN classifier: 94.32% (a sketch of this use of BLAST follows)
  - Uses the SWISS-PROT database
- Decision Tree: 91%
  - 600 top 1,2,3-gram frequencies from chi-square
- Naïve Bayes: 96%
  - 2500 top 1,2,3-gram frequency features from chi-square
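A minimal sketch of using BLAST as a 1-NN classifier: the query's predicted label is the label of its best-scoring database hit. It assumes the modern NCBI BLAST+ tools (not the era's blastall), a labelled database named gpcr_db built with makeblastdb, and a hypothetical labels lookup table.

```python
# Sketch: BLAST as a 1-NN classifier (label of the best hit wins).
import subprocess

def blast_1nn(query_fasta: str, labels: dict) -> str:
    out = subprocess.run(
        ["blastp", "-query", query_fasta, "-db", "gpcr_db",
         "-outfmt", "6 sseqid bitscore", "-max_target_seqs", "1"],
        capture_output=True, text=True, check=True,
    ).stdout
    top_hit = out.split()[0]   # subject id of the best-scoring hit
    return labels[top_hit]     # its family label is the prediction
```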
22. Level I Subfamily Classification
- 15 Class A Level I subfamilies
  - 1207 sequences
- 4 Class C Level I subfamilies
  - 62 sequences
- Other sequences
  - 149 sequences
  - Archaea rhodopsins, G-alpha proteins
- 2-fold cross validation, with the dataset split as given in the paper (an evaluation sketch follows)
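A minimal sketch of the 2-fold evaluation, with hypothetical feature rows and subfamily labels; note the slides reuse the exact split published by Karchin et al., whereas scikit-learn's default here is a stratified split of its own.

```python
# Sketch: 2-fold cross-validated accuracy for the subfamily task.
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB

X = [[3, 0], [0, 2], [1, 1], [0, 3]]      # hypothetical n-gram features
y = ["opsin", "amine", "opsin", "amine"]  # hypothetical subfamily labels

scores = cross_val_score(MultinomialNB(), X, y, cv=2)
print(scores.mean())
```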
23. Level I Subfamily Results
Classifier                            | Accuracy (%)
SVM                                   | 88.4
BLAST                                 | 83.3
SAM-T2K HMM                           | 69.9
kernNN                                | 64.0
Decision Trees (700 freq 1,2,3-grams) | 78.40 (77.25 without chi-square)
Naïve Bayes (2500 freq 1,2,3-grams)   | 89.51
24. Level II Subfamily Classification
- 65 Class A Level II subfamilies
  - 1133 sequences
- 6 Class C Level II subfamilies
  - 37 sequences
- Other sequences
  - 248 sequences
  - Archaea rhodopsins
  - G-alpha proteins
  - Class A and Class C sequences with no Level II subfamily classification, or in Level II subfamilies with only 1 member
- 2-fold cross validation
25. Level II Subfamily Results
Classifier                         | Accuracy (%)
SVM                                | 86.3
BLAST                              | 74.5
SAM-T2K HMM                        | 70.0
kernNN                             | 51.0
Decision Trees (600 freq features) | 69.0 (66.0 without chi-square)
Naïve Bayes (2500 freq features)   | 82.36
26. Conclusions
- The Naïve Bayes classifier seems to perform better than decision trees in classifying GPCRs, especially when combined with chi-square feature selection
- At the Level I subfamily level, Naïve Bayes surprisingly did better than the SVM!
- At the Level II subfamily level, it seems an SVM may be needed to gain the remaining 4% accuracy. However, further experiments are needed to check this, as varying the number of features in Naïve Bayes may make up the difference.
27. References
- C4.5 Software: http://www.cse.unsw.edu.au/~quinlan/
- Karchin, Karplus and Haussler. Classifying G-protein coupled receptors with support vector machines. Bioinformatics, Vol. 18, No. 1, 2002, pp. 147-159.
- Mitchell, Tom. Machine Learning. McGraw Hill, 1997. http://www-2.cs.cmu.edu/~tom/mlbook-chapter-slides.html
- Rainbow Software: http://www-2.cs.cmu.edu/~mccallum/bow/
- Sebastiani, Fabrizio. A Tutorial on Automated Text Categorisation. Proceedings of ASAI-99, 1999. http://citeseer.nj.nec.com/sebastiani99tutorial.html