1
Classification of GPCRs at Family and Subfamily Levels
  • Using Decision Trees & Naïve Bayes Classifiers

Betty Yee Man Cheng
Language Technologies Institute, CMU
Advisors: Judith Klein-Seetharaman, Jaime Carbonell
2
The Problem & Motivation
  • Classify GPCRs using n-grams at the family level
    and the Level I subfamily level (feature
    extraction sketched below)
  • Compare the performance of decision trees and a
    naïve Bayes classifier to the SVM and BLAST
    results presented in Karchin's paper
  • Determine how much accuracy is lost by using
    much simpler classifiers
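
For concreteness, a minimal sketch of n-gram feature extraction from a protein sequence (the function and the example call are illustrative, not the original experimental code):

```python
def ngram_counts(sequence, n_max=3):
    """Count every 1- to n_max-gram (substring) in a protein sequence."""
    counts = {}
    for n in range(1, n_max + 1):
        for i in range(len(sequence) - n + 1):
            gram = sequence[i:i + n]
            counts[gram] = counts.get(gram, 0) + 1
    return counts

# Example: the first 20 residues of bovine rhodopsin, a Class A GPCR.
features = ngram_counts("MNGTEGPNFYVPFSNKTGVV")
print(features["G"], features.get("GT", 0))   # 3 1
```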

3
Baseline: Karchin et al. Paper
  • Karchin, Karplus and Haussler. Classifying
    G-protein coupled receptors with support vector
    machines. Bioinformatics, Vol. 18, No. 1, 2002,
    pp. 147-159.
  • Compares the performance of a 1-NN classifier
    (BLAST), profile HMMs and SVMs in classifying
    GPCRs at the Level I and Level II subfamily
    levels (as well as the superfamily level)
  • Concludes that while SVMs are the most
    computationally expensive, they are necessary to
    get annotation-quality classification

4
Decision Trees
[Decision tree diagram from: Machine Learning, Tom
Mitchell, McGraw Hill, 1997.
http://www-2.cs.cmu.edu/~tom/mlbook-chapter-slides.html]
5
Why use Decision Trees?
  • Easy to interpret the biological significance of
    the results
  • The nodes of the tree tell us which features are
    the most discriminating
  • The decision tree can be pruned to avoid
    overfitting and improve accuracy on test data
  • We use the C4.5 software in this experiment
    (an equivalent setup is sketched below)
  • http://www.cse.unsw.edu.au/~quinlan/
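
The experiments used Quinlan's C4.5; as an illustrative stand-in (an assumption, not the original setup), scikit-learn's decision tree with the entropy criterion and cost-complexity pruning captures the same ideas:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Toy stand-in for the n-gram count matrix: 100 sequences x 50 features.
rng = np.random.default_rng(0)
X = rng.integers(0, 10, size=(100, 50))
y = rng.integers(0, 2, size=100)

# criterion="entropy" mirrors C4.5's information-gain splits;
# ccp_alpha > 0 turns on post-pruning, analogous to C4.5's pruned trees.
tree = DecisionTreeClassifier(criterion="entropy", ccp_alpha=0.01)
tree.fit(X, y)

# The root split is the single most discriminating feature, which is
# what makes the tree easy to interpret biologically.
print("root splits on feature index", tree.tree_.feature[0])
```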

6
Naïve Bayes Classifier
[Worked example: classifying the instance Outlook =
sunny, Temperature = cool, Humidity = high, Wind = strong]
Used Rainbow for this experiment:
http://www-2.cs.cmu.edu/~mccallum/bow/
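
A minimal sketch of the same idea with a multinomial naïve Bayes over n-gram counts (scikit-learn is an illustrative stand-in for Rainbow, and the data are toy values; alpha=1.0 gives the Laplace smoothing compared on a later slide, while Witten-Bell smoothing is specific to Rainbow):

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Toy n-gram count matrix: 200 sequences x 100 n-gram features.
rng = np.random.default_rng(1)
X = rng.integers(0, 5, size=(200, 100))
y = rng.integers(0, 3, size=200)   # three hypothetical family labels

# alpha=1.0 is Laplace (add-one) smoothing of the per-class
# n-gram probabilities.
nb = MultinomialNB(alpha=1.0)
nb.fit(X, y)
print(nb.predict(X[:5]))
```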
7
Family Level Classification
Family                         # of Proteins   % of GPCRs
Class A                        1081            79.72
Class B                        83              6.12
Class C                        28              2.06
Class D                        11              0.81
Class E                        4               0.29
Class F                        45              3.32
Orphan A                       35              2.58
Orphan B                       2               0.15
Bacterial Rhodopsin            23              1.70
Drosophila Odorant Receptors   31              2.29
Nematode Chemoreceptors        1               0.07
Ocular Albinism Proteins       2               0.15
Plant Mlo Receptors            10              0.74
8
Family Level: Classes A-E
  • Decision Trees
  • Using counts of n-grams only
  • No sequence length information

Accuracy (%):
N-grams         Unpruned Tree   Pruned Tree
1-grams         95.70           95.90
1, 2-grams      95.30           95.60
1, 2, 3-grams   96.70           96.70
9
More Difficult: All Families

Decision Trees, Accuracy (%):
N-grams         Unpruned Tree   Pruned Tree
1-grams         88.60           89.40
1, 2-grams      88.60           89.50
1, 2, 3-grams   88.20           89.30

Naïve Bayes, Accuracy (%):
N-grams            Laplace   Witten-Bell
2-grams            86.52     90.30
2, 3-grams         95.41     95.19
2, 3, 4-grams      90.59     94.44
2, 3, 4, 5-grams   90.15     93.56
10
Are Orphans A/B Classified as Classes A/B?
  • Train decision trees on proteins in all classes
    except Orphan A and B
  • Test on Orphan A and B proteins
  • Are they classified as Class A and Class B?

N-grams         Orphan A (35 sequences)             Orphan B (2 sequences)
1-grams         31 Class A, 3 Class B, 1 Plant      1 Class B, 1 Plant
1, 2-grams      30 Class A, 1 Class C, 1 Class D,   2 Class A
                2 Class F, 1 Drosophila
1, 2, 3-grams   28 Class A, 1 Class B, 3 Class C,   2 Class A
                3 Drosophila
11
Feature Selection Needed
  • A large number of features is problematic for
    most learning algorithms used to train
    classifiers
  • Reducing the number of features tends to reduce
    overfitting
  • The number of features can be reduced by feature
    extraction or feature selection
  • Feature extraction: new features are
    combinations or transformations of the given
    features
  • Feature selection: new features are a subset of
    the original features
  • C4.5 cannot handle all 1, 2, 3, 4-grams

12
Some Feature Selection Methods
  • Document Frequency
  • Information Gain (a.k.a. Expected Mutual
    Information)
  • Chi-Square (best in document classification;
    used in this work)
  • Correlation Coefficient
  • Relevancy Score
13
Chi-Square
  • Chi-square = sum of (Observed - Expected)² / Expected
  • Observed: # of sequences in the family with > i
    occurrences of n-gram j
  • Expected: the count expected if n-gram j were
    independent of the family
  • For each n-gram x, find the threshold i_x with
    the largest chi-square value
  • Sort n-grams according to these chi-square values
  • Binary features: for each chosen n-gram x,
    whether x has more than i_x occurrences
  • Frequency features: the frequency of each chosen
    n-gram x (a sketch of this selection procedure
    follows below)
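
A sketch of one plausible reading of this procedure; the exact definition of the expected counts is an assumption inferred from the standard chi-square statistic, and all names are illustrative:

```python
import numpy as np

def chi_square_score(counts, labels, threshold):
    """Chi-square value of the binary feature "more than `threshold`
    occurrences of one n-gram": sum over families of
    (Observed - Expected)^2 / Expected, where Expected is taken as
    the overall exceedance rate times the family size (an assumption).
    """
    counts, labels = np.asarray(counts), np.asarray(labels)
    exceeds = counts > threshold
    rate = exceeds.mean()
    score = 0.0
    for fam in np.unique(labels):
        in_fam = labels == fam
        expected = rate * in_fam.sum()
        observed = exceeds[in_fam].sum()
        if expected > 0:
            score += (observed - expected) ** 2 / expected
    return score

# For a single n-gram, pick the threshold i with the largest chi-square.
counts = [0, 2, 5, 1, 0, 7, 3]                # occurrences per sequence
labels = ["A", "A", "B", "A", "B", "B", "A"]  # family of each sequence
best_i = max(range(max(counts)),
             key=lambda i: chi_square_score(counts, labels, i))
print("best threshold:", best_i)
```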

14
Effect of Chi-Square-Selected Binary 1,2,3-grams in
Decision Trees
[results chart]
15
Effect of Chi-Square-Selected Frequency 1,2,3-grams in
Decision Trees
[results chart]
16
Effect of Chi-Square-Selected Binary 1,2,3-grams in
Naïve Bayes
[results chart]
17
Effect of Chi-Square-Selected Frequency 1,2,3-grams in
Naïve Bayes
[results chart]
18
4-grams Not Useful!
  • 4-grams are not useful in GPCR classification at
    the family level, according to chi-square

[Chart: of the top X chi-square-ranked n-grams, the
percentage that are of size N]
  • Number of 1-, 2- and 3-grams:
    21 + 21² + 21³ = 9,723 n-grams

19
Sequence Length is not Discriminating at Family
Level
  • Decision trees using 1, 2, 3-grams and sequence
    length
  • Length was used in only 1 out of 10 folds, at
    depth 11 of the decision tree
  • Adding length changed accuracy from 89.3% to
    89.4%

20
Sequence Length is not Discriminating at Family
Level
[chart]
21
Family Level: Our Results
  • BLAST as a 1-NN classifier (sketched below): 94.32%
    (uses the SWISS-PROT database)
  • Decision Tree: 91%
    (600 top 1,2,3-gram frequency features from chi-square)
  • Naïve Bayes: 96%
    (2500 top 1,2,3-gram frequency features from chi-square)
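
A sketch of BLAST used as a 1-NN classifier, assuming a local NCBI BLAST+ installation and a protein database built from the labelled training sequences (the helper name and label map are hypothetical, and the original work predates BLAST+):

```python
import subprocess

def blast_1nn_label(query_fasta, db, labels):
    """Assign the query the family label of its top BLAST hit.

    `db` is a BLAST protein database built from the training set;
    `labels` maps training-sequence IDs to family labels.
    """
    out = subprocess.run(
        ["blastp", "-query", query_fasta, "-db", db,
         "-outfmt", "6 sseqid", "-max_target_seqs", "1"],
        capture_output=True, text=True, check=True,
    )
    hits = out.stdout.split()
    if not hits:
        return None          # no neighbour found in the database
    return labels[hits[0]]   # label of the nearest neighbour
```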

22
Level I Subfamily Classification
  • 15 Class A Level I subfamilies (1207 sequences)
  • 4 Class C Level I subfamilies (62 sequences)
  • Other sequences (149): archaea rhodopsins,
    G-alpha proteins
  • 2-fold cross-validation, dataset split as given
    in the paper

23
Level I Subfamily Results
Classifier                                Accuracy (%)
SVM                                       88.4
BLAST                                     83.3
SAM-T2K HMM                               69.9
kernNN                                    64.0
Decision Trees (700 freq. 1,2,3-grams)    78.40 (77.25 without chi-square)
Naïve Bayes (2500 freq. 1,2,3-grams)      89.51
24
Level II Subfamily Classification
  • 65 Class A Level II subfamilies (1133 sequences)
  • 6 Class C Level II subfamilies (37 sequences)
  • Other sequences (248): archaea rhodopsins,
    G-alpha proteins, and Class A and Class C
    sequences with no Level II subfamily
    classification or in Level II subfamilies with
    only 1 member
  • 2-fold cross-validation

25
Level II Subfamily Results
Classifier                                Accuracy (%)
SVM                                       86.3
BLAST                                     74.5
SAM-T2K HMM                               70.0
kernNN                                    51.0
Decision Trees (600 freq. features)       69.0 (66.0 without chi-square)
Naïve Bayes (2500 freq. features)         82.36
26
Conclusions
  • The Naïve Bayes classifier seems to perform
    better than decision trees in classifying GPCRs,
    especially when used with chi-square feature
    selection
  • At the Level I subfamily level, Naïve Bayes
    surprisingly did better than the SVM!
  • At the Level II subfamily level, it seems an SVM
    may be needed to gain an additional 4% accuracy.
    However, further experiments are needed, as
    varying the number of features in Naïve Bayes
    may make up the difference.

27
References
  • C4.5 software: http://www.cse.unsw.edu.au/~quinlan/
  • Karchin, Karplus and Haussler. Classifying
    G-protein coupled receptors with support vector
    machines. Bioinformatics, Vol. 18, No. 1, 2002,
    pp. 147-159.
  • Mitchell, Tom. Machine Learning. McGraw Hill, 1997.
    http://www-2.cs.cmu.edu/~tom/mlbook-chapter-slides.html
  • Rainbow software: http://www-2.cs.cmu.edu/~mccallum/bow/
  • Sebastiani, Fabrizio. A Tutorial on Automated
    Text Categorisation. Proceedings of ASAI-99, 1999.
    http://citeseer.nj.nec.com/sebastiani99tutorial.html