1
Classification of GPCRs at Family and Subfamily Levels
  • Using Decision Trees & Naïve Bayes Classifiers

Betty Yee Man Cheng
Language Technologies Institute, CMU
Advisors: Judith Klein-Seetharaman, Jaime Carbonell
2
The Problem & Motivation
  • Classify GPCRs using n-grams at the family level
    and the Level I subfamily level (feature
    extraction sketched below)
  • Compare the performance of decision trees and a
    naïve Bayes classifier to the SVM and BLAST
    results presented in Karchin's paper
  • Determine how much accuracy is lost by using
    much simpler classifiers
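
For concreteness, a minimal sketch of n-gram feature extraction from a protein sequence (the function and the example call are illustrative, not the original experimental code):

```python
def ngram_counts(sequence, n_max=3):
    """Count every 1- to n_max-gram (substring) in a protein sequence."""
    counts = {}
    for n in range(1, n_max + 1):
        for i in range(len(sequence) - n + 1):
            gram = sequence[i:i + n]
            counts[gram] = counts.get(gram, 0) + 1
    return counts

# Example: the first 20 residues of bovine rhodopsin, a Class A GPCR.
features = ngram_counts("MNGTEGPNFYVPFSNKTGVV")
print(features["G"], features.get("GT", 0))   # 3 1
```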

3
Baseline: Karchin et al. Paper
  • Karchin, Karplus and Haussler. Classifying
    G-protein coupled receptors with support vector
    machines. Bioinformatics, Vol. 18, No. 1, 2002,
    pp. 147-159.
  • Compares the performance of a 1-NN classifier
    (BLAST), profile HMMs and SVMs in classifying
    GPCRs at the Level I and Level II subfamily
    levels (as well as the superfamily level)
  • Concludes that while SVMs are the most
    computationally expensive, they are necessary to
    get annotation-quality classification

4
Decision Trees
[Decision tree diagram from: Machine Learning, Tom
Mitchell, McGraw Hill, 1997.
http://www-2.cs.cmu.edu/~tom/mlbook-chapter-slides.html]
5
Why use Decision Trees?
  • Easy to interpret the biological significance of
    the results
  • The nodes of the tree tell us which features are
    the most discriminating
  • The decision tree can be pruned to avoid
    overfitting and improve accuracy on test data
  • We use the C4.5 software in this experiment
    (an equivalent setup is sketched below)
  • http://www.cse.unsw.edu.au/~quinlan/
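
The experiments used Quinlan's C4.5; as an illustrative stand-in (an assumption, not the original setup), scikit-learn's decision tree with the entropy criterion and cost-complexity pruning captures the same ideas:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Toy stand-in for the n-gram count matrix: 100 sequences x 50 features.
rng = np.random.default_rng(0)
X = rng.integers(0, 10, size=(100, 50))
y = rng.integers(0, 2, size=100)

# criterion="entropy" mirrors C4.5's information-gain splits;
# ccp_alpha > 0 turns on post-pruning, analogous to C4.5's pruned trees.
tree = DecisionTreeClassifier(criterion="entropy", ccp_alpha=0.01)
tree.fit(X, y)

# The root split is the single most discriminating feature, which is
# what makes the tree easy to interpret biologically.
print("root splits on feature index", tree.tree_.feature[0])
```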

6
Naïve Bayes Classifier
[Worked example: classifying the instance Outlook =
sunny, Temperature = cool, Humidity = high, Wind = strong]
Used Rainbow for this experiment:
http://www-2.cs.cmu.edu/~mccallum/bow/
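
A minimal sketch of the same idea with a multinomial naïve Bayes over n-gram counts (scikit-learn is an illustrative stand-in for Rainbow, and the data are toy values; alpha=1.0 gives the Laplace smoothing compared on a later slide, while Witten-Bell smoothing is specific to Rainbow):

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Toy n-gram count matrix: 200 sequences x 100 n-gram features.
rng = np.random.default_rng(1)
X = rng.integers(0, 5, size=(200, 100))
y = rng.integers(0, 3, size=200)   # three hypothetical family labels

# alpha=1.0 is Laplace (add-one) smoothing of the per-class
# n-gram probabilities.
nb = MultinomialNB(alpha=1.0)
nb.fit(X, y)
print(nb.predict(X[:5]))
```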
7
Family Level Classification
Family                         # of Proteins   % of GPCRs
Class A                        1081            79.72
Class B                        83              6.12
Class C                        28              2.06
Class D                        11              0.81
Class E                        4               0.29
Class F                        45              3.32
Orphan A                       35              2.58
Orphan B                       2               0.15
Bacterial Rhodopsin            23              1.70
Drosophila Odorant Receptors   31              2.29
Nematode Chemoreceptors        1               0.07
Ocular Albinism Proteins       2               0.15
Plant Mlo Receptors            10              0.74
8
Family Level: Classes A-E
  • Decision Trees
  • Using counts of n-grams only
  • No sequence length information

Accuracy (%):
N-grams         Unpruned Tree   Pruned Tree
1-grams         95.70           95.90
1, 2-grams      95.30           95.60
1, 2, 3-grams   96.70           96.70
9
More Difficult: All Families

Decision Trees, Accuracy (%):
N-grams         Unpruned Tree   Pruned Tree
1-grams         88.60           89.40
1, 2-grams      88.60           89.50
1, 2, 3-grams   88.20           89.30

Naïve Bayes, Accuracy (%):
N-grams            Laplace   Witten-Bell
2-grams            86.52     90.30
2, 3-grams         95.41     95.19
2, 3, 4-grams      90.59     94.44
2, 3, 4, 5-grams   90.15     93.56
10
Are Orphans A/B Classified as Classes A/B?
  • Train decision trees on proteins in all classes
    except Orphan A and B
  • Test on Orphan A and B proteins
  • Are they classified as Class A and Class B?

N-grams         Orphan A (35 sequences)             Orphan B (2 sequences)
1-grams         31 Class A, 3 Class B, 1 Plant      1 Class B, 1 Plant
1, 2-grams      30 Class A, 1 Class C, 1 Class D,   2 Class A
                2 Class F, 1 Drosophila
1, 2, 3-grams   28 Class A, 1 Class B, 3 Class C,   2 Class A
                3 Drosophila
11
Feature Selection Needed
  • A large number of features is problematic for
    most learning algorithms used to train
    classifiers
  • Reducing the number of features tends to reduce
    overfitting
  • The number of features can be reduced by feature
    extraction or feature selection
  • Feature extraction: new features are
    combinations or transformations of the given
    features
  • Feature selection: new features are a subset of
    the original features
  • C4.5 cannot handle all 1, 2, 3, 4-grams

12
Some Feature Selection Methods
  • Document Frequency
  • Information Gain (a.k.a. Expected Mutual
    Information)
  • Chi-Square (best in document classification;
    used in this work)
  • Correlation Coefficient
  • Relevancy Score
13
Chi-Square
  • Chi-square = sum of (Observed - Expected)² / Expected
  • Observed: # of sequences in the family with > i
    occurrences of n-gram j
  • Expected: the count expected if n-gram j were
    independent of the family
  • For each n-gram x, find the threshold i_x with
    the largest chi-square value
  • Sort n-grams according to these chi-square values
  • Binary features: for each chosen n-gram x,
    whether x has more than i_x occurrences
  • Frequency features: the frequency of each chosen
    n-gram x (a sketch of this selection procedure
    follows below)
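
A sketch of one plausible reading of this procedure; the exact definition of the expected counts is an assumption inferred from the standard chi-square statistic, and all names are illustrative:

```python
import numpy as np

def chi_square_score(counts, labels, threshold):
    """Chi-square value of the binary feature "more than `threshold`
    occurrences of one n-gram": sum over families of
    (Observed - Expected)^2 / Expected, where Expected is taken as
    the overall exceedance rate times the family size (an assumption).
    """
    counts, labels = np.asarray(counts), np.asarray(labels)
    exceeds = counts > threshold
    rate = exceeds.mean()
    score = 0.0
    for fam in np.unique(labels):
        in_fam = labels == fam
        expected = rate * in_fam.sum()
        observed = exceeds[in_fam].sum()
        if expected > 0:
            score += (observed - expected) ** 2 / expected
    return score

# For a single n-gram, pick the threshold i with the largest chi-square.
counts = [0, 2, 5, 1, 0, 7, 3]                # occurrences per sequence
labels = ["A", "A", "B", "A", "B", "B", "A"]  # family of each sequence
best_i = max(range(max(counts)),
             key=lambda i: chi_square_score(counts, labels, i))
print("best threshold:", best_i)
```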

14
Effect of Chi-Square-Selected Binary 1,2,3-grams in
Decision Trees
[results chart]
15
Effect of Chi-Square-Selected Frequency 1,2,3-grams in
Decision Trees
[results chart]
16
Effect of Chi-Square-Selected Binary 1,2,3-grams in
Naïve Bayes
[results chart]
17
Effect of Chi-Square-Selected Frequency 1,2,3-grams in
Naïve Bayes
[results chart]
18
4-grams Not Useful!
  • 4-grams are not useful in GPCR classification at
    the family level, according to chi-square

[Chart: of the top X chi-square-ranked n-grams, the
percentage that are of size N]
  • Number of 1-, 2- and 3-grams:
    21 + 21² + 21³ = 9,723 n-grams

19
Sequence Length is not Discriminating at Family
Level
  • Decision trees using 1, 2, 3-grams and sequence
    length
  • Length was used in only 1 out of 10 folds, at
    depth 11 of the decision tree
  • Adding length changed accuracy from 89.3% to
    89.4%

20
Sequence Length is not Discriminating at Family
Level
[chart]
21
Family Level: Our Results
  • BLAST as a 1-NN classifier (sketched below): 94.32%
    (uses the SWISS-PROT database)
  • Decision Tree: 91%
    (600 top 1,2,3-gram frequency features from chi-square)
  • Naïve Bayes: 96%
    (2500 top 1,2,3-gram frequency features from chi-square)
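
A sketch of BLAST used as a 1-NN classifier, assuming a local NCBI BLAST+ installation and a protein database built from the labelled training sequences (the helper name and label map are hypothetical, and the original work predates BLAST+):

```python
import subprocess

def blast_1nn_label(query_fasta, db, labels):
    """Assign the query the family label of its top BLAST hit.

    `db` is a BLAST protein database built from the training set;
    `labels` maps training-sequence IDs to family labels.
    """
    out = subprocess.run(
        ["blastp", "-query", query_fasta, "-db", db,
         "-outfmt", "6 sseqid", "-max_target_seqs", "1"],
        capture_output=True, text=True, check=True,
    )
    hits = out.stdout.split()
    if not hits:
        return None          # no neighbour found in the database
    return labels[hits[0]]   # label of the nearest neighbour
```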

22
Level I Subfamily Classification
  • 15 Class A Level I subfamilies (1207 sequences)
  • 4 Class C Level I subfamilies (62 sequences)
  • Other sequences (149): archaea rhodopsins,
    G-alpha proteins
  • 2-fold cross-validation, dataset split as given
    in the paper

23
Level I Subfamily Results
Classifier                                Accuracy (%)
SVM                                       88.4
BLAST                                     83.3
SAM-T2K HMM                               69.9
kernNN                                    64.0
Decision Trees (700 freq. 1,2,3-grams)    78.40 (77.25 without chi-square)
Naïve Bayes (2500 freq. 1,2,3-grams)      89.51
24
Level II Subfamily Classification
  • 65 Class A Level II subfamilies (1133 sequences)
  • 6 Class C Level II subfamilies (37 sequences)
  • Other sequences (248): archaea rhodopsins,
    G-alpha proteins, and Class A and Class C
    sequences with no Level II subfamily
    classification or in Level II subfamilies with
    only 1 member
  • 2-fold cross-validation

25
Level II Subfamily Results
Classifier                                Accuracy (%)
SVM                                       86.3
BLAST                                     74.5
SAM-T2K HMM                               70.0
kernNN                                    51.0
Decision Trees (600 freq. features)       69.0 (66.0 without chi-square)
Naïve Bayes (2500 freq. features)         82.36
26
Conclusions
  • The Naïve Bayes classifier seems to perform
    better than decision trees in classifying GPCRs,
    especially when used with chi-square feature
    selection
  • At the Level I subfamily level, Naïve Bayes
    surprisingly did better than the SVM!
  • At the Level II subfamily level, it seems an SVM
    may be needed to gain an additional 4% accuracy.
    However, further experiments are needed, as
    varying the number of features in Naïve Bayes
    may make up the difference.

27
References
  • C4.5 software: http://www.cse.unsw.edu.au/~quinlan/
  • Karchin, Karplus and Haussler. Classifying
    G-protein coupled receptors with support vector
    machines. Bioinformatics, Vol. 18, No. 1, 2002,
    pp. 147-159.
  • Mitchell, Tom. Machine Learning. McGraw Hill, 1997.
    http://www-2.cs.cmu.edu/~tom/mlbook-chapter-slides.html
  • Rainbow software: http://www-2.cs.cmu.edu/~mccallum/bow/
  • Sebastiani, Fabrizio. A Tutorial on Automated
    Text Categorisation. Proceedings of ASAI-99, 1999.
    http://citeseer.nj.nec.com/sebastiani99tutorial.html