Boosting For Tumor Classification With Gene Expression Data

1
Boosting For Tumor Classification With Gene
Expression Data
  • Marcel Dettling and Peter Bühlmann
  • Bioinformatics, Vol. 19, No. 9, 2003, 1061-1069

Hojung Cho, Topics for Bioinformatics, Oct 10, 2006
2
Outline
  • Background
    • Microarrays
    • Scoring Algorithm
    • Decision Stumps
    • Boosting (primer)
  • Methods
    • Feature preselection
    • LogitBoost
    • Choice of the stopping parameters
    • Multiclass approach
  • Results
    • Data preprocessing
    • Error rates
    • ROC curves
    • Validation of the results
    • Simulation
  • Discussion

3
Microarray data (p ≫ n)
(Park et al., 2001)
Boosting of decision trees has been applied to the classification of gene expression data (Ben-Dor et al., 2000; Dudoit et al., 2002), but AdaBoost did not yield good results compared with other classifiers.
4
  • The objective
    • Improve the performance of boosting for the classification of gene expression data by modifying the algorithm
  • The strategies
    • Feature preselection: a nonparametric scoring method
    • Binary LogitBoost with decision stumps
    • Multiclass approach: reduce to multiple binary classifications

5
Background: Scoring Algorithm
  • Nonparametric
    • Allows data analysis without assuming an underlying distribution
  • Feature preselection
    • Score each gene according to its strength for phenotype discrimination
    • Consider only genes differentially expressed across samples

6
Scoring Algorithm
  • Sort the expression levels of a gene across all samples
  • Group membership is given by the resulting sequence of 0s and 1s
  • The correspondence between expression levels and group membership is measured by how well the labels cluster together
  • The score is defined as the smallest number of swaps of consecutive digits necessary to arrive at a perfect splitting (in the slide's example the score is 4); a minimal code sketch follows below
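A minimal sketch of this swap-count score, as I read the slide (not the authors' code): after sorting the samples by expression value, the number of swaps of consecutive digits needed to push every 1 past every 0 equals the number of (1, 0) label inversions, so it can be counted directly.

```python
import numpy as np

def swap_score(expression, labels):
    """Score one gene: the smallest number of swaps of consecutive class
    labels (0/1), after sorting the samples by expression level, needed
    to reach a perfect splitting with all 0s before all 1s."""
    order = np.argsort(expression)      # sort samples by expression level
    seq = np.asarray(labels)[order]     # resulting sequence of 0s and 1s
    swaps, ones_seen = 0, 0
    for lab in seq:                     # each (1 before 0) pair costs one adjacent swap
        if lab == 1:
            ones_seen += 1
        else:
            swaps += ones_seen
    return swaps

# A perfectly separated gene scores 0; one separated in the other direction
# scores n0 * n1 -- both extremes indicate differential expression.
print(swap_score([0.1, 0.2, 0.9, 1.1, 1.3], [0, 0, 1, 1, 1]))  # 0
print(swap_score([0.1, 0.2, 0.9, 1.1, 1.3], [1, 1, 0, 0, 0]))  # 6 = 3 * 2
```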

7
Background: Scoring Algorithm
  • Allows ordering of genes according to their potential significance
  • Captures to what extent a gene discriminates the response categories
  • Both a near-zero score and the maximum score n0 · n1 (the product of the two class sizes) indicate a differentially expressed gene
  • Quality measure: the boosting classifier is restricted to work with the subset of top-scoring genes

8
Background: Decision Stumps
[Figure: a decision stump tests a single attribute and assigns Label A or Label B]
  • Decision trees with only a single split
    • The presence or absence of a single term as a predicate
    • e.g. predict "basketball player" if and only if height > 2 m
  • Base learner for the boosting procedure
  • Weak learner
    • A subroutine that returns a hypothesis given some finite training set
    • Performs slightly better than random guessing; may be enhanced when combined with the feature preselection (a sketch follows below)

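A hedged sketch of a decision stump as a weak learner, matching the height example above. It picks the single feature and threshold that minimize the weighted misclassification error; inside LogitBoost (sketched later) stumps are instead fit by weighted least squares, so this version is purely illustrative and its names are mine.

```python
import numpy as np

def fit_stump(X, y, w):
    """Fit a decision stump (one split on one feature) by minimizing the
    weighted misclassification error.  Labels y are 0/1, w are sample
    weights.  Returns (error, feature index, threshold, sign); sign = +1
    predicts 1 when x > threshold, sign = -1 predicts 1 when x <= threshold."""
    best = (np.inf, 0, 0.0, 1)
    for j in range(X.shape[1]):
        for t in np.unique(X[:, j]):
            pred = (X[:, j] > t).astype(int)
            err = np.sum(w * (pred != y))
            # also consider the flipped rule (predict 1 when x <= t)
            for sign, e in ((1, err), (-1, np.sum(w) - err)):
                if e < best[0]:
                    best = (e, j, t, sign)
    return best

# Toy version of the slide's example: predict "basketball player"
# if and only if height > 2 m (heights in metres).
X = np.array([[1.70], [1.80], [2.02], [2.10]])
y = np.array([0, 0, 1, 1])
print(fit_stump(X, y, np.ones(4) / 4))   # best split: feature 0 at 1.8 m, zero weighted error
```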
9
Background: Boosting
  • AdaBoost does not yield very impressive results
  • LogitBoost
    • Relies on the binomial log-likelihood as the loss function (contrasted with AdaBoost's exponential loss in the sketch below)
    • Found to have a slight edge over AdaBoost in many classification problems
    • Usually performs better on noisy data, or when there are misspecifications or inhomogeneities of the class labels in the training data (common in microarray data)
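To make the loss-function point concrete, here is a small comparison of the two losses in their standard textbook form (Friedman, Hastie and Tibshirani, 2000); the formulas are assumed from that reference, not quoted from the slide.

```python
import numpy as np

# Margin y * F(x), with y in {-1, +1}: a large negative margin means badly misclassified.
margin = np.linspace(-4, 4, 9)

# Standard formulations (Friedman, Hastie and Tibshirani, 2000), assumed here:
logit_loss = np.log(1 + np.exp(-2 * margin))   # negative binomial log-likelihood
exp_loss = np.exp(-margin)                     # AdaBoost's exponential loss

# The exponential loss explodes for mislabeled points (margin << 0), while the
# logistic loss grows only linearly -- one reason LogitBoost copes better with
# noisy or mislabeled class labels.
for m, l_logit, l_exp in zip(margin, logit_loss, exp_loss):
    print(f"margin {m:+.1f}:  logit {l_logit:6.2f}   exp {l_exp:7.2f}")
```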

10
Methods
  • Rank and select features
    • Select the genes with the highest values of the quality measure
    • The number of preselected genes p can be determined by cross-validation
  • Train the LogitBoost classifier
    • Using decision stumps as weak learners
  • Choice of the stopping parameter
    • Leave-one-out cross-validation: find the m that maximizes the log-likelihood l(m) (a sketch follows below)
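One reading of the stopping-parameter step, as a hedged sketch: run leave-one-out cross-validation and, for each candidate iteration count m, accumulate the held-out binomial log-likelihood l(m); the chosen m maximizes it. `boost_probs` is a placeholder for any boosting routine that reports per-iteration probabilities, such as the LogitBoost sketch under the next slide.

```python
import numpy as np

def choose_m_loocv(X, y, boost_probs, max_iter=100):
    """Pick the boosting iteration m that maximizes the leave-one-out
    log-likelihood l(m).  boost_probs(X_tr, y_tr, X_te, M) must return an
    (M, n_test) array of class-1 probabilities after each iteration,
    e.g. the LogitBoost sketch under the next slide."""
    n = len(y)
    loglik = np.zeros(max_iter)
    for i in range(n):                              # leave sample i out
        train = np.arange(n) != i
        p = boost_probs(X[train], y[train], X[i:i + 1], max_iter)[:, 0]
        p = np.clip(p, 1e-10, 1 - 1e-10)
        loglik += y[i] * np.log(p) + (1 - y[i]) * np.log(1 - p)
    return int(np.argmax(loglik)) + 1               # best number of iterations m
```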

11
LogitBoost Algorithm
  • p(x) = P[Y = 1 | X = x],  p^(0)(x) = 1/2,  F^(0)(x) = 0
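A hedged sketch of binary LogitBoost with decision stumps as weak learners, using the initialization shown above (F^(0) = 0, p^(0) = 1/2) and the standard working-response updates of Friedman, Hastie and Tibshirani (2000); function and variable names are mine, not the authors'.

```python
import numpy as np

def _ls_stump(X, z, w):
    """Weak learner: a regression stump fit by weighted least squares
    (single split, constant prediction on each side)."""
    best = None
    for j in range(X.shape[1]):
        for t in np.unique(X[:, j])[:-1]:          # splitting at the max is a no-op
            left = X[:, j] <= t
            cl = np.average(z[left], weights=w[left])
            cr = np.average(z[~left], weights=w[~left])
            err = np.sum(w * (z - np.where(left, cl, cr)) ** 2)
            if best is None or err < best[0]:
                best = (err, j, t, cl, cr)
    _, j, t, cl, cr = best
    return lambda Xnew: np.where(Xnew[:, j] <= t, cl, cr)

def logitboost(X, y, X_test, M=100):
    """Binary LogitBoost with decision stumps.  y has labels 0/1.
    Returns an (M, n_test) array: P(Y = 1 | x) for each test point after
    each boosting iteration."""
    F = np.zeros(len(y))                           # F^(0)(x) = 0
    p = np.full(len(y), 0.5)                       # p^(0)(x) = 1/2
    F_test = np.zeros(len(X_test))
    out = np.empty((M, len(X_test)))
    for m in range(M):
        w = np.clip(p * (1 - p), 1e-5, None)       # working weights
        z = np.clip((y - p) / w, -4, 4)            # working response (capped)
        f = _ls_stump(X, z, w)                     # weighted least-squares stump
        F += 0.5 * f(X)                            # update the committee F(x)
        F_test += 0.5 * f(X_test)
        p = 1.0 / (1.0 + np.exp(-2.0 * F))         # p(x) = 1 / (1 + e^{-2 F(x)})
        out[m] = 1.0 / (1.0 + np.exp(-2.0 * F_test))
    return out
```

With the stopping parameter m chosen as on the previous slide, `logitboost(X_train, y_train, X_test, M=m)[-1]` gives the class-1 probabilities used for prediction; the results slides later settle on a fixed M = 100.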

12
Reducing multiclass to binary
  • Multiclass → multiple binary problems
    • Match each class against all the other classes (one-against-all)
  • Combine the binary probabilities into multiclass probabilities
    • Probability estimates for Y = j are obtained via normalization: P(Y = j | X = x) = p_j(x) / Σ_k p_k(x)
    • These estimates are plugged into the Bayes classifier, which predicts the class j with the largest P(Y = j | X = x) (a sketch follows below)
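A minimal sketch of the one-against-all combination described above, assuming the J binary class-versus-rest probabilities are already available; the normalization and the argmax rule are the standard plug-in Bayes classifier with equal costs.

```python
import numpy as np

def one_vs_all_predict(binary_probs):
    """binary_probs: (n_samples, J) array whose column j holds the
    class-j-versus-rest probability estimate p_j(x) from a binary
    classifier.  Normalize to multiclass probabilities and take the
    argmax (the plug-in Bayes classifier with equal costs)."""
    probs = binary_probs / binary_probs.sum(axis=1, keepdims=True)
    return probs.argmax(axis=1), probs

# Example with three classes and two samples:
p = np.array([[0.70, 0.20, 0.30],
              [0.10, 0.05, 0.60]])
labels, probs = one_vs_all_predict(p)
print(labels)   # [0 2]
print(probs)    # each row now sums to 1
```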
13
Sample data sets
Data preprocessing (a code sketch follows below):
  • Leukemia, NCI: thresholding (floor 100, ceiling 16,000), filtering (fold change > 5, max - min > 500), log transform, normalization
  • Colon: log transform, normalization
  • Estrogen and Nodal: thresholding, log transform, normalization
  • Lymphoma: normalization
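A hedged sketch of the preprocessing steps listed for the Leukemia and NCI data. The thresholds come from the slide; the log base and the per-sample standardization used for "normalization" are assumptions, not quoted from the paper.

```python
import numpy as np

def preprocess_leukemia(X, floor=100, ceiling=16000, min_fold=5, min_diff=500):
    """X: (genes x samples) matrix of raw intensities."""
    X = np.clip(X, floor, ceiling)                     # thresholding
    keep = (X.max(axis=1) / X.min(axis=1) > min_fold) & \
           (X.max(axis=1) - X.min(axis=1) > min_diff)  # filtering
    X = np.log10(X[keep])                              # log transform (base 10 assumed)
    X = (X - X.mean(axis=0)) / X.std(axis=0)           # normalization: mean 0, sd 1 per sample (assumed)
    return X, keep
```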
14
Results: Error Rates (1)
  • The test error using symmetric, i.e. equal, misclassification costs
15
Results: Error Rates (2)
16
Results: Number of Iterations vs. Performance
The choice of the stopping parameter for boosting is not very critical in any of the six datasets; stopping after a large but arbitrary number of 100 iterations is a reasonable strategy for microarray data.
17
Results: ROC Curves
  • The test error using asymmetric misclassification costs
  • Both boosting classifiers yield curves closer to the ideal ROC curve (red line) than the one from classification trees; boosting has an advantage for small negative rates

18
Validation of the results
19
Simulation
20
Conclusion
  • Feature preselection generally improved the
    predictive power
  • Slightly better performance of LogitBoost over
    AdaBoost
  • Reducing multiclass problems to multiple binary
    problems yielded more accurate results

21
Discussion
  • Marginal and far from significant edge of LogitBoost over AdaBoost
  • Did feature preselection really improve the performance?
    • Was it manipulated to make LogitBoost perform better?
  • Cross-validation of algorithms with published data
    • Authors have considerations other than the simple performance of the algorithms on the training datasets
    • Leave-one-out is just one way to cross-validate
  • Biological interpretation