Title: Boosting For Tumor Classification With Gene Expression Data
Slide 1: Boosting For Tumor Classification With Gene Expression Data
- Marcel Dettling and Peter Bühlmann
- Bioinformatics, Vol. 19, No. 9, 2003, pp. 1061-1069
Presented by Hojung Cho, Topics for Bioinformatics, Oct 10, 2006
Slide 2: Outline
- Background
  - Microarrays
  - Scoring algorithm
  - Decision stumps
  - Boosting (primer)
- Methods
  - Feature preselection
  - LogitBoost
  - Choice of the stopping parameter
  - Multiclass approach
- Results
  - Data preprocessing
  - Error rates
  - ROC curves
  - Validation of the results
  - Simulation
- Discussion
Slide 3: Microarray data (p >> n)
(Figure: microarray expression data; Park et al., 2001)
- Boosting of decision trees has been applied to the classification of gene expression data (Ben-Dor et al., 2000; Dudoit et al., 2002)
- AdaBoost did not yield good results compared to other classifiers
Slide 4: Objective and Strategies
- The objective
  - Improve the performance of boosting for the classification of gene expression data by modifying the algorithm
- The strategies
  - Feature preselection with a nonparametric scoring method
  - Binary LogitBoost with decision stumps
  - Multiclass approach: reduce to multiple binary classifications
Slide 5: Background: Scoring Algorithm
- Nonparametric
  - Allows data analysis without assuming an underlying distribution
- Feature preselection
  - Score each gene according to its strength for phenotype discrimination
  - Consider only genes differentially expressed across samples
Slide 6: Scoring Algorithm
- Sort the expression levels of a gene across all samples
- The group memberships of the sorted samples form a sequence of 0s and 1s
- The correspondence between expression levels and group membership is measured by how well the 0s and 1s are clustered together
- The score is defined as the smallest number of swaps of consecutive digits necessary to arrive at a perfect splitting, i.e. all 0s on one side and all 1s on the other (a small sketch follows below)
- In the slide's figure example, the score is 4
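A minimal sketch of this swap count, assuming a two-class setting where the 0/1 group memberships are taken in the order of the sorted expression values; the function name and the example sequence are illustrative, not taken from the paper.

    def swap_score(labels):
        # Smallest number of swaps of consecutive digits needed to move
        # every 1 to the right of every 0 in the 0/1 membership sequence
        # obtained after sorting one gene's expression values.
        ones_seen = 0
        swaps = 0
        for bit in labels:
            if bit == 1:
                ones_seen += 1
            else:
                swaps += ones_seen  # this 0 must jump over every 1 seen so far
        return swaps

    # Illustrative sequence (not the slide's figure): the score is 4 here,
    # out of a possible n0 * n1 = 3 * 3 = 9.
    print(swap_score([1, 0, 0, 1, 0, 1]))  # -> 4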
Slide 7: Background: Scoring Algorithm (continued)
- Allows ordering of genes according to their potential significance
- Captures to what extent a gene discriminates the response categories
- Both a near-zero score and the maximum score n0*n1 indicate a differentially expressed gene
- A quality measure derived from the score ranks the genes (see the sketch below)
- The boosting classifier is restricted to work with this subset of top-ranked genes
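The slides do not spell out the quality measure; one formulation consistent with "both a near-zero and the maximum score n0*n1 indicate a differentially expressed gene" is q = max(s, n0*n1 - s), which ranks genes by how far their score lies from the uninformative middle. A hedged sketch of feature preselection with this measure, reusing the swap_score function from the previous sketch (all names are illustrative):

    import numpy as np

    def preselect_genes(X, y, p):
        # X: NumPy array of shape (samples, genes); y: 0/1 response vector.
        # Returns the indices of the p genes with the highest quality measure.
        n0 = int(np.sum(y == 0))
        n1 = int(np.sum(y == 1))
        quality = []
        for g in range(X.shape[1]):
            order = np.argsort(X[:, g])   # sort samples by expression of gene g
            s = swap_score(y[order])      # swap score of the 0/1 membership sequence
            quality.append(max(s, n0 * n1 - s))  # near 0 or near n0*n1 -> high quality
        return np.argsort(quality)[::-1][:p]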
Slide 8: Background: Decision Stumps
(Figure: a decision stump; the root node tests one attribute and the two leaves are Label A and Label B)
- Decision trees with only a single split
  - The presence or absence of a single term as a predicate
  - Example: predict basketball player if and only if height > 2 m (a small sketch follows below)
- Base learner for the boosting procedure
- Weak learner
  - A subroutine that returns a hypothesis given some finite training set
  - Performs only slightly better than random guessing; may be enhanced when combined with the feature preselection
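A hedged sketch of a decision stump trained as a weighted weak learner, in the spirit of the "height > 2 m" example; this is a generic single-split classifier, not necessarily the exact stump variant used inside the paper's boosting procedure.

    import numpy as np

    def fit_stump(X, y, w):
        # Decision stump: one split on one feature, one label per side,
        # chosen to minimize the weighted misclassification error.
        best, best_err = None, np.inf
        for j in range(X.shape[1]):
            for t in np.unique(X[:, j]):
                left = X[:, j] <= t
                for lab_left in (0, 1):
                    pred = np.where(left, lab_left, 1 - lab_left)
                    err = np.sum(w * (pred != y))
                    if err < best_err:
                        best, best_err = (j, t, lab_left, 1 - lab_left), err
        return best

    def predict_stump(stump, X):
        j, t, lab_left, lab_right = stump
        return np.where(X[:, j] <= t, lab_left, lab_right)

    # Toy version of the slide's example: label 1 = basketball player.
    heights = np.array([[1.70], [1.90], [2.02], [2.10]])
    is_player = np.array([0, 0, 1, 1])
    stump = fit_stump(heights, is_player, np.ones(4) / 4)  # splits at about 2 m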
Slide 9: Background: Boosting
- AdaBoost does not yield very impressive results on this kind of data
- LogitBoost
  - Relies on the binomial log-likelihood as the loss function (both losses are written out below)
  - Found to have a slight edge over AdaBoost in many classification problems
  - Usually performs better on noisy data, or when there are misspecifications or inhomogeneities of the class labels in the training data (common in microarray data)
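For reference (from Friedman, Hastie and Tibshirani, 2000, with labels coded y in {-1, +1} and additive classifier F(x)), the two boosting variants minimize different losses of the margin y*F(x); the second is the negative binomial log-likelihood mentioned above:

    AdaBoost:   L(y, F(x)) = exp(-y F(x))
    LogitBoost: L(y, F(x)) = log(1 + exp(-2 y F(x)))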
Slide 10: Methods
- Rank and select features
  - Select the genes with the highest values of the quality measure
  - The number of selected genes p can be determined by cross-validation
- Train the LogitBoost classifier
  - Using decision stumps as weak learners
- Choice of the stopping parameter
  - Leave-one-out cross-validation: find the number of iterations m that maximizes the predictive log-likelihood l(m) (a sketch follows below)
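A hedged sketch of the leave-one-out choice of the stopping parameter m, assuming a hypothetical helper boost_proba(X_train, y_train, x_test, m) that returns the estimated P(Y = 1 | x_test) after m boosting iterations (for example, the LogitBoost sketch on the next slide); l(m) is the out-of-sample binomial log-likelihood.

    import numpy as np

    def choose_stopping_m(X, y, max_m, boost_proba):
        # Returns the m in 1..max_m maximizing the leave-one-out log-likelihood l(m).
        # boost_proba is a placeholder, not an API from the paper.  (Refitting for
        # every m is wasteful; in practice one boosting run per left-out sample
        # can score all values of m along the way.)
        n = len(y)
        loglik = np.zeros(max_m)
        for i in range(n):
            keep = np.arange(n) != i
            for m in range(1, max_m + 1):
                p = np.clip(boost_proba(X[keep], y[keep], X[i], m), 1e-12, 1 - 1e-12)
                loglik[m - 1] += y[i] * np.log(p) + (1 - y[i]) * np.log(1 - p)
        return int(np.argmax(loglik)) + 1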
Slide 11: LogitBoost Algorithm
- p(x) = P[Y = 1 | X = x]; initialize p^(0)(x) = 1/2 and F^(0)(x) = 0 (the update steps are sketched below)
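A hedged sketch of binary LogitBoost with regression stumps, following the generic algorithm of Friedman, Hastie and Tibshirani (2000) that the paper builds on; y is coded 0/1 and all function names are illustrative.

    import numpy as np

    def fit_regression_stump(X, z, w):
        # Weighted least-squares stump: one split, a constant on each side.
        best, best_loss = None, np.inf
        for j in range(X.shape[1]):
            for t in np.unique(X[:, j])[:-1]:
                left = X[:, j] <= t
                cl = np.average(z[left], weights=w[left])
                cr = np.average(z[~left], weights=w[~left])
                loss = (np.sum(w[left] * (z[left] - cl) ** 2)
                        + np.sum(w[~left] * (z[~left] - cr) ** 2))
                if loss < best_loss:
                    best, best_loss = (j, t, cl, cr), loss
        return best

    def stump_value(stump, X):
        j, t, cl, cr = stump
        return np.where(X[:, j] <= t, cl, cr)

    def logitboost(X, y, m_stop=100):
        n = len(y)
        F = np.zeros(n)        # F^(0)(x_i) = 0
        p = np.full(n, 0.5)    # p^(0)(x_i) = 1/2
        stumps = []
        for _ in range(m_stop):
            w = np.clip(p * (1 - p), 1e-10, None)   # working weights
            z = np.clip((y - p) / w, -4.0, 4.0)     # working response (capped for stability)
            stump = fit_regression_stump(X, z, w)   # weighted least-squares fit
            stumps.append(stump)
            F += 0.5 * stump_value(stump, X)        # update the additive model
            p = 1.0 / (1.0 + np.exp(-2.0 * F))      # update p(x) = P(Y = 1 | x)
        return stumps

    def predict_proba(stumps, X):
        F = np.zeros(X.shape[0])
        for stump in stumps:
            F += 0.5 * stump_value(stump, X)
        return 1.0 / (1.0 + np.exp(-2.0 * F))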
Slide 12: Reducing Multiclass to Binary
- Multiclass -> multiple binary problems
  - Match each class against all the other classes (one-against-all)
- Combine the binary probabilities into multiclass probabilities (see the sketch below)
  - Probability estimates for Y = j via normalization: p_j(x) = p~_j(x) / sum_k p~_k(x)
  - Plugged into the Bayes classifier: C(x) = arg max_j p_j(x)
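A hedged sketch of the one-against-all reduction and the recombination step, assuming generic fit/predict_proba callables for the binary classifier (for example, the LogitBoost sketch above); names are illustrative.

    import numpy as np

    def one_against_all_predict(X_train, y_train, X_test, n_classes, fit, predict_proba):
        raw = np.zeros((X_test.shape[0], n_classes))
        for j in range(n_classes):
            model = fit(X_train, (y_train == j).astype(int))  # class j vs. all others
            raw[:, j] = predict_proba(model, X_test)          # binary estimate of P(Y = j | x)
        probs = raw / np.maximum(raw.sum(axis=1, keepdims=True), 1e-12)  # normalization
        return probs.argmax(axis=1)                           # Bayes (argmax) classifier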
Slide 13: Sample Data Sets and Preprocessing
Data preprocessing (a sketch follows below):
- Leukemia, NCI: thresholding (floor 100, ceiling 16,000), filtering (fold change max/min > 5, max - min > 500), log transform, normalization
- Colon: log transform, normalization
- Estrogen and Nodal: thresholding, log transform, normalization
- Lymphoma: normalization
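A hedged sketch of the Leukemia/NCI-style preprocessing listed above; the thresholds follow the slide, while the log base and the exact normalization convention (here: per-gene standardization) are assumptions rather than details confirmed by the slides.

    import numpy as np

    def preprocess(X, floor=100.0, ceiling=16000.0, min_fold=5.0, min_range=500.0):
        # X: expression matrix of shape (samples, genes).
        X = np.clip(X, floor, ceiling)                      # thresholding
        col_max, col_min = X.max(axis=0), X.min(axis=0)
        keep = (col_max / col_min > min_fold) & (col_max - col_min > min_range)  # filtering
        X = np.log10(X[:, keep])                            # log transform (base 10 assumed)
        X = (X - X.mean(axis=0)) / X.std(axis=0)            # normalization (per gene, assumed)
        return X, keep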
Slide 14: Results: Error Rates (1)
- Test error using symmetric (equal) misclassification costs
Slide 15: Results: Error Rates (2)
Slide 16: Results: Number of Iterations vs. Performance
- The choice of the stopping parameter for boosting is not very critical in any of the six datasets
- Stopping after a large but arbitrary number of 100 iterations is a reasonable strategy for microarray data
Slide 17: Results: ROC Curves
- Test error using asymmetric misclassification costs (a sketch of how such curves are traced follows below)
- Both boosting classifiers yield curves closer to the ideal ROC curve (red line) than the one from classification trees; boosting has an advantage when the false negative rate is small
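A small generic sketch of how such an ROC curve can be traced from predicted probabilities by sweeping the decision threshold (equivalently, varying the misclassification-cost ratio); this is not the paper's exact procedure.

    import numpy as np

    def roc_curve(y_true, p_hat):
        # False and true positive rates as the threshold on P(Y = 1 | x)
        # is swept from 1 down to 0.
        thresholds = np.unique(np.concatenate(([0.0, 1.0], p_hat)))[::-1]
        pos, neg = np.sum(y_true == 1), np.sum(y_true == 0)
        fpr, tpr = [], []
        for t in thresholds:
            pred = (p_hat >= t).astype(int)
            tpr.append(np.sum((pred == 1) & (y_true == 1)) / pos)
            fpr.append(np.sum((pred == 1) & (y_true == 0)) / neg)
        return np.array(fpr), np.array(tpr)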
Slide 18: Validation of the Results
Slide 19: Simulation
Slide 20: Conclusion
- Feature preselection generally improved the predictive power
- LogitBoost performed slightly better than AdaBoost
- Reducing multiclass problems to multiple binary problems yielded more accurate results
Slide 21: Discussion
- The edge of LogitBoost over AdaBoost is marginal and far from significant
- Did feature preselection really improve the performance, or was the setup tuned to make LogitBoost perform better?
- Cross-validating algorithms on published data
  - Authors may have considerations other than the raw performance of the algorithms on the training datasets
  - Leave-one-out is just one way to cross-validate
- Biological interpretation