Title: Boosting For Tumor Classification With Gene Expression Data
Slide 1: Boosting For Tumor Classification With Gene Expression Data
- Marcel Dettling and Peter Bühlmann
- Bioinformatics, Vol. 19, No. 9, 2003, pp. 1061-1069
Presented by Hojung Cho, Topics for Bioinformatics, Oct 10, 2006
Slide 2: Outline
- Background
  - Microarrays
  - Scoring algorithm
  - Decision stumps
  - Boosting (primer)
- Methods
  - Feature preselection
  - LogitBoost
  - Choice of the stopping parameter
  - Multiclass approach
- Results
  - Data preprocessing
  - Error rates
  - ROC curves
  - Validation of the results
  - Simulation
- Discussion
Slide 3: Microarray data (p >> n)
(Figure: microarray expression data; Park et al., 2001)
- Boosting of decision trees has been applied to the classification of gene expression data (Ben-Dor et al., 2000; Dudoit et al., 2002)
- AdaBoost did not yield good results compared to other classifiers
Slide 4: Objective and Strategies
- The objective
  - Improve the performance of boosting for the classification of gene expression data by modifying the algorithm
- The strategies
  - Feature preselection with a nonparametric scoring method
  - Binary LogitBoost with decision stumps
  - Multiclass approach: reduce to multiple binary classifications
Slide 5: Background: Scoring Algorithm
- Nonparametric
  - Allows data analysis without assuming an underlying distribution
- Feature preselection
  - Score each gene according to its strength for phenotype discrimination
  - Consider only genes differentially expressed across samples
Slide 6: Scoring Algorithm
- Sort the expression levels of a gene across all samples
- The group memberships of the sorted samples form a sequence of 0s and 1s
- The correspondence between expression levels and group membership is measured by how well the 0s and 1s are clustered together
- The score is defined as the smallest number of swaps of consecutive digits necessary to arrive at a perfect splitting, i.e. all 0s on one side and all 1s on the other (a small sketch follows below)
- In the slide's figure example, the score is 4
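A minimal sketch of this swap count, assuming a two-class setting where the 0/1 group memberships are taken in the order of the sorted expression values; the function name and the example sequence are illustrative, not taken from the paper.

    def swap_score(labels):
        # Smallest number of swaps of consecutive digits needed to move
        # every 1 to the right of every 0 in the 0/1 membership sequence
        # obtained after sorting one gene's expression values.
        ones_seen = 0
        swaps = 0
        for bit in labels:
            if bit == 1:
                ones_seen += 1
            else:
                swaps += ones_seen  # this 0 must jump over every 1 seen so far
        return swaps

    # Illustrative sequence (not the slide's figure): the score is 4 here,
    # out of a possible n0 * n1 = 3 * 3 = 9.
    print(swap_score([1, 0, 0, 1, 0, 1]))  # -> 4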
Slide 7: Background: Scoring Algorithm (continued)
- Allows ordering of genes according to their potential significance
- Captures to what extent a gene discriminates the response categories
- Both a near-zero score and the maximum score n0*n1 indicate a differentially expressed gene
- A quality measure derived from the score ranks the genes (see the sketch below)
- The boosting classifier is restricted to work with this subset of top-ranked genes
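The slides do not spell out the quality measure; one formulation consistent with "both a near-zero and the maximum score n0*n1 indicate a differentially expressed gene" is q = max(s, n0*n1 - s), which ranks genes by how far their score lies from the uninformative middle. A hedged sketch of feature preselection with this measure, reusing the swap_score function from the previous sketch (all names are illustrative):

    import numpy as np

    def preselect_genes(X, y, p):
        # X: NumPy array of shape (samples, genes); y: 0/1 response vector.
        # Returns the indices of the p genes with the highest quality measure.
        n0 = int(np.sum(y == 0))
        n1 = int(np.sum(y == 1))
        quality = []
        for g in range(X.shape[1]):
            order = np.argsort(X[:, g])   # sort samples by expression of gene g
            s = swap_score(y[order])      # swap score of the 0/1 membership sequence
            quality.append(max(s, n0 * n1 - s))  # near 0 or near n0*n1 -> high quality
        return np.argsort(quality)[::-1][:p]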
Slide 8: Background: Decision Stumps
(Figure: a decision stump; the root node tests one attribute and the two leaves are Label A and Label B)
- Decision trees with only a single split
  - The presence or absence of a single term as a predicate
  - Example: predict basketball player if and only if height > 2 m (a small sketch follows below)
- Base learner for the boosting procedure
- Weak learner
  - A subroutine that returns a hypothesis given some finite training set
  - Performs only slightly better than random guessing; may be enhanced when combined with the feature preselection
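A hedged sketch of a decision stump trained as a weighted weak learner, in the spirit of the "height > 2 m" example; this is a generic single-split classifier, not necessarily the exact stump variant used inside the paper's boosting procedure.

    import numpy as np

    def fit_stump(X, y, w):
        # Decision stump: one split on one feature, one label per side,
        # chosen to minimize the weighted misclassification error.
        best, best_err = None, np.inf
        for j in range(X.shape[1]):
            for t in np.unique(X[:, j]):
                left = X[:, j] <= t
                for lab_left in (0, 1):
                    pred = np.where(left, lab_left, 1 - lab_left)
                    err = np.sum(w * (pred != y))
                    if err < best_err:
                        best, best_err = (j, t, lab_left, 1 - lab_left), err
        return best

    def predict_stump(stump, X):
        j, t, lab_left, lab_right = stump
        return np.where(X[:, j] <= t, lab_left, lab_right)

    # Toy version of the slide's example: label 1 = basketball player.
    heights = np.array([[1.70], [1.90], [2.02], [2.10]])
    is_player = np.array([0, 0, 1, 1])
    stump = fit_stump(heights, is_player, np.ones(4) / 4)  # splits at about 2 m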
Slide 9: Background: Boosting
- AdaBoost does not yield very impressive results on this kind of data
- LogitBoost
  - Relies on the binomial log-likelihood as the loss function (both losses are written out below)
  - Found to have a slight edge over AdaBoost in many classification problems
  - Usually performs better on noisy data, or when there are misspecifications or inhomogeneities of the class labels in the training data (common in microarray data)
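For reference (from Friedman, Hastie and Tibshirani, 2000, with labels coded y in {-1, +1} and additive classifier F(x)), the two boosting variants minimize different losses of the margin y*F(x); the second is the negative binomial log-likelihood mentioned above:

    AdaBoost:   L(y, F(x)) = exp(-y F(x))
    LogitBoost: L(y, F(x)) = log(1 + exp(-2 y F(x)))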
Slide 10: Methods
- Rank and select features
  - Select the genes with the highest values of the quality measure
  - The number of selected genes p can be determined by cross-validation
- Train the LogitBoost classifier
  - Using decision stumps as weak learners
- Choice of the stopping parameter
  - Leave-one-out cross-validation: find the number of iterations m that maximizes the predictive log-likelihood l(m) (a sketch follows below)
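A hedged sketch of the leave-one-out choice of the stopping parameter m, assuming a hypothetical helper boost_proba(X_train, y_train, x_test, m) that returns the estimated P(Y = 1 | x_test) after m boosting iterations (for example, the LogitBoost sketch on the next slide); l(m) is the out-of-sample binomial log-likelihood.

    import numpy as np

    def choose_stopping_m(X, y, max_m, boost_proba):
        # Returns the m in 1..max_m maximizing the leave-one-out log-likelihood l(m).
        # boost_proba is a placeholder, not an API from the paper.  (Refitting for
        # every m is wasteful; in practice one boosting run per left-out sample
        # can score all values of m along the way.)
        n = len(y)
        loglik = np.zeros(max_m)
        for i in range(n):
            keep = np.arange(n) != i
            for m in range(1, max_m + 1):
                p = np.clip(boost_proba(X[keep], y[keep], X[i], m), 1e-12, 1 - 1e-12)
                loglik[m - 1] += y[i] * np.log(p) + (1 - y[i]) * np.log(1 - p)
        return int(np.argmax(loglik)) + 1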
Slide 11: LogitBoost Algorithm
- p(x) = P[Y = 1 | X = x]; initialize p^(0)(x) = 1/2 and F^(0)(x) = 0 (the update steps are sketched below)
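A hedged sketch of binary LogitBoost with regression stumps, following the generic algorithm of Friedman, Hastie and Tibshirani (2000) that the paper builds on; y is coded 0/1 and all function names are illustrative.

    import numpy as np

    def fit_regression_stump(X, z, w):
        # Weighted least-squares stump: one split, a constant on each side.
        best, best_loss = None, np.inf
        for j in range(X.shape[1]):
            for t in np.unique(X[:, j])[:-1]:
                left = X[:, j] <= t
                cl = np.average(z[left], weights=w[left])
                cr = np.average(z[~left], weights=w[~left])
                loss = (np.sum(w[left] * (z[left] - cl) ** 2)
                        + np.sum(w[~left] * (z[~left] - cr) ** 2))
                if loss < best_loss:
                    best, best_loss = (j, t, cl, cr), loss
        return best

    def stump_value(stump, X):
        j, t, cl, cr = stump
        return np.where(X[:, j] <= t, cl, cr)

    def logitboost(X, y, m_stop=100):
        n = len(y)
        F = np.zeros(n)        # F^(0)(x_i) = 0
        p = np.full(n, 0.5)    # p^(0)(x_i) = 1/2
        stumps = []
        for _ in range(m_stop):
            w = np.clip(p * (1 - p), 1e-10, None)   # working weights
            z = np.clip((y - p) / w, -4.0, 4.0)     # working response (capped for stability)
            stump = fit_regression_stump(X, z, w)   # weighted least-squares fit
            stumps.append(stump)
            F += 0.5 * stump_value(stump, X)        # update the additive model
            p = 1.0 / (1.0 + np.exp(-2.0 * F))      # update p(x) = P(Y = 1 | x)
        return stumps

    def predict_proba(stumps, X):
        F = np.zeros(X.shape[0])
        for stump in stumps:
            F += 0.5 * stump_value(stump, X)
        return 1.0 / (1.0 + np.exp(-2.0 * F))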
Slide 12: Reducing Multiclass to Binary
- Multiclass -> multiple binary problems
  - Match each class against all the other classes (one-against-all)
- Combine the binary probabilities into multiclass probabilities (see the sketch below)
  - Probability estimates for Y = j via normalization: p_j(x) = p~_j(x) / sum_k p~_k(x)
  - Plugged into the Bayes classifier: C(x) = arg max_j p_j(x)
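A hedged sketch of the one-against-all reduction and the recombination step, assuming generic fit/predict_proba callables for the binary classifier (for example, the LogitBoost sketch above); names are illustrative.

    import numpy as np

    def one_against_all_predict(X_train, y_train, X_test, n_classes, fit, predict_proba):
        raw = np.zeros((X_test.shape[0], n_classes))
        for j in range(n_classes):
            model = fit(X_train, (y_train == j).astype(int))  # class j vs. all others
            raw[:, j] = predict_proba(model, X_test)          # binary estimate of P(Y = j | x)
        probs = raw / np.maximum(raw.sum(axis=1, keepdims=True), 1e-12)  # normalization
        return probs.argmax(axis=1)                           # Bayes (argmax) classifier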
Slide 13: Sample Data Sets and Preprocessing
Data preprocessing (a sketch follows below):
- Leukemia, NCI: thresholding (floor 100, ceiling 16,000), filtering (fold change max/min > 5, max - min > 500), log transform, normalization
- Colon: log transform, normalization
- Estrogen and Nodal: thresholding, log transform, normalization
- Lymphoma: normalization
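A hedged sketch of the Leukemia/NCI-style preprocessing listed above; the thresholds follow the slide, while the log base and the exact normalization convention (here: per-gene standardization) are assumptions rather than details confirmed by the slides.

    import numpy as np

    def preprocess(X, floor=100.0, ceiling=16000.0, min_fold=5.0, min_range=500.0):
        # X: expression matrix of shape (samples, genes).
        X = np.clip(X, floor, ceiling)                      # thresholding
        col_max, col_min = X.max(axis=0), X.min(axis=0)
        keep = (col_max / col_min > min_fold) & (col_max - col_min > min_range)  # filtering
        X = np.log10(X[:, keep])                            # log transform (base 10 assumed)
        X = (X - X.mean(axis=0)) / X.std(axis=0)            # normalization (per gene, assumed)
        return X, keep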
Slide 14: Results: Error Rates (1)
- Test error using symmetric (equal) misclassification costs
Slide 15: Results: Error Rates (2)
Slide 16: Results: Number of Iterations vs. Performance
- The choice of the stopping parameter for boosting is not very critical in any of the six datasets
- Stopping after a large but arbitrary number of 100 iterations is a reasonable strategy for microarray data
Slide 17: Results: ROC Curves
- Test error using asymmetric misclassification costs (a sketch of how such curves are traced follows below)
- Both boosting classifiers yield curves closer to the ideal ROC curve (red line) than the one from classification trees; boosting has an advantage when the false negative rate is small
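A small generic sketch of how such an ROC curve can be traced from predicted probabilities by sweeping the decision threshold (equivalently, varying the misclassification-cost ratio); this is not the paper's exact procedure.

    import numpy as np

    def roc_curve(y_true, p_hat):
        # False and true positive rates as the threshold on P(Y = 1 | x)
        # is swept from 1 down to 0.
        thresholds = np.unique(np.concatenate(([0.0, 1.0], p_hat)))[::-1]
        pos, neg = np.sum(y_true == 1), np.sum(y_true == 0)
        fpr, tpr = [], []
        for t in thresholds:
            pred = (p_hat >= t).astype(int)
            tpr.append(np.sum((pred == 1) & (y_true == 1)) / pos)
            fpr.append(np.sum((pred == 1) & (y_true == 0)) / neg)
        return np.array(fpr), np.array(tpr)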
Slide 18: Validation of the Results
Slide 19: Simulation
Slide 20: Conclusion
- Feature preselection generally improved the predictive power
- LogitBoost performed slightly better than AdaBoost
- Reducing multiclass problems to multiple binary problems yielded more accurate results
Slide 21: Discussion
- The edge of LogitBoost over AdaBoost is marginal and far from significant
- Did feature preselection really improve the performance, or was the setup tuned to make LogitBoost perform better?
- Cross-validating algorithms on published data
  - Authors may have considerations other than the raw performance of the algorithms on the training datasets
  - Leave-one-out is just one way to cross-validate
- Biological interpretation