Title: Across Platform and Multiple Dataset Classification
Across Platform and Multiple Dataset Molecular
Classification Using Relative Features and Large
Bayes
Across Platform and Multiple Dataset Classification

Across platform classification.- Training a classifier on a dataset
obtained with one technology platform (e.g. oligonucleotide
microarrays) and applying this model to predict samples from a test
set obtained with a different technology (e.g. cDNA microarrays).
This is useful, for example, to validate a model or to develop a
centrally trained universal model to be deployed on different, not
yet existing, datasets.

Multiple dataset classification.- Combining several datasets,
potentially representing different platforms or source material, to
build a global unified classification model. This model benefits
from an increased sample size and also from the potential richness
of the combined dataset. This is important, for example, in the
context of building a global classification model on top of a
database of expression data.
P. Tamayo, T. Golub, J. Mesirov, N. Patterson, J-P. Brunet, S. Monti.
Whitehead Institute/MIT Center for Genome Research
D. Meretakis, H. Lu and B. Wüthrich.
Technical University of Hong Kong
Introduction
The widespread use of microarrays,
the refinement of protocols and the relative
success of molecular classification have produced
a significant increase in the number of publicly
available gene expression datasets. A potential
benefit of this is the large number of samples
for analysis and more comprehensive
representation of disease phenotypes. At the same
time there is a significant technical challenge
in how to deal with the associated variability
coming from the use of different technologies,
platforms and sources of material. In this
context there are two situations of special
relevance: across-platform classification and combined
multiple-dataset classification. In this poster
we summarize the results of a new methodology
that addresses these problems based on combining
the Large Bayes classification framework with the
definition of relative pair gene features. One of
the assumptions of this approach is that
different datasets representing the same
biological system display some amount of
invariant biological characteristics
independent of the idiosyncrasies of sample
sources, preparation and the technological
platform used to obtain the measurements. These
invariant biological characteristics, when
properly captured and exposed, can provide the
basis to build more robust, general and accurate
classification models based on reproducible
biological behavior and less vulnerable to
idiosyncrasies and technological details. We
present results for an across-platform
classification of lymphoma morphology where the
training of the model is done on oligonucleotide
(Affymetrix Hu6800) and the testing on cDNA
microarrays. We also present results for a
combined 4-class adenocarcinoma dataset
incorporating 440 samples from six different
original datasets using three platforms
(oligonucleotide, cDNA and inkjet microarrays).
Despite the different technologies, sample
sources and the reduced overlapping feature sets,
the presented methodology allows for the
construction of a global Large Bayes model
attaining 94% accuracy. This demonstrates the
feasibility of building accurate classifiers
based on large combined datasets of gene
expression data and opens the way for a practical
method to build global classification models that
exploit entire databases of gene expression data.
Poster panels: Large Bayes Classification; Relative Gene Pair Features; Multiple (6) Dataset Classification (4-Class Adenocarcinoma); Across Platform Classification (Lymphoma Morphology).
Large Bayes Classification
- An improvement over the Naive Bayes classifier.
- Introduced by Dimitris Meretakis, Hongjun Lu and Beat Wüthrich in 1999-2000 (Technical Univ. of Hong Kong).
- A classifier built from labeled frequent itemsets.
- It tolerates missing features and missing values.
- It combines unsupervised and supervised approaches.
- It works with a small number of data points and a large number of dimensions, and tolerates missing values or features.
- Learning phase: use an Apriori-like method to discover frequent labeled itemsets.
- Classification phase: given a new case $A = \{a_1, a_2, \ldots, a_n\}$, estimate $P(c_i \mid A)$ for each class $c_i$ and choose the most probable class. Probabilistically combine the stored itemsets for the estimation.
  e.g. $P(a_1 a_2 a_3 a_4 a_5 c_i) \approx P(a_1 a_2 a_3 c_i)\, P(a_4 \mid a_2 c_i)\, P(a_5 \mid a_3 c_i)$
- Large Bayes reduces to Naive Bayes when only one-item itemsets are used.
  e.g. $P(a_1 a_2 a_3 a_4 a_5 c_i) \approx P(a_1 c_i)\, P(a_2 \mid c_i)\, P(a_3 \mid c_i)\, P(a_4 \mid c_i)\, P(a_5 \mid c_i)$
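The two phases can be illustrated with a minimal sketch. It is not the published Large Bayes algorithm: the function names are hypothetical, itemsets are enumerated by brute force rather than by a true Apriori search, the stored itemsets are combined greedily (longest first), and the additive smoothing constant `alpha` is an assumption added to avoid zero probabilities.

```python
from collections import Counter
from itertools import combinations


def mine_labeled_itemsets(rows, labels, max_len=2, min_support=2):
    """Learning phase (simplified): count class-wise support of every
    itemset up to max_len and keep those that are frequent overall."""
    counts = Counter()  # (itemset, label) -> number of matching rows
    for items, label in zip(rows, labels):
        for k in range(max_len + 1):  # k = 0 stores the class counts
            for combo in combinations(sorted(items), k):
                counts[(frozenset(combo), label)] += 1
    support = Counter()
    for (itemset, _), n in counts.items():
        support[itemset] += n
    frequent = {s for s, n in support.items() if n >= min_support}
    return counts, frequent


def joint_score(case, label, counts, frequent, n_rows, alpha=0.5):
    """Classification phase (simplified): product approximation of
    P(case, label) built from the stored itemsets contained in the case."""
    covered = frozenset()
    score = counts[(covered, label)] / n_rows  # P(label)
    candidates = sorted(
        (s for s in frequent if s and s <= case),
        key=lambda s: (-len(s), sorted(s)),  # assumption: longest first
    )
    for s in candidates:
        if s <= covered:
            continue  # adds no new item
        context = s & covered  # items of s that are already explained
        # P(new items | context, label), with additive smoothing
        score *= (counts[(s, label)] + alpha) / (counts[(context, label)] + 2 * alpha)
        covered |= s
    return score


def predict(case, counts, frequent, classes, n_rows):
    """Choose the class with the largest approximate joint probability."""
    case = frozenset(case)
    return max(classes, key=lambda c: joint_score(case, c, counts, frequent, n_rows))
```

With `max_len=1` only one-item itemsets are stored, so every factor has an empty context and the score is a smoothed Naive Bayes product. Items of a case that appear in no stored itemset are simply skipped, which is how this sketch tolerates missing features.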
Relative Gene Pair Features
- Define features ($F_k$) based on comparing the gene expression values of gene pairs ($f_1 > f_2$). For example:
  $F_k = 1$ if $f_1 > f_2$
  $F_k = -1$ if $f_1 < f_2$
- If this is repeated for many gene pairs we can generate a set of relative features that represent gene relationships.

Original features f_i          Relationship       Relative feature F_k
gene 1 = 900, gene 2 = 500     gene 1 > gene 2     1
gene 3 = 50,  gene 4 = 800     gene 3 < gene 4    -1
gene 5 = 300, gene 6 = 10      gene 5 > gene 6     1
gene 2 = 500, gene 3 = 50      gene 2 > gene 3     1

- These features:
  - Capture gene-to-gene relationships regardless of the precise absolute values of gene expression or the existence of other genes in the feature set.
  - Provide a simple first-level abstraction of gene relationships.
  - Do not preserve all the information contained in the original gene expression values.
  - Can be used as markers to build classification models across a diverse set of datasets.
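The feature definition can be sketched in a few lines. The function names are hypothetical, the expression values are the ones from the table above read as gene 1 = 900, gene 2 = 500, and so on, and returning 0 for a tie is an assumption, since the poster only defines the cases $f_1 > f_2$ and $f_1 < f_2$.

```python
def relative_feature(f1, f2):
    """Return 1 if f1 > f2 and -1 if f1 < f2 (0 on a tie, an assumption)."""
    if f1 > f2:
        return 1
    if f1 < f2:
        return -1
    return 0


def relative_features(expression, pairs):
    """Map one sample (gene -> expression value) to its relative features."""
    return [relative_feature(expression[g1], expression[g2]) for g1, g2 in pairs]


# Values taken from the example table.
expression = {"gene 1": 900, "gene 2": 500, "gene 3": 50,
              "gene 4": 800, "gene 5": 300, "gene 6": 10}
pairs = [("gene 1", "gene 2"), ("gene 3", "gene 4"),
         ("gene 5", "gene 6"), ("gene 2", "gene 3")]
print(relative_features(expression, pairs))  # [1, -1, 1, 1]
```

Because only the ordering within a pair matters, rescaling every value in a sample by the same positive factor leaves the features unchanged.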
Across Platform Classification (Lymphoma Morphology)
Lymphoma morphology subclasses: Large B-Cell (DLBC) vs. Follicular.
Dataset 1 (oligonucleotide): Affymetrix Hu6800, 38 samples, 7129 genes
Dataset 2 (cDNA): Stanford cDNA, 18 samples, 1635 genes
Large Bayes model with 50 combined features, itemset length 3

Type / Mode                                    Accuracy
Cross-validation on dataset 1                  0.95
Cross-validation on dataset 2                  1.0
Training on dataset 2, testing on dataset 1    0.84
Training on dataset 1, testing on dataset 2    0.83
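Training on one platform and testing on another requires features that exist on both. A minimal sketch, assuming each sample is a mapping from gene identifier to expression value, restricts the candidate pairs to genes measured on both platforms and encodes a sample as a set of items that an itemset-based classifier could consume. The function names and the expression values are hypothetical; writing a -1 feature as the reversed item is an encoding choice made here, not something the poster specifies.

```python
from itertools import combinations


def shared_gene_pairs(genes_a, genes_b):
    """All unordered pairs of genes measured on both platforms."""
    shared = sorted(set(genes_a) & set(genes_b))
    return list(combinations(shared, 2))


def to_items(sample, pairs):
    """Encode one sample as items such as 'X82240>M25753'.

    A pair whose two values are equal contributes no item.
    """
    items = set()
    for g1, g2 in pairs:
        if sample[g1] > sample[g2]:
            items.add(f"{g1}>{g2}")
        elif sample[g1] < sample[g2]:
            items.add(f"{g2}>{g1}")
    return items


# Hypothetical gene lists for the two platforms and one hypothetical sample.
oligo_genes = ["X82240", "M25753", "X52425"]
cdna_genes = ["M25753", "X82240", "M97936"]
pairs = shared_gene_pairs(oligo_genes, cdna_genes)
sample = {"X82240": 900.0, "M25753": 500.0, "X52425": 50.0}
print(to_items(sample, pairs))
```

Only the genes present in both lists contribute pairs, so the same item vocabulary is available when the model is trained on one dataset and applied to the other.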
2-Dataset Classification (Lymphoma Morphology)
Combining the lymphoma datasets 1 and 2.
Cross-validation accuracy of the Large Bayes model: 0.93 (52/56)
Relative Features and Large Bayes Methodology
The methodology combines the use of relative gene pair features with the Large Bayes classification framework.
Example of 2-Item Itemset from the combined dataset
Itemset 371, L = 2
Class supports:
  Class 0: 1 out of 28.000 (0.036), 0.018
  Class 1: 27 out of 28.000 (0.964), 0.482
Total support: 28 out of 56 (0.500)
X82240 TCL1 gene (T cell leukemia) > M25753 G2/MITOTIC-SPECIFIC CYCLIN B1
AND
X52425 IL4R Interleukin 4 receptor > M97936 SIGNAL TRANS. AND ACTIV. OF
TRANS.
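The support figures of this itemset can be reproduced with a short calculation. The sketch below assumes that the first fraction of each class is the support within that class (out of 28 samples) and that the last number is the share of all 56 samples; this reading is an inference from the numbers, and the function name is hypothetical.

```python
def itemset_supports(class_counts, class_sizes):
    """Within-class support, share of all samples, total support and
    the class distribution among samples matching the itemset."""
    n_total = sum(class_sizes.values())
    n_match = sum(class_counts.values())
    per_class = {
        c: {"within_class": n / class_sizes[c], "of_all_samples": n / n_total}
        for c, n in class_counts.items()
    }
    total_support = n_match / n_total
    class_given_itemset = {c: n / n_match for c, n in class_counts.items()}
    return per_class, total_support, class_given_itemset


per_class, total_support, class_given_itemset = itemset_supports(
    class_counts={0: 1, 1: 27},   # samples of each class matching itemset 371
    class_sizes={0: 28, 1: 28},   # samples per class in the combined dataset
)
print(round(per_class[0]["within_class"], 3), round(per_class[0]["of_all_samples"], 3))
print(round(per_class[1]["within_class"], 3), round(per_class[1]["of_all_samples"], 3))
print(total_support)
```

Under this reading, 27 of the 28 samples that match the itemset belong to class 1, which is what makes the itemset a strong marker for that class.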
[Figure: itemsets for Dataset 1 (Whitehead / Affymetrix Hu6800) and Dataset 2 (Stanford cDNA), with samples grouped as Large B-Cell and Follicular.]