Title: Cancer Classification with Data-dependent Kernels
1Cancer Classification with Data-dependent Kernels
- Anne Ya Zhang
- (with Xue-wen Chen and Huilin Xiong)
- EECS and ITTC
- University of Kansas
2Outline
- Introduction
- Data-dependent Kernel
- Results
- Conclusion
3Cancer facts
- Cancer is a group of many related diseases
- Cells continue to grow and divide and do not die when they should.
- Changes in the genes that control normal cell growth and death.
- Cancer is the second leading cause of death in the United States
- Cancer causes 1 of every 4 deaths
- NIH estimates overall costs for cancer in 2004 at $189.8 billion ($64.9 billion for direct medical cost)
- Cancer types
- Breast cancer, lung cancer, colon cancer, ...
- Death rates vary greatly by cancer type and stage at diagnosis
4Motivation
- Why do we need to classify cancers?
- The general way of treating cancer is to
- Categorize the cancers into different classes
- Use specific treatment for each of the classes
- Traditional way to classify cancers
- Morphological appearance
- Not accurate!
- Enzyme-based histochemical analyses.
- Immunophenotyping.
- Cytogenetic analysis.
- Complicated: needs highly specialized laboratories
5Motivation
- Why are traditional ways not enough?
- Some tumors in the same class have completely different clinical courses
- A more accurate classification may be needed
- Assigning new tumors to known cancer classes is not easy
- e.g. assigning an acute leukemia tumor to one of:
- AML (acute myeloid leukemia)
- ALL (acute lymphoblastic leukemia)
6DNA Microarray-based Cancer Diagnosis
- Cancer is caused by changes in the genes that control normal cell growth and death.
- Molecular diagnostics offer the promise of precise, objective, and systematic cancer classification
- These tests are not widely applied because characteristic molecular markers for most solid tumors have yet to be identified.
- Recently, microarray tumor gene expression profiles have been used for cancer diagnosis.
7Microarray
- A microarray experiment monitors the expression levels for thousands of genes simultaneously.
- Microarray techniques will lead to a more complete understanding of the molecular variations among tumors, hence to a more reliable classification.
8Microarray
- Microarray analysis allows the monitoring of the activities of thousands of genes over many different conditions.
- From a machine learning point of view: the large volume of the data requires computational aid in analyzing the expression data.
9Machine learning tasks in cancer classification
- There are three main types of machine learning problems associated with cancer classification
- The identification of new cancer classes using gene expression profiles
- The classification of cancer into known classes
- The identification of marker genes that characterize the different cancer classes
- In this presentation, we focus on the second type of problem.
10Project Goals
- To develop a more systematic machine learning approach to cancer classification using microarray gene expression profiles.
- Use an initial collection of samples belonging to the known classes of cancer to create a class predictor for new, unknown samples.
11Challenges in cancer classification
- Gene expression data are typically characterized by
- high dimensionality (i.e. a large number of genes)
- small sample size
- Curse of dimensionality!
- Methods
- Kernel techniques
- Data resampling
- Gene selection
13Data-dependent kernel model
Data dependent
Optimizing the data-dependent kernel amounts to choosing the coefficient vector
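The model formula on this slide did not survive extraction. As a hedged sketch, the form used in the cited Xiong, Swamy, and Ahmad paper scales a basic kernel k0 by a data-dependent factor: k(x, y) = q(x) q(y) k0(x, y), with q(x) = alpha_0 + sum_i alpha_i k1(x, a_i). The Gaussian choice for k0 and k1, the parameter names `gamma0` and `gamma1`, the function names, and the choice of centers a_i below are assumptions made for illustration.

```python
import numpy as np

def gaussian_kernel(X, Y, gamma):
    """Gaussian kernel matrix with entries exp(-gamma * ||x - y||^2)."""
    sq = (X ** 2).sum(axis=1)[:, None] + (Y ** 2).sum(axis=1)[None, :] - 2.0 * X @ Y.T
    return np.exp(-gamma * np.maximum(sq, 0.0))  # clip tiny negative round-off

def q_factor(X, centers, alpha, gamma1):
    """q(x) = alpha[0] + sum_i alpha[i] * k1(x, a_i), one value per row of X."""
    K1 = gaussian_kernel(X, centers, gamma1)
    return alpha[0] + K1 @ alpha[1:]

def data_dependent_kernel(X, Y, centers, alpha, gamma0, gamma1):
    """k(x, y) = q(x) * q(y) * k0(x, y), with a Gaussian basic kernel k0 (assumed)."""
    qx = q_factor(X, centers, alpha, gamma1)
    qy = q_factor(Y, centers, alpha, gamma1)
    return qx[:, None] * qy[None, :] * gaussian_kernel(X, Y, gamma0)

# Toy data: 6 samples, 20 "genes"; the first 3 samples serve as centers (assumption).
rng = np.random.default_rng(0)
X = rng.normal(size=(6, 20))
centers = X[:3]

# With alpha = (1, 0, 0, 0) the factor q is identically 1,
# so the data-dependent kernel reduces to the basic kernel.
alpha = np.array([1.0, 0.0, 0.0, 0.0])
K = data_dependent_kernel(X, X, centers, alpha, gamma0=0.01, gamma1=0.01)
```

Under this form, only the entries of `alpha` change during optimization; the basic kernel and the centers stay fixed.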
14Optimizing the kernel
- Criterion for kernel optimization
- Maximum class separability of the training data
in the kernel-induced feature space
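One common way to measure class separability in a kernel-induced feature space is the ratio of between-class to within-class scatter, J = tr(S_b) / tr(S_w), which can be computed from the kernel matrix alone. The slide does not state the exact measure, so this choice and the function name are assumptions.

```python
import numpy as np

def class_separability(K, y):
    """J = tr(S_b) / tr(S_w) in feature space, from Gram matrix K and labels y."""
    y = np.asarray(y)
    n = K.shape[0]
    # Sum over classes of (1 / n_c) * (sum of that class's diagonal block of K).
    block = 0.0
    for c in np.unique(y):
        idx = np.flatnonzero(y == c)
        block += K[np.ix_(idx, idx)].sum() / idx.size
    tr_sb = block - K.sum() / n   # sum_c n_c * ||mean_c - mean||^2
    tr_sw = np.trace(K) - block   # sum_i ||phi_i - mean_of_its_class||^2
    return tr_sb / tr_sw

# Linear-kernel check on four 1-D points in two tight, well-separated classes.
X = np.array([[0.0], [1.0], [10.0], [11.0]])
y = np.array([0, 0, 1, 1])
J = class_separability(X @ X.T, y)
```

In this toy case the between-class scatter is 100 and the within-class scatter is 1, so a larger J corresponds to tighter, better separated classes.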
15The Kernel Optimization
Solution: an eigenvector corresponding to the largest eigenvalue
In practice, the matrix N0 is usually singular
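If the criterion takes the form of a generalized Rayleigh quotient, J(alpha) = (alpha^T M0 alpha) / (alpha^T N0 alpha), its maximizer is an eigenvector for the largest eigenvalue of the pair (M0, N0), which is ill-defined when N0 is singular. One hedged workaround is to add a small ridge to N0 before solving. The ridge, its value `eps`, and the name `M0` are assumptions here; the authors may well use a different remedy, such as an iterative update.

```python
import numpy as np

def top_generalized_eigenvector(M0, N0, eps=1e-6):
    """Return (alpha, lam) with M0 @ alpha = lam * (N0 + eps*I) @ alpha, lam largest.

    M0 is assumed symmetric and N0 symmetric positive semidefinite.
    """
    n = N0.shape[0]
    N_reg = N0 + eps * np.eye(n)         # ridge makes N_reg positive definite
    L = np.linalg.cholesky(N_reg)        # N_reg = L @ L.T
    L_inv = np.linalg.inv(L)
    A = L_inv @ M0 @ L_inv.T             # equivalent symmetric standard problem
    vals, vecs = np.linalg.eigh((A + A.T) / 2.0)  # eigenvalues in ascending order
    alpha = L_inv.T @ vecs[:, -1]        # map back to the original coordinates
    return alpha / np.linalg.norm(alpha), vals[-1]

# Toy example with a singular N0.
M0 = np.array([[2.0, 0.0], [0.0, 1.0]])
N0 = np.array([[1.0, 0.0], [0.0, 0.0]])
alpha, lam = top_generalized_eigenvector(M0, N0)
```

Note that with a singular N0 the ridge solution is pulled toward the null space of N0, so the value of `eps` materially affects the result and would need care in practice.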
16Kernel optimization
[Figure: training data and test data, before and after kernel optimization]
17Distributed resampling
- Original training data
- Training data with resampling
18Gene selection
- A filter method: class separability
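A filter method scores each gene independently of any classifier and keeps the top-ranked genes. The slide does not give the scoring formula, so the sketch below assumes a per-gene ratio of between-class to within-class scatter; the function names and the small constant guarding against division by zero are also illustrative.

```python
import numpy as np

def separability_scores(X, y):
    """Per-gene between-class / within-class scatter. X is samples x genes."""
    y = np.asarray(y)
    overall_mean = X.mean(axis=0)
    between = np.zeros(X.shape[1])
    within = np.zeros(X.shape[1])
    for c in np.unique(y):
        Xc = X[y == c]
        class_mean = Xc.mean(axis=0)
        between += Xc.shape[0] * (class_mean - overall_mean) ** 2
        within += ((Xc - class_mean) ** 2).sum(axis=0)
    return between / (within + 1e-12)  # small constant avoids division by zero

def select_genes(X, y, n_genes):
    """Indices of the n_genes highest-scoring genes, best first."""
    scores = separability_scores(X, y)
    return np.argsort(scores)[::-1][:n_genes]

# Toy data: 4 samples x 3 genes; only gene 1 differs between the two classes.
X = np.array([[0.0, 0.0, 1.0],
              [1.0, 0.1, 0.0],
              [0.0, 5.0, 1.0],
              [1.0, 5.1, 0.0]])
y = np.array([0, 0, 1, 1])
top = select_genes(X, y, n_genes=1)
```

To avoid an optimistic bias, the scores would normally be computed on the training portion only and the selected indices then applied to the test portion.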
20Comparison with other methods
- k-Nearest Neighbor (kNN)
- Diagonal linear discriminant analysis (DLDA)
- Uncorrelated linear discriminant analysis (ULDA)
- Support vector machines (SVM)
21Data sets
Classification task for each data set:
- Subtypes: ALL vs. AML
- Status of estrogen receptor
- Status of lymph node
- Outcome of treatment
- Tumor vs. healthy tissue
- Subtypes: MPM vs. ADCA
- Different lymphoma cells
- Cancer vs. non-cancer
- Tumor vs. healthy tissue
22Experimental setup
- Data normalization
- Zero mean and unit variance in the gene direction
- Randomly partition the data into two disjoint subsets of equal size: training data and test data
- Repeat each experiment 100 times
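The protocol above can be sketched as follows. The sketch assumes that "gene direction" means each gene (column) is standardized across samples, that normalization is applied to the whole data set before splitting, and that `evaluate` is a hypothetical callback returning test accuracy for one split; none of these details is confirmed by the slide.

```python
import numpy as np

def normalize_genes(X):
    """Scale each gene (column) to zero mean and unit variance across samples."""
    mean = X.mean(axis=0)
    std = X.std(axis=0)
    std[std == 0] = 1.0  # leave constant genes at zero instead of dividing by zero
    return (X - mean) / std

def random_half_split(n_samples, rng):
    """Disjoint training/test index sets; equal size when n_samples is even."""
    perm = rng.permutation(n_samples)
    half = n_samples // 2
    return perm[:half], perm[half:]

def repeated_evaluation(X, y, evaluate, n_repeats=100, seed=0):
    """Mean and standard deviation of accuracy over repeated random splits."""
    rng = np.random.default_rng(seed)
    y = np.asarray(y)
    Xn = normalize_genes(X)
    accuracies = []
    for _ in range(n_repeats):
        train, test = random_half_split(len(y), rng)
        accuracies.append(evaluate(Xn[train], y[train], Xn[test], y[test]))
    return float(np.mean(accuracies)), float(np.std(accuracies))
```

Reporting the spread alongside the mean is useful here because, with so few samples, accuracy can vary considerably from one split to the next.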
23Parameters
- DLDA: no parameter
- kNN: Euclidean distance, K = 3
- ULDA: K = 3
- SVM: Gaussian kernel; use leave-one-out on the training data to tune parameters
- KerNN: Gaussian kernel for basic kernel k0; ?0 and s are empirically set. Use leave-one-out on the training data to tune the remaining parameters. kNN for classification
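Leave-one-out tuning on the training data can be sketched as below for a plain kNN classifier with Euclidean distance. The candidate grid and the function names are hypothetical; the same loop would apply to kernel parameters by swapping in a different predictor.

```python
import numpy as np

def knn_predict(X_train, y_train, x, k):
    """Majority label among the k training points nearest to x (Euclidean)."""
    dist = np.linalg.norm(X_train - x, axis=1)
    nearest = np.argsort(dist)[:k]
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]

def loo_accuracy(X, y, k):
    """Hold out each training sample once and predict it from the rest."""
    n = len(y)
    hits = 0
    for i in range(n):
        keep = np.arange(n) != i
        hits += int(knn_predict(X[keep], y[keep], X[i], k) == y[i])
    return hits / n

def tune_k(X, y, candidates=(1, 3, 5)):
    """Candidate with the best leave-one-out accuracy (first one wins ties)."""
    return max(candidates, key=lambda k: loo_accuracy(X, y, k))

# Toy training set: two well-separated 1-D clusters of three samples each.
X = np.array([[0.0], [0.1], [0.2], [5.0], [5.1], [5.2]])
y = np.array([0, 0, 0, 1, 1, 1])
best_k = tune_k(X, y)
```

Because only the training half is used for tuning, the test half stays untouched until the final evaluation.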
24Effect of data resampling
Prostate: 102 samples
Lung: 181 samples
25Effect of gene selection
ALL-AML
26Effect of gene selection
Colon
27Effect of gene selection
Prostate
28Comparison results
BreastER
ALL-AML
BreastLN
Colon
29Comparison results
CNS
lung
Prostate
Ovarian
31Conclusion
- By maximizing the class separability of training data, the data-dependent kernel is also able to increase the separability of test data.
- The kernel method is robust to high-dimensional microarray data
- The distributed resampling strategy helps to alleviate the problem of overfitting
32Conclusion
- The classifier assigns samples more accurately than other approaches, so treatment can be better matched to each class.
- The method can be used for clarifying unusual cases
- e.g. a patient who was diagnosed with AML but with atypical morphology.
- The method can be applied to distinctions relating to future clinical outcomes.
33Future work
- How to estimate the parameters
- Study the genes selected
34Reference
- H. Xiong, M.N.S. Swamy, and M.O. Ahmad. Optimizing the data-dependent kernel in the empirical feature space. IEEE Trans. on Neural Networks 2005, 16:460-474.
- H. Xiong, Y. Zhang, and X. Chen. Data-dependent Kernels for Cancer Classification. Under review.
- A. Ben-Dor, L. Bruhn, N. Friedman, I. Nachman, M. Schummer, and Z. Yakhini. Tissue classification with gene expression profiles. J. Computational Biology 2000, 7:559-584.
- S. Dudoit, J. Fridlyand, and T.P. Speed. Comparison of discrimination methods for the classification of tumors using gene expression data. J. Am. Statistical Assoc. 2002, 97:77-87.
- T.S. Furey, N. Cristianini, N. Duffy, D.W. Bednarski, M. Schummer, and D. Haussler. Support vector machine classification and validation of cancer tissue samples using microarray expression data. Bioinformatics 2000, 16:906-914.
- J. Ye, T. Li, T. Xiong, and R. Janardan. Using uncorrelated discriminant analysis for tissue classification with gene expression data. IEEE/ACM Trans. on Computational Biology and Bioinformatics 2004, 1:181-190.
35Thanks! Questions?