Cancer Classification with Data-dependent Kernels

10/23/09. DIMACS Workshop on Machine Learning Techniques in Bioinformatics.

Learn more at: http://archive.dimacs.rutgers.edu


1
Cancer Classification with Data-dependent Kernels

Anne Ya Zhang
(with Xue-wen Chen and Huilin Xiong)
EECS and ITTC
University of Kansas

2
Outline

Introduction
Data-dependent Kernel
Results
Conclusion

3
Cancer facts

Cancer is a group of many related diseases
Cells continue to grow and divide and do not die
when they should.
Caused by changes in the genes that control
normal cell growth and death.
Cancer is the second leading cause of death in
the United States
Cancer causes 1 of every 4 deaths
NIH estimates the overall cost of cancer in 2004
at $189.8 billion ($64.9 billion for direct
medical costs)
Cancer types
Breast cancer, lung cancer, colon cancer, ...
Death rates vary greatly by cancer type and stage
at diagnosis

4
Motivation

Why do we need to classify cancers?
The general way of treating cancer is to:
Categorize the cancers into different classes
Use a specific treatment for each class
Traditional way to classify cancers
Morphological appearance
Not accurate!
Enzyme-based histochemical analyses.
Immunophenotyping.
Cytogenetic analysis.
Complicated: these need highly specialized
laboratories

5
Motivation

Why are traditional ways not enough?
Some tumors in the same class have completely
different clinical courses
A more accurate classification may be needed
Assigning new tumors to known cancer classes is
not easy
e.g. assigning an acute leukemia tumor to one of
AML (acute myeloid leukemia)
ALL (acute lymphoblastic leukemia)

6
DNA Microarray-based Cancer Diagnosis

Cancer is caused by changes in the genes that
control normal cell growth and death.
Molecular diagnostics offer the promise of
precise, objective, and systematic cancer
classification
These tests are not widely applied because
characteristic molecular markers for most solid
tumors have yet to be identified.
Recently, microarray tumor gene expression
profiles have been used for cancer diagnosis.

7
Microarray

A microarray experiment monitors the expression
levels for thousands of genes simultaneously.
Microarray techniques will lead to a more
complete understanding of the molecular
variations among tumors, hence to a more reliable
classification.

8
Microarray

Microarray analysis allows the monitoring of the
activities of thousands of genes over many
different conditions.
From a machine learning point of view

The large volume of data requires computational
aid in analyzing the expression data.

9
Machine learning tasks in cancer classification

There are three main types of machine learning
problems associated with cancer classification:
The identification of new cancer classes using
gene expression profiles
The classification of cancer into known classes
The identification of marker genes that
characterize the different cancer classes
In this presentation, we focus on the second type
of problem.

10
Project Goals

To develop a more systematic machine learning
approach to cancer classification using
microarray gene expression profiles.
Use an initial collection of samples belonging to
known classes of cancer to create a class
predictor for new, unknown samples.

11
Challenges in cancer classification

Gene expression data are typically characterized
by
high dimensionality (i.e. a large number of
genes)
small sample size
Curse of dimensionality!

Methods
Kernel techniques
Data resampling
Gene selection

12
Outline

Introduction
Data-dependent Kernel
Results
Conclusion

13
Data-dependent kernel model
Data dependent
Optimizing the data-dependent kernel means
choosing the coefficient vector
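The equations on this slide did not survive transcription. As a minimal sketch, assuming the conformal form used in the first reference (Xiong, Swamy, and Ahmad 2005), the data-dependent kernel can be written k(x, y) = q(x) q(y) k0(x, y), with q(x) = alpha_0 + sum_i alpha_i k1(x, a_i), where the a_i are a few anchor points taken from the training data. The function names, the Gaussian choice for both k0 and k1, and the parameters gamma0 and gamma1 are illustrative assumptions, not taken from the slides.

```python
import numpy as np

def gaussian_kernel(X, Y, gamma):
    """Gaussian kernel matrix exp(-gamma * ||x - y||^2) for all row pairs."""
    sq = (X**2).sum(1)[:, None] + (Y**2).sum(1)[None, :] - 2.0 * X @ Y.T
    return np.exp(-gamma * np.maximum(sq, 0.0))  # clip tiny negative round-off

def factor_q(X, anchors, alpha, gamma1):
    """q(x) = alpha[0] + sum_i alpha[i] * k1(x, a_i), one value per row of X."""
    ones = np.ones((X.shape[0], 1))
    K1 = np.hstack([ones, gaussian_kernel(X, anchors, gamma1)])
    return K1 @ alpha

def data_dependent_kernel(X, Y, anchors, alpha, gamma0, gamma1):
    """Assumed model: k(x, y) = q(x) * k0(x, y) * q(y)."""
    qx = factor_q(X, anchors, alpha, gamma1)
    qy = factor_q(Y, anchors, alpha, gamma1)
    return qx[:, None] * gaussian_kernel(X, Y, gamma0) * qy[None, :]
```

With alpha = (1, 0, ..., 0) the factor q is identically 1 and the model reduces to the basic kernel k0, so the coefficient vector is the only thing that makes the kernel depend on the data.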
14
Optimizing the kernel

Criterion for kernel optimization
Maximum class separability of the training data
in the kernel-induced feature space
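The slide does not show how class separability is measured. One common choice, assumed here rather than confirmed by the slides, is the ratio of the traces of the between-class and within-class scatter matrices in feature space, which can be computed from the kernel matrix alone. The function name `class_separability` is hypothetical.

```python
import numpy as np

def class_separability(K, y):
    """J = tr(S_b) / tr(S_w) in the feature space induced by kernel matrix K.

    K is the n x n kernel matrix of the training samples, y their labels.
    Uses n_c * ||m_c||^2 = sum(K[c, c]) / n_c for each class mean m_c.
    """
    n = len(y)
    class_term = 0.0
    for c in np.unique(y):
        idx = np.flatnonzero(y == c)
        class_term += K[np.ix_(idx, idx)].sum() / len(idx)
    tr_sb = class_term - K.sum() / n      # between-class scatter trace
    tr_sw = np.trace(K) - class_term      # within-class scatter trace
    return tr_sb / tr_sw
```

A larger J means the class means are far apart relative to the spread of each class, which is the quantity a kernel optimization of this kind would try to increase.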

15
The Kernel Optimization
In practice, the matrix N0 is usually singular
α: the eigenvector corresponding to the largest
eigenvalue
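The formulas for this slide are missing from the transcript. A hedged sketch, assuming the criterion has the Rayleigh-quotient form J(alpha) = (alpha^T M0 alpha) / (alpha^T N0 alpha) with symmetric M0 and positive semidefinite N0, as in the first reference: when N0 is singular, one common workaround is to add a small ridge term mu * I before taking the leading generalized eigenvector. The ridge term, the name `optimal_alpha`, and the parameter `mu` are assumptions and are not confirmed by the slides.

```python
import numpy as np

def optimal_alpha(M0, N0, mu=1e-3):
    """Return (alpha, J) maximizing a^T M0 a / a^T (N0 + mu*I) a."""
    n = N0.shape[0]
    N_reg = N0 + mu * np.eye(n)           # assumed fix for a singular N0
    L = np.linalg.cholesky(N_reg)         # N_reg = L @ L.T
    A = np.linalg.solve(L, M0)            # L^{-1} M0
    C = np.linalg.solve(L, A.T)           # L^{-1} M0 L^{-T}, symmetric
    C = 0.5 * (C + C.T)                   # remove round-off asymmetry
    eigvals, eigvecs = np.linalg.eigh(C)  # eigenvalues in ascending order
    v = eigvecs[:, -1]                    # eigenvector of the largest eigenvalue
    alpha = np.linalg.solve(L.T, v)       # map back: alpha = L^{-T} v
    return alpha, float(eigvals[-1])
```

The returned J is the largest generalized eigenvalue, which equals the value of the regularized quotient at alpha.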
16
Kernel optimization
(Figure panels: training data and test data,
before and after kernel optimization)

17
Distributed resampling

(Figure panels: original training data; training
data with resampling)

18
Gene selection

A filter method: class separability
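The slide gives no formula for the filter. As an illustration only, a filter of this kind can score each gene independently by the ratio of its between-class to within-class scatter and keep the top-ranked genes. The exact criterion used in the talk is not recoverable from the transcript, so the Fisher-like score and the names `separability_scores` and `select_genes` are assumptions.

```python
import numpy as np

def separability_scores(X, y):
    """Per-gene ratio of between-class to within-class scatter.

    X has one row per sample and one column per gene.
    """
    overall = X.mean(axis=0)
    between = np.zeros(X.shape[1])
    within = np.zeros(X.shape[1])
    for c in np.unique(y):
        Xc = X[y == c]
        between += len(Xc) * (Xc.mean(axis=0) - overall) ** 2
        within += ((Xc - Xc.mean(axis=0)) ** 2).sum(axis=0)
    return between / (within + 1e-12)  # small constant avoids division by zero

def select_genes(X, y, m):
    """Column indices of the m highest-scoring genes, best first."""
    return np.argsort(separability_scores(X, y))[::-1][:m]
```

Because each gene is scored on its own, the cost grows linearly with the number of genes, which suits data with thousands of genes and few samples.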

19
Outline

Introduction
Data-dependent Kernel
Results
Conclusion

20
Comparison with other methods

k-Nearest Neighbor (kNN)
Diagonal linear discriminant analysis (DLDA)
Uncorrelated linear discriminant analysis (ULDA)
Support vector machines (SVM)

21
Data sets
Subtypes: ALL vs. AML
Estrogen receptor status
Lymph node status
Outcome of treatment
Tumor vs. healthy tissue
Subtypes: MPM vs. ADCA
Different lymphoma cells
Cancer vs. non-cancer
Tumor vs. healthy tissue

22
Experimental setup

Data normalization
Zero mean and unit variance in the gene
direction
Randomly partition the data into two disjoint
subsets of equal size: training data and test data
Repeat each experiment 100 times
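The protocol above can be sketched as follows. This is an assumption-laden outline: it reads the normalization as applying to each gene across samples (axis=0), and the other reading, each sample across its genes, corresponds to axis=1. The helper names and the `evaluate` callback, which stands in for any classifier and returns an error rate, are hypothetical.

```python
import numpy as np

def standardize(X, axis=0):
    """Zero mean and unit variance along the chosen axis (assumed: per gene)."""
    mean = X.mean(axis=axis, keepdims=True)
    std = X.std(axis=axis, keepdims=True)
    std = np.where(std == 0, 1.0, std)  # leave constant features unscaled
    return (X - mean) / std

def random_half_split(n_samples, rng):
    """Two disjoint index sets; sizes differ by one when n_samples is odd."""
    perm = rng.permutation(n_samples)
    half = n_samples // 2
    return perm[:half], perm[half:]

def repeated_evaluation(X, y, evaluate, n_repeats=100, seed=0):
    """Mean and standard deviation of the error over random half splits."""
    rng = np.random.default_rng(seed)
    Xn = standardize(X)
    errors = []
    for _ in range(n_repeats):
        train, test = random_half_split(len(y), rng)
        errors.append(evaluate(Xn[train], y[train], Xn[test], y[test]))
    return float(np.mean(errors)), float(np.std(errors))
```

Reporting the spread over the 100 repetitions alongside the mean matters here, since a single split of a small sample can be unrepresentative.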

23
Parameters

DLDA: no parameters
kNN: Euclidean distance, K = 3
ULDA: K = 3
SVM: Gaussian kernel; leave-one-out on the
training data is used to tune the parameters
KerNN: Gaussian kernel for the basic kernel k0; ?0
and σ are empirically set. Leave-one-out on the
training data is used to tune the remaining
parameters. kNN is used for classification
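The slides do not say how kNN is combined with the kernel in KerNN. One possibility, assumed here, is to run kNN on feature-space distances, which follow from kernel values through d^2(x, z) = k(x, x) - 2 k(x, z) + k(z, z). The function name and argument layout are hypothetical.

```python
import numpy as np

def kernel_knn_predict(K_test_train, k_test_diag, k_train_diag, y_train, k=3):
    """Majority vote among the k nearest training samples in feature space.

    K_test_train[i, j] = k(test_i, train_j); the *_diag vectors hold k(x, x).
    """
    d2 = k_test_diag[:, None] - 2.0 * K_test_train + k_train_diag[None, :]
    nearest = np.argsort(d2, axis=1)[:, :k]
    preds = []
    for row in nearest:
        labels, counts = np.unique(y_train[row], return_counts=True)
        preds.append(labels[np.argmax(counts)])  # ties go to the smallest label
    return np.array(preds)
```

With a linear kernel these distances are ordinary squared Euclidean distances, so the same routine covers the plain kNN baseline with K = 3.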

24
Effect of data resampling
(Figure panels: Prostate, 102 samples; Lung, 181
samples)

25
Effect of gene selection
(Figure: ALL-AML)

26
Effect of gene selection
(Figure: Colon)

27
Effect of gene selection
(Figure: Prostate)

28
Comparison results
(Figure panels: BreastER, ALL-AML, BreastLN,
Colon)

29
Comparison results
(Figure panels: CNS, Lung, Prostate, Ovarian)

30
Outline

Introduction
Data-dependent Kernel
Results
Conclusion

31
Conclusion

By maximizing the class separability of training
data, the data-dependent kernel is also able to
increase the separability of test data.
The kernel method is robust to high-dimensional
microarray data
The distributed resampling strategy helps to
alleviate the problem of overfitting

32
Conclusion

The classifier assigns samples more accurately
than other approaches, so treatment can be better
matched to each class.
The method can be used for clarifying unusual
cases
e.g. a patient who was diagnosed with AML but
has atypical morphology.
The method can be applied to distinctions
relating to future clinical outcomes.

33
Future work

How to estimate the parameters
Study the genes selected

34
References

H. Xiong, M.N.S. Swamy, and M.O. Ahmad.
Optimizing the data-dependent kernel in the
empirical feature space. IEEE Trans. on Neural
Networks 2005, 16:460-474.
H. Xiong, Y. Zhang, and X. Chen. Data-dependent
Kernels for Cancer Classification. Under review.
A. Ben-Dor, L. Bruhn, N. Friedman, I. Nachman, M.
Schummer, and Z. Yakhini. Tissue classification
with gene expression profiles. J. Computational
Biology 2000, 7:559-584.
S. Dudoit, J. Fridlyand, and T.P. Speed.
Comparison of discrimination methods for the
classification of tumors using gene expression
data. J. Am. Statistical Assoc. 2002, 97:77-87.
T.S. Furey, N. Cristianini, N. Duffy, D.W.
Bednarski, M. Schummer, and D. Haussler. Support
vector machine classification and validation of
cancer tissue samples using microarray expression
data. Bioinformatics 2000, 16:906-914.
J. Ye, T. Li, T. Xiong, and R. Janardan. Using
uncorrelated discriminant analysis for tissue
classification with gene expression data.
IEEE/ACM Trans. on Computational Biology and
Bioinformatics 2004, 1:181-190.