10/23/09. DIMACS Workshop on Machine Learning Techniques in Bioinformatics.

1
Cancer Classification with Data-dependent Kernels
  • Anne Ya Zhang
  • (with Xue-wen Chen and Huilin Xiong)
  • EECS & ITTC
  • University of Kansas

2
Outline
  • Introduction
  • Data-dependent Kernel
  • Results
  • Conclusion

3
Cancer facts
  • Cancer is a group of many related diseases
  • Cells continue to grow and divide and do not die
    when they should.
  • Changes in the genes that control normal cell
    growth and death.
  • Cancer is the second leading cause of death in
    the United States
  • Cancer causes 1 of every 4 deaths
  • NIH estimates the overall cost of cancer in 2004
    at $189.8 billion ($64.9 billion for direct
    medical costs)
  • Cancer types include breast cancer, lung cancer,
    and colon cancer
  • Death rates vary greatly by cancer type and stage
    at diagnosis

4
Motivation
  • Why do we need to classify cancers?
  • The general way of treating cancer is to:
  • Categorize the cancers into different classes
  • Use a specific treatment for each class
  • Traditional ways to classify cancers:
  • Morphological appearance
  • Not accurate!
  • Enzyme-based histochemical analyses
  • Immunophenotyping
  • Cytogenetic analysis
  • Complicated, and these need highly specialized
    laboratories

5
Motivation
  • Why are the traditional ways not enough?
  • There exist tumors in the same class with
    completely different clinical courses
  • A more accurate classification may be needed
  • Assigning new tumors to known cancer classes is
    not easy
  • e.g., assigning an acute leukemia tumor to one
    of:
  • AML (acute myeloid leukemia)
  • ALL (acute lymphoblastic leukemia)

6
DNA Microarray-based Cancer Diagnosis
  • Cancer is caused by changes in the genes that
    control normal cell growth and death.
  • Molecular diagnostics offer the promise of
    precise, objective, and systematic cancer
    classification
  • These tests are not widely applied because
    characteristic molecular markers for most solid
    tumors have yet to be identified.
  • Recently, microarray tumor gene expression
    profiles have been used for cancer diagnosis.

7
Microarray
  • A microarray experiment monitors the expression
    levels for thousands of genes simultaneously.
  • Microarray techniques will lead to a more
    complete understanding of the molecular
    variations among tumors, hence to a more reliable
    classification.

8
Microarray
  • Microarray analysis allows the monitoring of the
    activities of thousands of genes over many
    different conditions.
  • From a machine learning point of view, the large
    volume of data requires computational aid in
    analyzing the expression data.
9
Machine learning tasks in cancer classification
  • There are three main types of machine learning
    problems associated with cancer classification
  • The identification of new cancer classes using
    gene expression profiles
  • The classification of cancer into known classes
  • The identification of marker genes that
    characterize the different cancer classes
  • In this presentation, we focus on the second type
    of problem.

10
Project Goals
  • To develop a more systematic machine learning
    approach to cancer classification using
    microarray gene expression profiles.
  • Use an initial collection of samples belonging to
    the known classes of cancer to create a class
    predictor for new, unknown, samples.

11
Challenges in cancer classification
  • Gene expression data are typically characterized
    by
  • high dimensionality (i.e. a large number of
    genes)
  • small sample size
  • Curse of dimensionality!
  • Methods
  • Kernel techniques
  • Data resampling
  • Gene selection

12
Outline
  • Introduction
  • Data-dependent Kernel
  • Results
  • Conclusion

13
Data-dependent kernel model
  • Optimizing the data-dependent kernel amounts to
    choosing the coefficient vector
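
The formulas on this slide did not survive extraction. As a minimal sketch, assuming the model has the form used in the Xiong, Swamy, and Ahmad (2005) paper listed in the references, the data-dependent kernel rescales a basic kernel k0 by a factor function q: k(x, y) = q(x) q(y) k0(x, y), with q(x) = alpha[0] + sum_i alpha[i] k1(x, a_i). The Gaussian form of k0 and k1, the parameter names `gamma0` and `gamma1`, and the choice of the points `cores` are assumptions made here for illustration.

```python
import numpy as np

def gaussian_kernel(X, Y, gamma):
    """Gaussian kernel matrix exp(-gamma * ||x - y||^2) for rows of X and Y."""
    sq = (X**2).sum(1)[:, None] + (Y**2).sum(1)[None, :] - 2.0 * X @ Y.T
    return np.exp(-gamma * np.maximum(sq, 0.0))

def factor_q(X, cores, alpha, gamma1):
    """q(x) = alpha[0] + sum_i alpha[i] * k1(x, a_i), one value per row of X."""
    K1 = gaussian_kernel(X, cores, gamma1)
    return alpha[0] + K1 @ alpha[1:]

def data_dependent_kernel(X, Y, cores, alpha, gamma0=0.1, gamma1=0.1):
    """k(x, y) = q(x) * q(y) * k0(x, y); gamma0/gamma1 are assumed names."""
    qx = factor_q(X, cores, alpha, gamma1)
    qy = factor_q(Y, cores, alpha, gamma1)
    return qx[:, None] * gaussian_kernel(X, Y, gamma0) * qy[None, :]
```

With alpha = (1, 0, ..., 0) the factor q is identically 1 and the kernel reduces to the basic kernel k0, so the coefficient vector is the only data-dependent part.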
14
Optimizing the kernel
  • Criterion for kernel optimization
  • Maximum class separability of the training data
    in the kernel-induced feature space
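
One way to make the criterion concrete is the ratio of between-class to within-class scatter, computed directly from a kernel matrix. The sketch below is an illustration under that assumption; the slide's own formula is missing, so the exact measure the authors used may differ. It relies only on the identity that scatter traces in the feature space can be written as sums of kernel entries.

```python
import numpy as np

def kernel_class_separability(K, y):
    """J = tr(Sb) / tr(Sw) in the feature space induced by kernel matrix K."""
    y = np.asarray(y)
    n = len(y)
    # Total scatter: sum_i ||phi_i - mean||^2
    total = np.trace(K) - K.sum() / n
    # Within-class scatter: sum over classes of ||phi_i - class mean||^2
    within = np.trace(K)
    for c in np.unique(y):
        idx = np.where(y == c)[0]
        within -= K[np.ix_(idx, idx)].sum() / len(idx)
    between = total - within  # tr(St) = tr(Sw) + tr(Sb)
    return between / within
```

For a linear kernel K = X Xᵀ this reduces to the usual scatter ratio in the input space, which gives a simple way to sanity-check the function.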

15
The Kernel Optimization
  • In reality, the matrix N0 is usually singular
  • An eigenvector corresponding to the largest
    eigenvalue
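
The slide's equations are missing, so the following is only a sketch of one common workaround for a singular matrix in a ratio criterion. It assumes the criterion has the form (aᵀ M a) / (aᵀ N0 a), where `M` is a hypothetical name for the numerator matrix, and it adds a small ridge `eps` to N0 before solving. Whether the authors used this particular regularizer is an assumption.

```python
import numpy as np

def top_generalized_eigvec(M, N, eps=1e-6):
    """Maximize (a^T M a) / (a^T (N + eps*I) a) for symmetric M, PSD N."""
    N_reg = N + eps * np.eye(N.shape[0])   # ridge makes N_reg positive definite
    L = np.linalg.cholesky(N_reg)          # N_reg = L @ L.T
    L_inv = np.linalg.inv(L)
    S = L_inv @ M @ L_inv.T                # equivalent symmetric problem
    vals, vecs = np.linalg.eigh(S)         # eigenvalues in ascending order
    a = L_inv.T @ vecs[:, -1]              # map back: M a = lambda * N_reg a
    return a / np.linalg.norm(a), vals[-1]
```

The Cholesky step keeps the problem symmetric, so the eigenvalues stay real even when N0 itself has no inverse.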
16
Kernel optimization
[Figure: training data and test data, before and after kernel optimization]
17
Distributed resampling
  • Original training data
  • Training data with resampling

18
Gene selection
  • A filter method based on class separability
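
As an illustration of a filter method, the sketch below scores each gene on its own by the ratio of between-class to within-class scatter and keeps the top-scoring genes. The slide does not give the exact score, so this particular ratio and the function names are assumptions.

```python
import numpy as np

def gene_separability_scores(X, y):
    """Per-gene ratio of between-class to within-class scatter.

    X has one row per sample and one column per gene.
    """
    y = np.asarray(y)
    overall = X.mean(axis=0)
    between = np.zeros(X.shape[1])
    within = np.zeros(X.shape[1])
    for c in np.unique(y):
        Xc = X[y == c]
        between += len(Xc) * (Xc.mean(axis=0) - overall) ** 2
        within += ((Xc - Xc.mean(axis=0)) ** 2).sum(axis=0)
    return between / (within + 1e-12)  # small constant avoids division by zero

def select_top_genes(X, y, n_genes):
    """Indices of the n_genes highest-scoring genes, best first."""
    scores = gene_separability_scores(X, y)
    return np.argsort(scores)[::-1][:n_genes]
```

Because each gene is scored independently of the classifier, the selection can be computed once on the training data and reused for every method being compared.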

19
Outline
  • Introduction
  • Data-dependent Kernel
  • Results
  • Conclusion

20
Comparison with other methods
  • k-Nearest Neighbor (kNN)
  • Diagonal linear discriminant analysis (DLDA)
  • Uncorrelated Linear Discriminant analysis (ULDA)
  • Support vector machines (SVM)

21
Data sets
  • Subtypes: ALL vs. AML
  • Status of estrogen receptor
  • Status of lymph nodes
  • Outcome of treatment
  • Tumor vs. healthy tissue
  • Subtypes: MPM vs. ADCA
  • Different lymphoma cells
  • Cancer vs. non-cancer
  • Tumor vs. healthy tissue
22
Experimental setup
  • Data normalization
  • Zero mean and unit variance in the gene
    direction
  • Randomly partition the data into two disjoint
    subsets of equal size: training data and test data
  • Repeat each experiment 100 times
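
The protocol above can be sketched as follows. Two points are assumptions: "in the gene direction" is read here as standardizing each sample across its genes (`axis=1`), although it could also mean standardizing each gene across samples (`axis=0`); and `fit_predict` is a hypothetical callback standing in for any of the classifiers being compared.

```python
import numpy as np

def standardize(X, axis=1):
    """Zero mean, unit variance along `axis` (rows = samples, columns = genes)."""
    mean = X.mean(axis=axis, keepdims=True)
    std = X.std(axis=axis, keepdims=True)
    return (X - mean) / np.where(std > 0, std, 1.0)

def random_half_split(n_samples, rng):
    """Two disjoint index sets; sizes differ by one when n_samples is odd."""
    perm = rng.permutation(n_samples)
    half = n_samples // 2
    return perm[:half], perm[half:]

def repeated_evaluation(X, y, fit_predict, n_repeats=100, seed=0):
    """Mean and std of test accuracy over repeated random half splits."""
    rng = np.random.default_rng(seed)
    X = standardize(X)
    y = np.asarray(y)
    accuracies = []
    for _ in range(n_repeats):
        train, test = random_half_split(len(y), rng)
        predictions = fit_predict(X[train], y[train], X[test])
        accuracies.append(np.mean(predictions == y[test]))
    return float(np.mean(accuracies)), float(np.std(accuracies))
```

Reporting the spread over the repeats, not only the mean, matters with small sample sizes, because a single split can be unrepresentative.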

23
Parameters
  • DLDA: no parameters
  • kNN: Euclidean distance, K = 3
  • ULDA: K = 3
  • SVM: Gaussian kernel; leave-one-out on the
    training data is used to tune the parameters
  • KerNN: Gaussian kernel for the basic kernel k0;
    ?0 and s are empirically set. Leave-one-out on
    the training data is used to tune the remaining
    parameters. kNN is used for classification
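
For reference, the kNN baseline with Euclidean distance and K = 3 can be sketched as below. This is a plain illustration, not the authors' code; ties in the vote are broken in favor of the smallest label, which is a choice made here.

```python
import numpy as np

def knn_predict(X_train, y_train, X_test, k=3):
    """Majority vote among the k nearest training samples (Euclidean distance)."""
    y_train = np.asarray(y_train)
    # Squared distances via ||a||^2 + ||b||^2 - 2 a.b, avoiding a 3-D array
    d2 = ((X_test**2).sum(1)[:, None]
          + (X_train**2).sum(1)[None, :]
          - 2.0 * X_test @ X_train.T)
    nearest = np.argsort(np.maximum(d2, 0.0), axis=1)[:, :k]
    predictions = []
    for row in nearest:
        labels, counts = np.unique(y_train[row], return_counts=True)
        predictions.append(labels[np.argmax(counts)])
    return np.array(predictions)
```

The same vote can be run in a kernel-induced feature space by replacing the squared distance with k(x, x) - 2 k(x, y) + k(y, y), which needs only kernel evaluations.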

24
Effect of data resampling
Prostate: 102 samples
Lung: 181 samples
25
Effect of gene selection
ALL-AML
26
Effect of gene selection
Colon
27
Effect of gene selection
Prostate
28
Comparison results
BreastER
ALL-AML
BreastLN
Colon
29
Comparison results
CNS
Lung
Prostate
Ovarian
30
Outline
  • Introduction
  • Data-dependent Kernel
  • Results
  • Conclusion

31
Conclusion
  • By maximizing the class separability of training
    data, the data-dependent kernel is also able to
    increase the separability of test data.
  • The kernel method is robust to high-dimensional
    microarray data
  • The distributed resampling strategy helps to
    alleviate the problem of overfitting

32
Conclusion
  • The classifier assigns samples more accurately
    than the other approaches, so each class can be
    matched with a more appropriate treatment.
  • The method can be used for clarifying unusual
    cases
  • e.g., a patient who was diagnosed with AML but
    with atypical morphology.
  • The method can be applied to distinctions
    relating to future clinical outcomes.

33
Future work
  • How to estimate the parameters
  • Study the genes selected

34
References
  • H. Xiong, M.N.S. Swamy, and M.O. Ahmad.
    Optimizing the data-dependent kernel in the
    empirical feature space. IEEE Trans. on Neural
    Networks 2005, 16:460-474.
  • H. Xiong, Y. Zhang, and X. Chen. Data-dependent
    Kernels for Cancer Classification. Under review.
  • A. Ben-Dor, L. Bruhn, N. Friedman, I. Nachman, M.
    Schummer, and Z. Yakhini. Tissue classification
    with gene expression profiles. J. Computational
    Biology 2000, 7:559-584.
  • S. Dudoit, J. Fridlyand, and T.P. Speed.
    Comparison of discrimination methods for the
    classification of tumors using gene expression
    data. J. Am. Statistical Assoc. 2002, 97:77-87.
  • T.S. Furey, N. Cristianini, N. Duffy, D.W.
    Bednarski, M. Schummer, and D. Haussler. Support
    vector machine classification and validation of
    cancer tissue samples using microarray expression
    data. Bioinformatics 2000, 16:906-914.
  • J. Ye, T. Li, T. Xiong, and R. Janardan. Using
    uncorrelated discriminant analysis for tissue
    classification with gene expression data.
    IEEE/ACM Trans. on Computational Biology and
    Bioinformatics 2004, 1:181-190.

35
Thanks! Questions?