Multi-Class Cancer Classification: Presentation Transcript
1
Multi-Class Cancer Classification
  • Noam Lerner

2
Our world, very generally
  • Genes.
  • Gene samples.
  • Our goal: classifying the samples.
  • Example: we want to be able to determine whether
    a certain sample belongs to a certain type of
    cancer.

3
Our problem
  • Say that we have p genes and N samples.
  • Normally, p < N, so it is easy to classify
    samples.
  • What if N < p?

4
The algorithm scheme
  • Gene screening.
  • Dimension reduction.
  • Classification.
  • We'll present three variations of this algorithm
    scheme.

5
Before gene screening - Classes
  • Normally, a class of genes is a set of genes that
    behave similarly under certain conditions.
  • Example: one can divide genes into a class of
    genes that indicate a certain type of cancer, and
    another class of genes that do not.
  • Taking it one step further...

6
Multi-classes
  • Dividing a group of genes into two or more
    classes is called a multi-class.
  • What is it good for?
  • Distinguishing between types of cancer.
  • Example: leukemia
  • AML
  • B-ALL
  • T-ALL

7
Gene Screening
  • Generally, gene screening is a method that is
    used to disregard unimportant genes.
  • Example: gene predictors.

8
The Gene Screening process
  • Suppose we have G classes that represent G types
    of cancer. (We know which genes belong in each
    class).
  • We compare every two classes pairwise and check
    whether the absolute mean difference
    |x_bar_r - x_bar_s| is greater than a certain
    critical score (x_bar_r is the mean of the r-th
    set of the multi-class).

9
What is the critical score?
  • The critical score is
    t * sqrt(MSE * (1/n_r + 1/n_s)), where:
  • MSE is the mean squared error (the pooled
    within-class variance).
  • n_r is the size of the r-th multi-set.
  • t arises from Student's t-distribution.
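A minimal sketch of this screening step in Python
(assumptions: the critical-score formula reconstructed
above, a 0.05 risk level, and a min_pairs threshold
mirroring the bracketed counts on the results slide;
the function and variable names are illustrative, not
from the article):

  import numpy as np
  from scipy import stats

  def screen_genes(X, y, alpha=0.05, min_pairs=1):
      # X: N x p expression matrix; y: class label per sample.
      # Keep genes whose pairwise absolute mean difference exceeds
      # the critical score t * sqrt(MSE * (1/n_r + 1/n_s)) for at
      # least min_pairs of the class pairs.
      classes = np.unique(y)
      N, G = len(y), len(classes)
      means = {g: X[y == g].mean(axis=0) for g in classes}
      resid = X - np.vstack([means[g] for g in y])
      mse = (resid ** 2).sum(axis=0) / (N - G)      # pooled per-gene MSE
      tcrit = stats.t.ppf(1 - alpha / 2, df=N - G)  # critical t value
      passes = np.zeros(X.shape[1], dtype=int)
      for i in range(G):
          for j in range(i + 1, G):
              n_r = (y == classes[i]).sum()
              n_s = (y == classes[j]).sum()
              diff = np.abs(means[classes[i]] - means[classes[j]])
              crit = tcrit * np.sqrt(mse * (1 / n_r + 1 / n_s))
              passes += diff > crit                 # count passed pairs
      return np.where(passes >= min_pairs)[0]      # indices of kept genes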

10
Student's t-distribution
  • The t-distribution is used to estimate the mean
    of a normally distributed population when the
    sample size is small and the variance is unknown.
  • Fact: the t-distribution depends on the size of
    the sample, but not on the mean or the variance
    of the items in the population. This lack of
    dependence is what makes the t-distribution
    important in both theory and practice.
  • Anecdote: William S. Gosset published a paper on
    this subject under the pseudonym "Student", and
    that is how the distribution got its name.

11
The Student t-test
  • The t-test assesses whether the means of two
    groups are statistically different from each
    other.
  • This analysis is appropriate whenever you want to
    compare the means of two groups.
  • It is assumed that the two groups have the same
    variance.

12
The Student t-test (cont.)
  • Consider the following three situations (figure:
    three pairs of distributions with the same
    difference between means but high, medium, and
    low variability).

13
The Student t-test (cont.)
  • The first thing to notice about the three
    situations is that the difference between the
    means is the same in all three.
  • We would want to conclude that the two groups are
    similar in the high-variability case, and that
    the two groups are distinct in the
    low-variability case.
  • Conclusion: when we are looking at the
    differences between scores for two groups, we
    have to judge the difference between their means
    relative to the spread, or variability, of their
    scores. The Student t-test does just this.

14
The Student t-test (cont.)
  • We say that two classes passed the Student t-test
    if the t-statistic is greater than a certain
    critical value, determined by:
  • Risk level (usually alpha = 0.05).
  • Degrees of freedom (n_r + n_s - 2 for two
    groups).
  • Look it up in a table.
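For illustration, a two-sample t-test in SciPy (the
data here are made up):

  import numpy as np
  from scipy import stats

  rng = np.random.default_rng(0)
  a = rng.normal(0.0, 1.0, size=20)   # group 1
  b = rng.normal(0.8, 1.0, size=20)   # group 2
  # equal_var=True matches the equal-variance assumption above.
  t, pval = stats.ttest_ind(a, b, equal_var=True)
  print(t, pval)  # reject "same mean" at risk level 0.05 if pval < 0.05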

15
Dimension Reduction
  • It appears that we need more than gene screening.
  • Reminder: we have p genes, N samples, N < p.
  • Most classification methods (the next phase of
    the algorithm) assume that p < N.
  • The solution: dimension reduction, reducing the
    gene-space dimension from p to K, where K << N.

16
Dimension Reduction (cont.)
  • This is done by constructing K gene components
    and then classifying the cancers based on the
    constructed K gene components.
  • Multivariate Partial Least Squares (MPLS) is one
    such dimension reduction method.
  • Example:

17
Example
  • Reducing dimension from 35 to 3 (5 classes).

18
Example (cont.)
This is the NCI60 data set, which contains 5
different types of cancer.
19
MPLS
  • Suppose we have G classes.
  • Suppose y indicates the cancer classes 1, ..., G.
  • We define an indicator row for every sample i:
    y_i = (y_i1, ..., y_iG), where y_ig = 1 if sample
    i belongs to class g and 0 otherwise; stacking
    these rows gives the response matrix Y.
  • Fix a K (our desired reduced dimension).
20
MPLS (cont.)
  • Suppose X is the gene expression values matrix.
  • Suppose t_1, ..., t_K are linear combinations of
    the columns of X.
  • Then MPLS finds (easily) two unit vectors w and c
    such that the following expression is maximized:
    Cov(Xw, Yc)^2.
  • Then MPLS extracts t_1, ..., t_K, and we are
    done.

21
Why maximizing the covariance?
  • If cov(x, y) > 0 then y increases as x increases.
  • If cov(x, y) < 0 then y decreases as x increases.
  • By maximizing the covariance, we get that Yc
    increases as Xw increases.
  • That way, we get a good estimation of Yc by Xw,
    and we have found our MPLS components
    t_1, ..., t_K.
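A minimal sketch of the reduction step, assuming
scikit-learn's PLS as a stand-in for MPLS (it performs
the same covariance-maximizing extraction; the helper
name and the K = 3 default are illustrative, not from
the article):

  import numpy as np
  from sklearn.cross_decomposition import PLSRegression

  def mpls_components(X, y, K=3):
      # X: N x p expression matrix; y: class labels 1..G.
      classes = np.unique(y)
      Y = (y[:, None] == classes).astype(float)    # N x G indicator matrix
      pls = PLSRegression(n_components=K).fit(X, Y)
      return pls, pls.transform(X)                 # score matrix T: N x K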

22
Classification
  • After we have reduced the dimension of the gene
    space, we need to actually classify the
    sample(s).
  • It's important to pick a classification method
    that will work properly after dimension
    reduction.
  • We'll present two different methods: PD and QDA.

23
PD (Polychotomous Discrimination)
  • Recall the indicator y that indicates the cancer
    classes 1, ..., G.
  • Set a gene expression profile vector x.
  • Then, the distribution of y depends on x (we
    think of y as a random variable).
  • We also suppose that this dependence takes a
    particular parametric form, given on the next
    slide.

24
PD (cont.)
  • We define p_r(x) = P(y = r | x).
  • After a few mathematical transitions we get that
    p_r(x) = exp(x' beta_r) /
    (1 + sum_{g<G} exp(x' beta_g)), with class G as a
    baseline (beta_G = 0).
  • This is the probability that a sample with gene
    expression profile x is of cancer class r.

25
PD (cont.)
  • By looking at the previous formula through a
    certain mathematical model, we can estimate a
    parameter vector beta that holds all the data, by
    maximizing the likelihood.
  • The likelihood can be maximized only if there are
    more samples (N) than parameters (p), and by
    using dimension reduction, we got just that.

26
PD (cont.)
  • So, instead of looking at the gene expression
    profile x, we'll look at the corresponding gene
    component profile, t.
  • Now, let's look at the new probabilities p_r(t),
    which rely on the new t.
  • Finally, we'll say that t (and therefore x)
    belongs to the r-th cancer class if p_r(t) is the
    largest of the G probabilities.
  • A more detailed explanation of PD is given in the
    presentation's appendix.
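PD, as described, amounts to multinomial
(polychotomous) logistic regression on the gene
components; a minimal sketch with scikit-learn's
solver standing in for the Newton-Raphson MLE of the
appendix (the data below are made-up stand-ins for the
N x K component matrix and the labels):

  import numpy as np
  from sklearn.linear_model import LogisticRegression

  rng = np.random.default_rng(0)
  T = rng.normal(size=(72, 3))        # stand-in for the N x K components
  y = rng.integers(1, 4, size=72)     # stand-in labels for G = 3 classes
  pd_clf = LogisticRegression(max_iter=1000)  # multinomial for G > 2
  pd_clf.fit(T, y)
  probs = pd_clf.predict_proba(T[:1])  # p_r(t) for each class r
  label = pd_clf.predict(T[:1])        # class with the largest p_r(t)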

27
QDA (Quadratic Discriminant Analysis)
  • Recall the indicator y that indicates the cancer
    classes 1, ..., G.
  • Consider the following multivariate normal model
    for each cancer class r: x | y = r ~
    N(mu_r, Sigma_r).

28
QDA (cont.)
  • Suppose c(x) is the predicted cancer class of a
    sample x; then c(x) = argmax_r pi_r * f_r(x).
  • Where pi_r is the prior proportion of class r,
  • and f_r is the N(mu_r, Sigma_r) pdf function.

29
QDA (cont.)
  • Again, instead of looking at the gene expression
    profile x, we'll look at the corresponding gene
    component profile, t, and get the desired
    classification.
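A minimal sketch with scikit-learn's QDA, which fits
this per-class Gaussian model (again with made-up
stand-in data):

  import numpy as np
  from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis

  rng = np.random.default_rng(0)
  T = rng.normal(size=(72, 3))      # stand-in for the N x K components
  y = rng.integers(1, 4, size=72)   # stand-in labels for G = 3 classes
  qda = QuadraticDiscriminantAnalysis()  # estimates pi_r, mu_r, Sigma_r
  qda.fit(T, y)
  label = qda.predict(T[:1])        # argmax_r pi_r * f_r(t)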

30
Review - the big picture
  • Gene screening allows us to get rid of genes that
    won't tell us anything.
  • Dimension reduction allows us to reduce the gene
    space and work on the data.
  • Classification allows us to decide to which class
    of the multi-class a sample belongs.

31
Just before the algorithm
  • We want a way to assess whether we generated a
    correct classification.
  • In order to do that, we use LOOCV.

32
LOOCV
  • LOOCV stands for Leave-One-Out Cross-Validation.
  • In this process, we remove one data point from
    our data, run our algorithm, and try to estimate
    the removed data point using our results, as if
    we didn't know the original data point. Then, we
    assess the error.
  • This step is repeated for every data point, and
    finally, we aggregate the errors in some way into
    a final error estimate.
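A generic LOOCV sketch (fit_predict is a placeholder
for any of the classifiers above, wrapped as a
function):

  import numpy as np

  def loocv_error(X, y, fit_predict):
      # fit_predict(X_train, y_train, x_test) -> predicted label.
      N, errors = len(y), 0
      for i in range(N):
          mask = np.arange(N) != i                      # leave sample i out
          y_hat = fit_predict(X[mask], y[mask], X[i:i+1])
          errors += int(y_hat != y[i])                  # assess the error
      return errors / N                                 # final error estimate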

33
The 1st algorithm variation
  1. Gene screening: select a set S of m genes, giving
    an expression matrix X of size N x m.
  2. Dimension reduction: use MPLS to reduce X to T,
    where T is of size N x K.
  3. Classification: for i = 1 to N do:
  4. Leave out sample (row) i of T.
  5. Fit the classifier to the remaining N-1 samples
    and use the fitted classifier to predict the left
    out sample i.

34
The 2nd algorithm variation
  • Gene screening: select a set S of m genes, giving
    an expression matrix X of size N x m.
  • For i = 1 to N do:
  • Leave out sample (row) i of the expression matrix
    X, creating X^(-i).
  • Dimension reduction: use MPLS to reduce X^(-i) to
    T^(-i), where T^(-i) is of size (N-1) x K.
  • Classification: fit the classifier to the
    remaining N-1 samples and use the fitted
    classifier to predict the left out sample i.

35
Class question
  • Q: What is the difference between the 1st and 2nd
    variations?
  • A1: In the 1st variation, steps 1 and 2 are fixed
    with respect to LOOCV. Therefore, the effect of
    gene screening and dimension reduction on the
    classification cannot be assessed.
  • A2: In the 2nd variation, we can assess the
    effect of the dimension reduction.

36
More on the 1st variation
  • Results show that the 1st variation does not
    yield good results (the classification error
    rates were more optimistic than the expected
    error rates).
  • Taking it to the next level...

37
The 3rd algorithm variation
  • For i = 1 to N do:
  • Leave out sample (row) i of the original
    expression matrix X_0.
  • Gene screening: select a set S^(-i) of m genes,
    giving an expression matrix X^(-i) of size
    (N-1) x m.
  • Dimension reduction: use MPLS to reduce X^(-i) to
    T^(-i), where T^(-i) is of size (N-1) x K.
  • Classification: fit the classifier to the
    remaining N-1 samples and use the fitted
    classifier to predict the left out sample i.
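A rough end-to-end sketch of the 3rd variation,
reusing the screen_genes helper from the gene
screening sketch (the QDA classifier choice and the
defaults are illustrative, not mandated by the
article):

  import numpy as np
  from sklearn.cross_decomposition import PLSRegression
  from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis

  def variation3_error(X0, y, K=3):
      # Screening and reduction are redone inside every LOOCV fold,
      # so their effect is included in the error estimate.
      N, errors = len(y), 0
      for i in range(N):
          mask = np.arange(N) != i                   # leave sample i out
          genes = screen_genes(X0[mask], y[mask])    # screening without i
          Xi = X0[mask][:, genes]                    # (N-1) x m
          classes = np.unique(y[mask])
          Y = (y[mask][:, None] == classes).astype(float)
          pls = PLSRegression(n_components=K).fit(Xi, Y)
          Ti = pls.transform(Xi)                     # (N-1) x K
          clf = QuadraticDiscriminantAnalysis().fit(Ti, y[mask])
          ti = pls.transform(X0[i:i+1][:, genes])    # project sample i
          errors += int(clf.predict(ti)[0] != y[i])
      return errors / N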

38
Class question
  • Q: What is the difference between the 2nd and
    3rd variations?
  • A: The gene screening stage is fixed with respect
    to LOOCV in the 2nd variation, and isn't in the
    3rd variation.
  • That allows us to assess the error in the gene
    screening stage in the 3rd variation.

39
About the 3 variations
  • The 3rd variation is the only one that allows us
    to check the correctness of our model.
  • Why?
  • Because this is the only variation where we use
    LOOCV to delete a sample from our input matrix
    before the whole process, and then try to
    estimate it.
  • In the other two variations we estimate a sample
    after we have already used it in our process.

40
Results
  • Acute leukemia data.
  • Number of samples: N = 72.
  • Number of genes: p = 3490.
  • The multi-class:
  • AML: 25 samples.
  • B-ALL: 38 samples.
  • T-ALL: 9 samples.
  • New reduced dimension: K = 3.

41
Results (cont.)
Notation: numbers in brackets are the number of
times we demanded that the pairwise absolute mean
difference pass the critical score; numbers not in
brackets are the number of genes that passed. In
A2, the three numbers are the min/mean/max number
of genes selected (the gene screening process
selects differently every time). The data entries
are the error rates. Best result: QDA.
42
Article Criticism
  • The article does present a model that seems
    appropriate for solving the problem.
  • However, results show that there is a certain
    error rate (about 1/20).
  • The article was not clear on several subjects.
  • Nonetheless, it was interesting to read.

43
Questions?
  • Thank you for listening.

44
References
  • The article: "Multi-class cancer classification
    via partial least squares with gene expression
    profiles" by Danh V. Nguyen and David M. Rocke.
  • Student's t-distribution:
    http://en.wikipedia.org/wiki/T_distribution
  • Student t-test:
    http://www.socialresearchmethods.net/kb/stat_t.htm
  • LOOCV:
    http://www-2.cs.cmu.edu/~schneide/tut5/node42.html

45
Appendix: Polychotomous Discrimination, an explicit
explanation
  • Why do we define the log-odds theta_r(x) =
    log(p_r(x) / p_G(x)) against a baseline class G?
  • To avoid calculating the normalizing denominator
    of the probabilities directly.
  • Explanation: remember that p_1(x), ..., p_G(x)
    sum to 1.
  • So p_r(x) = exp(theta_r(x)) /
    sum_g exp(theta_g(x)), with theta_G(x) = 0.
  • So we don't have to calculate the denominator as
    a separate quantity.

46
PD
  • We assume we can write the log-odds as a linear
    function: theta_r(x) = x' beta_r.
  • Remembering that the probabilities sum to 1, we
    can get to p_r(x) = exp(x' beta_r) /
    (1 + sum_{g<G} exp(x' beta_g)).
  • This is our polychotomous regression model.
  • Next, we assign beta to that formula (replacing
    theta_r(x) with x' beta_r).

47
PD
  • Next, we define beta = (beta_1, ..., beta_{G-1}),
    the collection of all the coefficient vectors.
  • This holds our whole model.
  • Now we want to estimate beta using MLE (Maximum
    Likelihood Estimation).
  • We'll describe how to do that.

48
PD
  • Defining a notation: p_ir = p_r(x_i).
  • Now, re-writing the formula from the two slides
    back: p_ir = exp(x_i' beta_r) /
    (1 + sum_{g<G} exp(x_i' beta_g)).
  • So, by taking log, we get: log p_ir =
    x_i' beta_r - log(1 + sum_{g<G} exp(x_i' beta_g)).
  • Next, define a row of indicators for a sample:
    z_i = (z_i1, ..., z_iG),
  • where each z_ir is 0 or 1 and sum_r z_ir = 1,
  • where z_ir states whether the sample belongs to a
    type-r cancer.

49
PD
  • Now, define a matrix Z whose i-th row is z_i.
  • Notice that every row of Z sums to 1, meaning
    that in every row of Z the sample was classified
    to exactly one cancer class.
  • Using these notations, we conclude that the
    likelihood for N independent samples is
    L(beta) = prod_i prod_r p_ir^(z_ir).

50
PD
  • Taking log, we get the log-likelihood (which is
    easier to compute): l(beta) =
    sum_i sum_r z_ir * log(p_ir).

51
PD
  • Next, remembering the expression for log(p_ir),
    we get that l(beta) = sum_i [sum_{r<G} z_ir *
    (x_i' beta_r) - log(1 + sum_{g<G} exp(x_i' beta_g))].
  • Now, this expression can be maximized to achieve
    the MLE using the Newton-Raphson method.
  • One of the cases in which the MLE exists is
    characterized by Albert and Anderson (1984) in
    terms of a separating vector and the index sets
    identifying all samples in class r: roughly, the
    MLE exists when no vector linearly separates the
    classes.
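A generic Newton-Raphson sketch for this maximization
(grad and hess are placeholders for the first and
second derivatives of l(beta)):

  import numpy as np

  def newton_raphson(beta0, grad, hess, tol=1e-8, max_iter=100):
      # Maximize l(beta) by solving grad(beta) = 0 with Newton steps.
      beta = beta0.copy()
      for _ in range(max_iter):
          step = np.linalg.solve(hess(beta), grad(beta))  # H^{-1} g
          beta = beta - step                              # Newton update
          if np.linalg.norm(step) < tol:                  # converged
              break
      return beta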

52
Appendix References
  • Article appendices:
    http://dnguyen.ucdavis.edu/.html/SUP_cla2/SupplementalAppendix.pdf
  • Newton-Raphson method:
    http://en.wikipedia.org/wiki/Newton-Raphson_method
  • "On the Existence of Maximum Likelihood Estimates
    in Logistic Regression Models" (A. Albert and
    J. A. Anderson, 1984):
    http://www.qiji.cn/eprint/abs/2376.html

53
Abstract: This presentation deals with multi-class
cancer classification, the process of classifying
samples into multiple types of cancer. The article
describes a 3-phase algorithm scheme to demonstrate
the classification. The 3 phases are gene
selection, dimension reduction, and classification.
We present one example of a gene selection method,
one example of a dimension reduction method (MPLS),
and two classification methods (PD and QDA), which
we then compare. The presentation also presents
concepts like class, multi-class, t-test, and
LOOCV.