Title: Multi-Class Cancer Classification
1. Multi-Class Cancer Classification
2. Our world, very generally
- Genes.
- Gene samples.
- Our goal: classifying the samples.
- Example: We want to be able to determine whether a certain sample belongs to a certain type of cancer.
3. Our problem
- Say that we have p genes and N samples.
- Normally p < N, so it is easy to classify samples.
- What if N < p?
4. The algorithm scheme
- Gene screening.
- Dimension reduction.
- Classification.
- We'll present 3 variations of this algorithm scheme.
5. Before gene screening - Classes
- Normally, a class of genes is a set of genes that behave similarly under certain conditions.
- Example: One can divide genes into a class of genes that indicate a certain type of cancer, and another class of genes that do not.
- Taking it one step further...
6. Multi-classes
- Dividing a group of genes into two or more classes is called a multi-class.
- What is it good for?
- Distinguishing between types of cancer.
- Example: Leukemia
- AML
- B-ALL
- T-ALL
7. Gene Screening
- Generally, gene screening is a method used to disregard unimportant genes.
- Example: gene predictors.
8. The Gene Screening process
- Suppose we have G classes that represent G types of cancer. (We know which genes belong in each class.)
- We compare every two classes pair-wise and check whether the absolute mean expression difference |x̄_r - x̄_s| is greater than a certain critical score. (x̄_r is the mean of the r-th set of the multi-class.)
9. What is the critical score?
- The critical score is t · sqrt(MSE · (1/n_r + 1/n_s)), where:
- MSE is the mean squared error,
- n_r is the size of the r-th multi-set,
- t arises from Student's t-distribution.
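As a concrete illustration, here is a minimal numpy sketch of this pairwise screen for a single gene. The function name is ours, and for simplicity the critical t value is taken as a plain argument rather than looked up by risk level and degrees of freedom:

```python
import numpy as np

def screen_gene(groups, t_crit):
    """Pairwise mean-difference screening for one gene.

    groups: list of 1-D arrays, the gene's expression values in each class.
    t_crit: critical t value (from a Student's t table).
    Returns True if every pair of class means differs by more than the
    critical score t_crit * sqrt(MSE * (1/n_r + 1/n_s)).
    """
    G = len(groups)
    # Pooled within-class mean squared error
    sse = sum(((g - g.mean()) ** 2).sum() for g in groups)
    dof = sum(len(g) for g in groups) - G
    mse = sse / dof
    for r in range(G):
        for s in range(r + 1, G):
            n_r, n_s = len(groups[r]), len(groups[s])
            score = t_crit * np.sqrt(mse * (1 / n_r + 1 / n_s))
            if abs(groups[r].mean() - groups[s].mean()) <= score:
                return False
    return True
```

A gene whose class means are far apart relative to the pooled spread passes the screen; a gene with identical class means does not.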
10. Student's t-distribution
- The t-distribution is used to estimate the mean and variance of a normally distributed population when the sample size is small.
- Fact: The t-distribution depends on the size of the sample, but not on the mean or the variance of the items in the population. This lack of dependence is what makes the t-distribution important in both theory and practice.
- Anecdote: William S. Gosset published a paper on this subject under the pseudonym "Student", and that's how the distribution got its name.
11. The Student t-test
- The t-test assesses whether the means of two groups are statistically different from each other.
- This analysis is appropriate whenever you want to compare the means of two groups.
- It is assumed that the two groups have the same variance.
12. The Student t-test (cont.)
- Consider the next three situations:
13. The Student t-test (cont.)
- The first thing to notice about the three situations is that the difference between the means is the same in all three.
- We would want to conclude that the two groups are similar in the high-variability case, and that the two groups are distinct in the low-variability case.
- Conclusion: when we look at the difference between scores for two groups, we have to judge the difference between their means relative to the spread, or variability, of their scores. The Student t-test does just this.
14. The Student t-test (cont.)
- We say that two classes passed the Student t-test if the t statistic is greater than a certain parameter.
- Risk level: usually 0.05.
- Degrees of freedom: n_1 + n_2 - 2.
- Look it up in a table.
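A small numpy sketch of the pooled, equal-variance t statistic described above (the function name is ours):

```python
import numpy as np

def t_statistic(a, b):
    """Two-sample t statistic under the equal-variance assumption.

    Returns (t, degrees of freedom); |t| is then compared against the
    table value for the chosen risk level.
    """
    a, b = np.asarray(a, float), np.asarray(b, float)
    n1, n2 = len(a), len(b)
    # Pooled variance estimate with n1 + n2 - 2 degrees of freedom
    sp2 = (((a - a.mean()) ** 2).sum() + ((b - b.mean()) ** 2).sum()) / (n1 + n2 - 2)
    se = np.sqrt(sp2 * (1 / n1 + 1 / n2))  # standard error of the mean difference
    return (a.mean() - b.mean()) / se, n1 + n2 - 2
```

For example, groups [0, 1, 2] and [10, 11, 12] give a large |t| (about 12.25 with 4 degrees of freedom), far beyond any usual table value.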
15. Dimension Reduction
- It appears that we need more than gene screening.
- Reminder: We have p genes, N samples, N < p.
- Most classification methods (the next phase of the algorithm) assume that p < N.
- The solution, dimension reduction: reducing the gene space dimension from p to K, where K << N.
16. Dimension Reduction (cont.)
- This is done by constructing K gene components and then classifying the cancers based on the constructed K gene components.
- Multivariate Partial Least Squares (MPLS) is a dimension reduction method.
- Example:
17. Example
- Reducing dimension from 35 to 3 (5 classes).
18. Example (cont.)
This is the NCI60 data set, which contains 5 various types of cancer.
19. MPLS
- Suppose we have G classes.
- Suppose y indicates the cancer classes 1, …, G.
- We define an indicator row for every sample, marking the class it belongs to.
- Fix a K (our desired reduced dimension).
20. MPLS (cont.)
- Suppose X is the gene expression values matrix.
- Suppose t_1, …, t_K are linear combinations of the columns of X.
- Then MPLS finds (easily) two unit vectors w and c such that cov(Xw, Yc) is maximized.
- Then MPLS extracts t_1, …, t_K, and we are done.
21. Why maximize the covariance?
- If cov(x, y) > 0 then y increases as x increases.
- If cov(x, y) < 0 then y decreases as x increases.
- By maximizing the covariance, we get that Yc increases as Xw increases.
- That way, we get a good estimation of Yc by Xw, and we have found our MPLS components t_1, …, t_K.
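The maximizing unit vectors w and c can be obtained from the singular value decomposition of X'Y. The numpy sketch below extracts K components this way, deflating X between components; this is a simplification of the article's MPLS procedure, not its exact implementation:

```python
import numpy as np

def mpls_components(X, Y, K):
    """Extract K PLS components from gene matrix X (N x p) and class
    indicator matrix Y (N x G)."""
    X = X - X.mean(axis=0)                 # center columns
    Y = Y - Y.mean(axis=0)
    T = np.zeros((X.shape[0], K))
    for k in range(K):
        # Leading left singular vector of X'Y is the unit w (with its c)
        # maximizing cov(Xw, Yc)
        w = np.linalg.svd(X.T @ Y, full_matrices=False)[0][:, 0]
        t = X @ w                          # k-th gene component
        T[:, k] = t
        # Deflate X so the next component is orthogonal to t
        X = X - np.outer(t, X.T @ t / (t @ t))
    return T
```

The deflation step makes successive components mutually orthogonal, so each one captures covariance with Y not already explained.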
22. Classification
- After we have reduced the dimension of the gene space, we need to actually classify the sample(s).
- It's important to pick a classification method that will work properly after dimension reduction.
- We'll present two different methods: PD and QDA.
23. PD (Polychotomous Discrimination)
- Recall the indicator y that indicates the cancer classes 1, …, G.
- Set a vector x of gene expression values.
- Then the distribution of y depends on x. (We think of y as a random variable.)
- We also suppose a regression model for P(y = r | x).
24. PD (cont.)
- We define π_r(x) = P(y = r | x).
- After a few mathematical transitions we get that π_r(x) = exp(xβ_r) / (1 + Σ_{s=1..G-1} exp(xβ_s)).
- This is the probability that a sample with gene expression profile x is of cancer class r.
25. PD (cont.)
- By looking at the previous formula through a certain mathematical model, we can maximize a parameter β that holds all the data.
- The parameter can be maximized only if there are more samples (N) than parameters (p); by using dimension reduction, we got just that.
26. PD (cont.)
- So, instead of looking at the gene expression profile x, we'll look at the corresponding gene component profile t.
- Now, let's look at the new probabilities π_r(t), which rely on the new t.
- Finally, we'll say that t (and therefore x) belongs to the r-th cancer class if π_r(t) is the largest of the G probabilities.
- A more detailed explanation of PD is given in the presentation's appendix.
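The classification rule above can be sketched as follows. The coefficient matrix B is assumed to be already fitted (by the MLE described in the appendix), class G serves as the reference class, and the function names are ours:

```python
import numpy as np

def pd_probabilities(t, B):
    """Class probabilities under a polychotomous (multinomial logistic) model.

    t: gene component profile, shape (K,).
    B: fitted coefficients, shape (K, G-1); class G is the reference class.
    Returns the G probabilities P(y = r | t).
    """
    scores = np.exp(t @ B)                  # exp(t @ beta_r) for r = 1..G-1
    denom = 1.0 + scores.sum()
    return np.append(scores, 1.0) / denom   # reference class contributes exp(0) = 1

def pd_classify(t, B):
    # Assign to the class with the maximal probability (classes numbered 1..G)
    return int(np.argmax(pd_probabilities(t, B))) + 1
```

With B all zeros the model is uninformative and every class gets probability 1/G; a large positive coefficient pushes the corresponding class's probability toward 1.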
27. QDA (Quadratic Discriminant Analysis)
- Recall the indicator y that indicates the cancer classes 1, …, G.
- Consider the following multivariate normal model (for each cancer class): x | y = r ~ N(μ_r, Σ_r).
28. QDA (cont.)
- A sample x is classified to the r-th cancer class that maximizes p_r · f_r(x).
- Where:
- p_r is the prior probability of the r-th cancer class,
- f_r is the N(μ_r, Σ_r) pdf function.
29. QDA (cont.)
- Again, instead of looking at the gene expression profile x, we'll look at the corresponding gene component profile t, and get the desired classification.
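A minimal numpy sketch of the QDA rule on component profiles. The function names are ours, and each class covariance is assumed nonsingular (which is why QDA is applied after dimension reduction):

```python
import numpy as np

def qda_fit(X, y):
    """Per-class prior, mean and covariance estimated from training data."""
    params = {}
    for r in np.unique(y):
        Xr = X[y == r]
        params[r] = (len(Xr) / len(X),            # prior p_r
                     Xr.mean(axis=0),             # mean mu_r
                     np.cov(Xr, rowvar=False))    # covariance Sigma_r
    return params

def qda_predict(x, params):
    """Assign x to the class r maximizing log(p_r * f_r(x))."""
    scores = {}
    for r, (prior, mu, cov) in params.items():
        d = x - mu
        # log prior + log multivariate normal density (constant term dropped)
        scores[r] = (np.log(prior) - 0.5 * np.linalg.slogdet(cov)[1]
                     - 0.5 * d @ np.linalg.solve(cov, d))
    return max(scores, key=scores.get)
```

Working in log space avoids underflow in the density values and makes the quadratic form in (x - μ_r) explicit.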
30. Review - the big picture
- Gene screening allows us to get rid of genes that won't tell us anything.
- Dimension reduction allows us to reduce the gene space and work on the data.
- Classification allows us to decide if a sample has a cancer of a certain multi-class.
31. Just before the algorithm
- We would want a way to assess whether we generated a correct classification.
- In order to do that, we use LOOCV.
32. LOOCV
- LOOCV stands for Leave-One-Out Cross-Validation.
- In this process, we remove one data point from our data, run our algorithm, and try to estimate the removed data point using our results, as if we didn't know the original data point. Then we assess the error.
- This step is repeated for every data point, and finally we accumulate the errors in some form for a final error estimation.
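The LOOCV loop can be sketched generically as below. The nearest-class-mean classifier is only a stand-in so the example runs; it is not one of the article's methods:

```python
import numpy as np

def loocv_error(X, y, fit, predict):
    """Leave-one-out cross-validation error rate.

    fit(X_train, y_train) -> model;  predict(model, x) -> label.
    """
    N, errors = len(X), 0
    for i in range(N):
        keep = np.arange(N) != i                  # remove data point i
        model = fit(X[keep], y[keep])
        if predict(model, X[i]) != y[i]:          # assess the error
            errors += 1
    return errors / N                             # accumulated error rate

# Stand-in classifier: nearest class mean
def fit_nearest_mean(X, y):
    return {r: X[y == r].mean(axis=0) for r in np.unique(y)}

def predict_nearest_mean(model, x):
    return min(model, key=lambda r: np.linalg.norm(x - model[r]))
```

On well-separated data the left-out point is predicted correctly every time, giving an error rate of 0.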
33. The 1st algorithm variation
- Gene screening: select a set S of m genes, giving an expression matrix X of size N x m.
- Dimension reduction: use MPLS to reduce X to T, where T is of size N x K.
- Classification: for i = 1 to N do:
- Leave out sample (row) i of T.
- Fit the classifier to the remaining N-1 samples and use the fitted classifier to predict the left-out sample i.
34. The 2nd algorithm variation
- Gene screening: select a set S of m genes, giving an expression matrix X of size N x m.
- For i = 1 to N do:
- Leave out sample (row) i of the expression matrix X, creating X_{-i}.
- Dimension reduction: use MPLS to reduce X_{-i} to T_{-i}, where T_{-i} is of size (N-1) x K.
- Classification: fit the classifier to the remaining N-1 samples and use the fitted classifier to predict the left-out sample i.
35. Class question
- Q: What is the difference between the 1st and 2nd variations?
- A1: In the 1st variation, steps 1 and 2 are fixed with respect to LOOCV. Therefore, the effect of gene screening and dimension reduction on the classification cannot be assessed.
- A2: In the 2nd variation, we can assess the effect of the dimension reduction.
36. More on the 1st variation
- Results show that the 1st variation does not yield good results. (The classification error rates were more optimistic than the expected error rates.)
- Taking it to the next level...
37. The 3rd algorithm variation
- For i = 1 to N do:
- Leave out sample (row) i of the original expression matrix X0.
- Gene screening: select a set S_{-i} of m genes, giving an expression matrix X_{-i} of size (N-1) x m.
- Dimension reduction: use MPLS to reduce X_{-i} to T_{-i}, where T_{-i} is of size (N-1) x K.
- Classification: fit the classifier to the remaining N-1 samples and use the fitted classifier to predict the left-out sample i.
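A compact, runnable sketch of the 3rd variation's loop, with deliberate stand-ins for brevity: a highest-variance gene screen instead of the article's pairwise t-test screen, SVD-based PLS components instead of the article's exact MPLS procedure, and a nearest-class-mean classifier instead of PD or QDA:

```python
import numpy as np

def variation3_loocv(X0, y, m, K):
    """LOOCV error rate when screening and reduction are redone in every fold."""
    N, G = len(X0), y.max() + 1
    Y = np.eye(G)[y]                          # indicator (one-hot) rows
    errors = 0
    for i in range(N):
        train = np.arange(N) != i             # leave out sample i
        # 1) Gene screening on the N-1 training samples only (variance stand-in)
        genes = np.argsort(X0[train].var(axis=0))[-m:]
        Xtr, xte = X0[train][:, genes], X0[i, genes]
        # 2) Dimension reduction: K PLS components via SVD of X'Y
        xm = Xtr.mean(axis=0)
        Xc, Yc = Xtr - xm, Y[train] - Y[train].mean(axis=0)
        W, P = [], []
        Xd = Xc.copy()
        for _ in range(K):
            w = np.linalg.svd(Xd.T @ Yc, full_matrices=False)[0][:, 0]
            t = Xd @ w
            p = Xd.T @ t / (t @ t)
            Xd = Xd - np.outer(t, p)          # deflate X
            W.append(w); P.append(p)
        W, P = np.array(W).T, np.array(P).T
        R = W @ np.linalg.inv(P.T @ W)        # rotation mapping X to scores
        Ttr, tte = Xc @ R, (xte - xm) @ R
        # 3) Classification: nearest class mean in component space
        means = {r: Ttr[y[train] == r].mean(axis=0) for r in np.unique(y[train])}
        pred = min(means, key=lambda r: np.linalg.norm(tte - means[r]))
        errors += pred != y[i]
    return errors / N
```

The point the code makes explicit is that steps 1 and 2 only ever see the N-1 training rows, so the left-out sample is truly unseen when it is predicted.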
38. Class question
- Q: What is the difference between the 2nd and 3rd variations?
- A: The gene screening stage is fixed with respect to LOOCV in the 2nd variation, and isn't in the 3rd variation.
- That allows us to assess the error in the gene screening stage in the 3rd variation.
39. About the 3 variations
- The 3rd variation is the only one that allows us to check the correctness of our model.
- Why?
- Because this is the only variation where we use LOOCV to delete a sample from our input matrix, and then try to estimate it.
- In the other two variations we estimate a sample after we have already used it in our process.
40. Results
- Acute Leukemia Data
- Number of samples: N = 72.
- Number of genes: p = 3490.
- The multi-class:
- AML: 25 samples.
- B-ALL: 38 samples.
- T-ALL: 9 samples.
- New reduced dimension: K = 3.
41. Results (cont.)
Notations: Numbers in brackets are the number of times we demanded that the pairwise absolute mean difference pass the critical score. Numbers not in brackets are the number of genes that passed. In A2, the three numbers are the min-mean-max number of genes selected. (The gene screening process selects differently every time.) Data: the error rate. Best result: QDA.
42. Article Criticism
- The article does present a model that seems appropriate for solving the problem.
- However, results show that there is a certain error rate (about 1/20).
- The article was not clear on several subjects.
- Nonetheless, it was interesting to read.
43. Questions?
44. References
- The article: "Multi-class cancer classification via partial least squares with gene expression profiles" by Danh V. Nguyen and David M. Rocke.
- Student's t-distribution: http://en.wikipedia.org/wiki/T_distribution
- Student t-test: http://www.socialresearchmethods.net/kb/stat_t.htm
- LOOCV: http://www-2.cs.cmu.edu/schneide/tut5/node42.html
45. Appendix - Polychotomous Discrimination, explicit explanation
- Why do we define the probabilities only for classes 1, …, G-1?
- To avoid calculating π_G(x).
- Explanation: Remember that the probabilities sum to 1.
- So π_G(x) = 1 - Σ_{r=1..G-1} π_r(x).
- So we don't have to calculate π_G(x).
46. PD
- We assume we can write log(π_r(x) / π_G(x)) = xβ_r.
- Remembering that the probabilities sum to 1, we can get to π_r(x) = exp(xβ_r) / (1 + Σ_{s=1..G-1} exp(xβ_s)).
- This is our polychotomous regression model.
- Next, we assign beta to that formula (replacing the unknown coefficients with the parameter vector β to be estimated).
47. PD
- Next, we define β = (β_1, …, β_{G-1}).
- This holds our whole model.
- Now we want to estimate β using MLE (Maximum Likelihood Estimation).
- We'll describe how to do that.
48. PD
- Defining a notation: π_ir = π_r(x_i), the probability that sample i belongs to class r.
- Now, rewriting the formula from two slides back in this notation.
- So, by taking the log, we get the log-probabilities.
- Next, define a row of indicators for a sample: y_i = (y_i1, …, y_iG),
- where y_ir is 0 or 1 and Σ_r y_ir = 1,
- and where y_ir states whether sample i belongs to cancer type r.
49. PD
- Now, define a matrix Y whose i-th row is the indicator row y_i.
- Notice that every row of Y sums to 1, meaning that in every row of Y the sample was classified to exactly one cancer class.
- Using these notations, we conclude that the likelihood for N independent samples is L(β) = Π_{i=1..N} Π_{r=1..G} π_ir^(y_ir).
50. PD
- Taking the log, we get the log-likelihood (which is easier to compute): l(β) = Σ_{i=1..N} Σ_{r=1..G} y_ir · log(π_ir).
51. PD
- Next, remembering that π_ir = exp(x_i β_r) / (1 + Σ_{s=1..G-1} exp(x_i β_s)), we get an explicit expression for l(β).
- Now, this expression can be maximized to achieve the MLE using the Newton-Raphson method.
- One of the cases in which the MLE exists is if there exists a vector separating the classes appropriately, where each index set identifies all the samples in class r (see the Albert and Anderson reference).
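To illustrate the Newton-Raphson step on a likelihood, here is a one-parameter example of ours (an exponential model), not the article's multinomial likelihood:

```python
def newton_raphson(grad, hess, x0, tol=1e-10, max_iter=50):
    """Find a stationary point from the gradient and Hessian (1-D case):
    x_{k+1} = x_k - grad(x_k) / hess(x_k)."""
    x = x0
    for _ in range(max_iter):
        step = grad(x) / hess(x)
        x = x - step
        if abs(step) < tol:                 # converged
            break
    return x

# Log-likelihood l(lam) = 3*log(lam) - 6*lam (3 exponential observations
# summing to 6); its maximum is at lam = 3/6 = 0.5.
lam_hat = newton_raphson(grad=lambda l: 3 / l - 6,
                         hess=lambda l: -3 / l ** 2,
                         x0=0.4)
```

Because the log-likelihood is concave, the iteration converges quadratically to the MLE from a reasonable starting point; the multinomial case works the same way with the gradient vector and Hessian matrix of l(β).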
52. Appendix References
- Article appendices: http://dnguyen.ucdavis.edu/.html/SUP_cla2/SupplementalAppendix.pdf
- Newton-Raphson method: http://en.wikipedia.org/wiki/Newton-Raphson_method
- "On the Existence of Maximum Likelihood Estimates in Logistic Regression Models" (A. Albert and J. A. Anderson, 1984): http://www.qiji.cn/eprint/abs/2376.html
53. Abstract
This presentation deals with multi-class cancer classification: the process of classifying samples into multiple types of cancer. The article describes a 3-phase algorithm scheme to demonstrate the classification. The 3 phases are gene selection, dimension reduction, and classification. We present one example of a gene selection method, one example of a dimension reduction method (MPLS), and two classification methods (PD and QDA), which we then compare. The presentation also presents concepts like class, multi-class, t-test, and LOOCV.