Title: Multi-Class Cancer Classification
1. Multi-Class Cancer Classification
2. Our world, very generally
- Genes.
- Gene samples.
- Our goal: classifying the samples.
- Example: We want to be able to determine whether a certain sample belongs to a certain type of cancer.
3. Our problem
- Say that we have p genes and N samples.
- Normally p < N, so it is easy to classify samples.
- What if N < p?
4. The algorithm scheme
- Gene screening.
- Dimension reduction.
- Classification.
- We'll present 3 variations of this algorithm scheme.
5. Before gene screening - Classes
- Normally, a class of genes is a set of genes that behave similarly under certain conditions.
- Example: One can divide genes into a class of genes that indicate a certain type of cancer, and another class of genes that do not.
- Taking it one step further...
6. Multi-classes
- Dividing a group of genes into two or more classes is called a multi-class.
- What is it good for?
- Distinguishing between types of cancer.
- Example: Leukemia
- AML
- B-ALL
- T-ALL
7. Gene Screening
- Generally, gene screening is a method used to disregard unimportant genes.
- Example: gene predictors.
8. The Gene Screening process
- Suppose we have G classes that represent G types of cancer. (We know which genes belong in each class.)
- We compare every two classes pair-wise and check whether the absolute mean expression difference |x̄_r - x̄_s| is greater than a certain critical score. (x̄_r is the mean of the r-th set of the multi-class.)
9. What is the critical score?
- The critical score is t · sqrt(MSE · (1/n_r + 1/n_s)), where:
- MSE is the mean squared error,
- n_r is the size of the r-th multi-set,
- t arises from Student's t-distribution.
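As a concrete illustration, here is a minimal numpy sketch of this pairwise screen for a single gene. The function name is ours, and for simplicity the critical t value is taken as a plain argument rather than looked up by risk level and degrees of freedom:

```python
import numpy as np

def screen_gene(groups, t_crit):
    """Pairwise mean-difference screening for one gene.

    groups: list of 1-D arrays, the gene's expression values in each class.
    t_crit: critical t value (from a Student's t table).
    Returns True if every pair of class means differs by more than the
    critical score t_crit * sqrt(MSE * (1/n_r + 1/n_s)).
    """
    G = len(groups)
    # Pooled within-class mean squared error
    sse = sum(((g - g.mean()) ** 2).sum() for g in groups)
    dof = sum(len(g) for g in groups) - G
    mse = sse / dof
    for r in range(G):
        for s in range(r + 1, G):
            n_r, n_s = len(groups[r]), len(groups[s])
            score = t_crit * np.sqrt(mse * (1 / n_r + 1 / n_s))
            if abs(groups[r].mean() - groups[s].mean()) <= score:
                return False
    return True
```

A gene whose class means are far apart relative to the pooled spread passes the screen; a gene with identical class means does not.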
10. Student's t-distribution
- The t-distribution is used to estimate the mean and variance of a normally distributed population when the sample size is small.
- Fact: The t-distribution depends on the size of the sample, but not on the mean or the variance of the items in the population. This lack of dependence is what makes the t-distribution important in both theory and practice.
- Anecdote: William S. Gosset published a paper on this subject under the pseudonym "Student", and that's how the distribution got its name.
11. The Student t-test
- The t-test assesses whether the means of two groups are statistically different from each other.
- This analysis is appropriate whenever you want to compare the means of two groups.
- It is assumed that the two groups have the same variance.
12. The Student t-test (cont.)
- Consider the next three situations:
13. The Student t-test (cont.)
- The first thing to notice about the three situations is that the difference between the means is the same in all three.
- We would want to conclude that the two groups are similar in the high-variability case, and that the two groups are distinct in the low-variability case.
- Conclusion: when we look at the difference between scores for two groups, we have to judge the difference between their means relative to the spread, or variability, of their scores. The Student t-test does just this.
14. The Student t-test (cont.)
- We say that two classes passed the Student t-test if the t statistic is greater than a certain parameter.
- Risk level: usually 0.05.
- Degrees of freedom: n_1 + n_2 - 2.
- Look it up in a table.
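A small numpy sketch of the pooled, equal-variance t statistic described above (the function name is ours):

```python
import numpy as np

def t_statistic(a, b):
    """Two-sample t statistic under the equal-variance assumption.

    Returns (t, degrees of freedom); |t| is then compared against the
    table value for the chosen risk level.
    """
    a, b = np.asarray(a, float), np.asarray(b, float)
    n1, n2 = len(a), len(b)
    # Pooled variance estimate with n1 + n2 - 2 degrees of freedom
    sp2 = (((a - a.mean()) ** 2).sum() + ((b - b.mean()) ** 2).sum()) / (n1 + n2 - 2)
    se = np.sqrt(sp2 * (1 / n1 + 1 / n2))  # standard error of the mean difference
    return (a.mean() - b.mean()) / se, n1 + n2 - 2
```

For example, groups [0, 1, 2] and [10, 11, 12] give a large |t| (about 12.25 with 4 degrees of freedom), far beyond any usual table value.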
15. Dimension Reduction
- It appears that we need more than gene screening.
- Reminder: We have p genes, N samples, N < p.
- Most classification methods (the next phase of the algorithm) assume that p < N.
- The solution, dimension reduction: reducing the gene space dimension from p to K, where K << N.
16. Dimension Reduction (cont.)
- This is done by constructing K gene components and then classifying the cancers based on the constructed K gene components.
- Multivariate Partial Least Squares (MPLS) is a dimension reduction method.
- Example:
17. Example
- Reducing dimension from 35 to 3 (5 classes).
18. Example (cont.)
This is the NCI60 data set, which contains 5 various types of cancer.
19. MPLS
- Suppose we have G classes.
- Suppose y indicates the cancer classes 1, …, G.
- We define an indicator row for every sample, marking the class it belongs to.
- Fix a K (our desired reduced dimension).
20. MPLS (cont.)
- Suppose X is the gene expression values matrix.
- Suppose t_1, …, t_K are linear combinations of the columns of X.
- Then MPLS finds (easily) two unit vectors w and c such that cov(Xw, Yc) is maximized.
- Then MPLS extracts t_1, …, t_K, and we are done.
21. Why maximize the covariance?
- If cov(x, y) > 0 then y increases as x increases.
- If cov(x, y) < 0 then y decreases as x increases.
- By maximizing the covariance, we get that Yc increases as Xw increases.
- That way, we get a good estimation of Yc by Xw, and we have found our MPLS components t_1, …, t_K.
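The maximizing unit vectors w and c can be obtained from the singular value decomposition of X'Y. The numpy sketch below extracts K components this way, deflating X between components; this is a simplification of the article's MPLS procedure, not its exact implementation:

```python
import numpy as np

def mpls_components(X, Y, K):
    """Extract K PLS components from gene matrix X (N x p) and class
    indicator matrix Y (N x G)."""
    X = X - X.mean(axis=0)                 # center columns
    Y = Y - Y.mean(axis=0)
    T = np.zeros((X.shape[0], K))
    for k in range(K):
        # Leading left singular vector of X'Y is the unit w (with its c)
        # maximizing cov(Xw, Yc)
        w = np.linalg.svd(X.T @ Y, full_matrices=False)[0][:, 0]
        t = X @ w                          # k-th gene component
        T[:, k] = t
        # Deflate X so the next component is orthogonal to t
        X = X - np.outer(t, X.T @ t / (t @ t))
    return T
```

The deflation step makes successive components mutually orthogonal, so each one captures covariance with Y not already explained.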
22. Classification
- After we have reduced the dimension of the gene space, we need to actually classify the sample(s).
- It's important to pick a classification method that will work properly after dimension reduction.
- We'll present two different methods: PD and QDA.
23. PD (Polychotomous Discrimination)
- Recall the indicator y that indicates the cancer classes 1, …, G.
- Set a vector x of gene expression values.
- Then the distribution of y depends on x. (We think of y as a random variable.)
- We also suppose a regression model for P(y = r | x).
24. PD (cont.)
- We define π_r(x) = P(y = r | x).
- After a few mathematical transitions we get that π_r(x) = exp(xβ_r) / (1 + Σ_{s=1..G-1} exp(xβ_s)).
- This is the probability that a sample with gene expression profile x is of cancer class r.
25. PD (cont.)
- By looking at the previous formula through a certain mathematical model, we can maximize a parameter β that holds all the data.
- The parameter can be maximized only if there are more samples (N) than parameters (p); by using dimension reduction, we got just that.
26. PD (cont.)
- So, instead of looking at the gene expression profile x, we'll look at the corresponding gene component profile t.
- Now, let's look at the new probabilities π_r(t), which rely on the new t.
- Finally, we'll say that t (and therefore x) belongs to the r-th cancer class if π_r(t) is the largest of the G probabilities.
- A more detailed explanation of PD is given in the presentation's appendix.
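The classification rule above can be sketched as follows. The coefficient matrix B is assumed to be already fitted (by the MLE described in the appendix), class G serves as the reference class, and the function names are ours:

```python
import numpy as np

def pd_probabilities(t, B):
    """Class probabilities under a polychotomous (multinomial logistic) model.

    t: gene component profile, shape (K,).
    B: fitted coefficients, shape (K, G-1); class G is the reference class.
    Returns the G probabilities P(y = r | t).
    """
    scores = np.exp(t @ B)                  # exp(t @ beta_r) for r = 1..G-1
    denom = 1.0 + scores.sum()
    return np.append(scores, 1.0) / denom   # reference class contributes exp(0) = 1

def pd_classify(t, B):
    # Assign to the class with the maximal probability (classes numbered 1..G)
    return int(np.argmax(pd_probabilities(t, B))) + 1
```

With B all zeros the model is uninformative and every class gets probability 1/G; a large positive coefficient pushes the corresponding class's probability toward 1.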
27. QDA (Quadratic Discriminant Analysis)
- Recall the indicator y that indicates the cancer classes 1, …, G.
- Consider the following multivariate normal model (for each cancer class): x | y = r ~ N(μ_r, Σ_r).
28. QDA (cont.)
- A sample x is classified to the r-th cancer class that maximizes p_r · f_r(x).
- Where:
- p_r is the prior probability of the r-th cancer class,
- f_r is the N(μ_r, Σ_r) pdf function.
29. QDA (cont.)
- Again, instead of looking at the gene expression profile x, we'll look at the corresponding gene component profile t, and get the desired classification.
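A minimal numpy sketch of the QDA rule on component profiles. The function names are ours, and each class covariance is assumed nonsingular (which is why QDA is applied after dimension reduction):

```python
import numpy as np

def qda_fit(X, y):
    """Per-class prior, mean and covariance estimated from training data."""
    params = {}
    for r in np.unique(y):
        Xr = X[y == r]
        params[r] = (len(Xr) / len(X),            # prior p_r
                     Xr.mean(axis=0),             # mean mu_r
                     np.cov(Xr, rowvar=False))    # covariance Sigma_r
    return params

def qda_predict(x, params):
    """Assign x to the class r maximizing log(p_r * f_r(x))."""
    scores = {}
    for r, (prior, mu, cov) in params.items():
        d = x - mu
        # log prior + log multivariate normal density (constant term dropped)
        scores[r] = (np.log(prior) - 0.5 * np.linalg.slogdet(cov)[1]
                     - 0.5 * d @ np.linalg.solve(cov, d))
    return max(scores, key=scores.get)
```

Working in log space avoids underflow in the density values and makes the quadratic form in (x - μ_r) explicit.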
30. Review - the big picture
- Gene screening allows us to get rid of genes that won't tell us anything.
- Dimension reduction allows us to reduce the gene space and work on the data.
- Classification allows us to decide if a sample has a cancer of a certain multi-class.
31. Just before the algorithm
- We would want a way to assess whether we generated a correct classification.
- In order to do that, we use LOOCV.
32. LOOCV
- LOOCV stands for Leave-One-Out Cross-Validation.
- In this process, we remove one data point from our data, run our algorithm, and try to estimate the removed data point using our results, as if we didn't know the original data point. Then we assess the error.
- This step is repeated for every data point, and finally we accumulate the errors in some form for a final error estimation.
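The LOOCV loop can be sketched generically as below. The nearest-class-mean classifier is only a stand-in so the example runs; it is not one of the article's methods:

```python
import numpy as np

def loocv_error(X, y, fit, predict):
    """Leave-one-out cross-validation error rate.

    fit(X_train, y_train) -> model;  predict(model, x) -> label.
    """
    N, errors = len(X), 0
    for i in range(N):
        keep = np.arange(N) != i                  # remove data point i
        model = fit(X[keep], y[keep])
        if predict(model, X[i]) != y[i]:          # assess the error
            errors += 1
    return errors / N                             # accumulated error rate

# Stand-in classifier: nearest class mean
def fit_nearest_mean(X, y):
    return {r: X[y == r].mean(axis=0) for r in np.unique(y)}

def predict_nearest_mean(model, x):
    return min(model, key=lambda r: np.linalg.norm(x - model[r]))
```

On well-separated data the left-out point is predicted correctly every time, giving an error rate of 0.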
33. The 1st algorithm variation
- Gene screening: select a set S of m genes, giving an expression matrix X of size N x m.
- Dimension reduction: use MPLS to reduce X to T, where T is of size N x K.
- Classification: for i = 1 to N do:
- Leave out sample (row) i of T.
- Fit the classifier to the remaining N-1 samples and use the fitted classifier to predict the left-out sample i.
34. The 2nd algorithm variation
- Gene screening: select a set S of m genes, giving an expression matrix X of size N x m.
- For i = 1 to N do:
- Leave out sample (row) i of the expression matrix X, creating X_{-i}.
- Dimension reduction: use MPLS to reduce X_{-i} to T_{-i}, where T_{-i} is of size (N-1) x K.
- Classification: fit the classifier to the remaining N-1 samples and use the fitted classifier to predict the left-out sample i.
35. Class question
- Q: What is the difference between the 1st and 2nd variations?
- A1: In the 1st variation, steps 1 and 2 are fixed with respect to LOOCV. Therefore, the effect of gene screening and dimension reduction on the classification cannot be assessed.
- A2: In the 2nd variation, we can assess the effect of the dimension reduction.
36. More on the 1st variation
- Results show that the 1st variation does not yield good results. (The classification error rates were more optimistic than the expected error rates.)
- Taking it to the next level...
37. The 3rd algorithm variation
- For i = 1 to N do:
- Leave out sample (row) i of the original expression matrix X0.
- Gene screening: select a set S_{-i} of m genes, giving an expression matrix X_{-i} of size (N-1) x m.
- Dimension reduction: use MPLS to reduce X_{-i} to T_{-i}, where T_{-i} is of size (N-1) x K.
- Classification: fit the classifier to the remaining N-1 samples and use the fitted classifier to predict the left-out sample i.
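A compact, runnable sketch of the 3rd variation's loop, with deliberate stand-ins for brevity: a highest-variance gene screen instead of the article's pairwise t-test screen, SVD-based PLS components instead of the article's exact MPLS procedure, and a nearest-class-mean classifier instead of PD or QDA:

```python
import numpy as np

def variation3_loocv(X0, y, m, K):
    """LOOCV error rate when screening and reduction are redone in every fold."""
    N, G = len(X0), y.max() + 1
    Y = np.eye(G)[y]                          # indicator (one-hot) rows
    errors = 0
    for i in range(N):
        train = np.arange(N) != i             # leave out sample i
        # 1) Gene screening on the N-1 training samples only (variance stand-in)
        genes = np.argsort(X0[train].var(axis=0))[-m:]
        Xtr, xte = X0[train][:, genes], X0[i, genes]
        # 2) Dimension reduction: K PLS components via SVD of X'Y
        xm = Xtr.mean(axis=0)
        Xc, Yc = Xtr - xm, Y[train] - Y[train].mean(axis=0)
        W, P = [], []
        Xd = Xc.copy()
        for _ in range(K):
            w = np.linalg.svd(Xd.T @ Yc, full_matrices=False)[0][:, 0]
            t = Xd @ w
            p = Xd.T @ t / (t @ t)
            Xd = Xd - np.outer(t, p)          # deflate X
            W.append(w); P.append(p)
        W, P = np.array(W).T, np.array(P).T
        R = W @ np.linalg.inv(P.T @ W)        # rotation mapping X to scores
        Ttr, tte = Xc @ R, (xte - xm) @ R
        # 3) Classification: nearest class mean in component space
        means = {r: Ttr[y[train] == r].mean(axis=0) for r in np.unique(y[train])}
        pred = min(means, key=lambda r: np.linalg.norm(tte - means[r]))
        errors += pred != y[i]
    return errors / N
```

The point the code makes explicit is that steps 1 and 2 only ever see the N-1 training rows, so the left-out sample is truly unseen when it is predicted.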
38. Class question
- Q: What is the difference between the 2nd and 3rd variations?
- A: The gene screening stage is fixed with respect to LOOCV in the 2nd variation, and isn't in the 3rd variation.
- That allows us to assess the error in the gene screening stage in the 3rd variation.
39. About the 3 variations
- The 3rd variation is the only one that allows us to check the correctness of our model.
- Why?
- Because this is the only variation where we use LOOCV to delete a sample from our input matrix, and then try to estimate it.
- In the other two variations we estimate a sample after we have already used it in our process.
40. Results
- Acute Leukemia Data
- Number of samples: N = 72.
- Number of genes: p = 3490.
- The multi-class:
- AML: 25 samples.
- B-ALL: 38 samples.
- T-ALL: 9 samples.
- New reduced dimension: K = 3.
41. Results (cont.)
Notations: Numbers in brackets are the number of times we demanded that the pairwise absolute mean difference pass the critical score. Numbers not in brackets are the number of genes that passed. In A2, the three numbers are the min-mean-max number of genes selected. (The gene screening process selects differently every time.) Data: the error rate. Best result: QDA.
42. Article Criticism
- The article does present a model that seems appropriate for solving the problem.
- However, results show that there is a certain error rate (about 1/20).
- The article was not clear on several subjects.
- Nonetheless, it was interesting to read.
43. Questions?
44. References
- The article: "Multi-class cancer classification via partial least squares with gene expression profiles" by Danh V. Nguyen and David M. Rocke.
- Student's t-distribution: http://en.wikipedia.org/wiki/T_distribution
- Student t-test: http://www.socialresearchmethods.net/kb/stat_t.htm
- LOOCV: http://www-2.cs.cmu.edu/schneide/tut5/node42.html
45. Appendix - Polychotomous Discrimination, explicit explanation
- Why do we define the probabilities only for classes 1, …, G-1?
- To avoid calculating π_G(x).
- Explanation: Remember that the probabilities sum to 1.
- So π_G(x) = 1 - Σ_{r=1..G-1} π_r(x).
- So we don't have to calculate π_G(x).
46. PD
- We assume we can write log(π_r(x) / π_G(x)) = xβ_r.
- Remembering that the probabilities sum to 1, we can get to π_r(x) = exp(xβ_r) / (1 + Σ_{s=1..G-1} exp(xβ_s)).
- This is our polychotomous regression model.
- Next, we assign beta to that formula (replacing the unknown coefficients with the parameter vector β to be estimated).
47. PD
- Next, we define β = (β_1, …, β_{G-1}).
- This holds our whole model.
- Now we want to estimate β using MLE (Maximum Likelihood Estimation).
- We'll describe how to do that.
48. PD
- Defining a notation: π_ir = π_r(x_i), the probability that sample i belongs to class r.
- Now, rewriting the formula from two slides back in this notation.
- So, by taking the log, we get the log-probabilities.
- Next, define a row of indicators for a sample: y_i = (y_i1, …, y_iG),
- where y_ir is 0 or 1 and Σ_r y_ir = 1,
- and where y_ir states whether sample i belongs to cancer type r.
49. PD
- Now, define a matrix Y whose i-th row is the indicator row y_i.
- Notice that every row of Y sums to 1, meaning that in every row of Y the sample was classified to exactly one cancer class.
- Using these notations, we conclude that the likelihood for N independent samples is L(β) = Π_{i=1..N} Π_{r=1..G} π_ir^(y_ir).
50. PD
- Taking the log, we get the log-likelihood (which is easier to compute): l(β) = Σ_{i=1..N} Σ_{r=1..G} y_ir · log(π_ir).
51. PD
- Next, remembering that π_ir = exp(x_i β_r) / (1 + Σ_{s=1..G-1} exp(x_i β_s)), we get an explicit expression for l(β).
- Now, this expression can be maximized to achieve the MLE using the Newton-Raphson method.
- One of the cases in which the MLE exists is if there exists a vector separating the classes appropriately, where each index set identifies all the samples in class r (see the Albert and Anderson reference).
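To illustrate the Newton-Raphson step on a likelihood, here is a one-parameter example of ours (an exponential model), not the article's multinomial likelihood:

```python
def newton_raphson(grad, hess, x0, tol=1e-10, max_iter=50):
    """Find a stationary point from the gradient and Hessian (1-D case):
    x_{k+1} = x_k - grad(x_k) / hess(x_k)."""
    x = x0
    for _ in range(max_iter):
        step = grad(x) / hess(x)
        x = x - step
        if abs(step) < tol:                 # converged
            break
    return x

# Log-likelihood l(lam) = 3*log(lam) - 6*lam (3 exponential observations
# summing to 6); its maximum is at lam = 3/6 = 0.5.
lam_hat = newton_raphson(grad=lambda l: 3 / l - 6,
                         hess=lambda l: -3 / l ** 2,
                         x0=0.4)
```

Because the log-likelihood is concave, the iteration converges quadratically to the MLE from a reasonable starting point; the multinomial case works the same way with the gradient vector and Hessian matrix of l(β).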
52. Appendix References
- Article appendices: http://dnguyen.ucdavis.edu/.html/SUP_cla2/SupplementalAppendix.pdf
- Newton-Raphson method: http://en.wikipedia.org/wiki/Newton-Raphson_method
- "On the Existence of Maximum Likelihood Estimates in Logistic Regression Models" (A. Albert and J. A. Anderson, 1984): http://www.qiji.cn/eprint/abs/2376.html
53. Abstract
This presentation deals with multi-class cancer classification: the process of classifying samples into multiple types of cancer. The article describes a 3-phase algorithm scheme to demonstrate the classification. The 3 phases are gene selection, dimension reduction, and classification. We present one example of a gene selection method, one example of a dimension reduction method (MPLS), and two classification methods (PD and QDA), which we then compare. The presentation also presents concepts like class, multi-class, t-test, and LOOCV.