Title: Computational Diagnosis
1 Sample classification using Microarray Data
2 We have two sample entities, A and B
- malignant vs. benign tumor
- patient responding to a drug vs. patient resistant to the drug
- etc.
3 From a tissue database we get biopsies for both entities, A and B.
4 We do expression profiling ...
[Figure: tissue sample -> DNA chip -> expression profile]
5 ... and the situation looks like this
What about differences between the profiles of A and B? Do they exist? What do they mean?
6 What characteristics does this data have?
7 This is a statistical question ...
... and the answer comes from biology.
8 The data describes a huge and complex interaction network ... that we do not know.
9 Expression profiles describe states of this network.
10 Diseases can (should) be understood (defined) as characteristic states of this network.
11 This data
- is very high dimensional
- consists of highly dependent variables (genes)
12 Back to the statistical classification problem (entities A and B)
What is the problem with high dimensional, dependent data?
13 A two gene scenario, where everything works out fine
[Figure: samples from A and B in the plane of two genes, plus a new patient to be classified]
14 And here things go wrong
Problem 1: No separating line
Problem 2: Too many separating lines
15 And in 30000 dimensional spaces ... (genes 1, 2, 3, ..., 30000)
16
- Problem 1 never exists!
- Problem 2 exists almost always!
Spend a minute thinking about this in three dimensions: there are three genes, two patients with known diagnosis, one patient of unknown diagnosis, and separating planes instead of lines.
OK, if all points fall onto one line it does not always work. However, for measured values this is very unlikely and never happens in practice.
17 In summary: there is always a linear signature separating the entities ... a biological reason for this is not needed. Hence, if you find a separating signature, it does not (yet) mean that you have a nice publication ... in most cases it means nothing.
18 In general, a separating linear signature is of no significance at all.
Take it easy! There are meaningful differences in gene expression ... we have built our science on them ... and they should be reflected in the profiles.
There are good signatures and bad ones!
19 Strategies for finding good signatures ... statistical learning
- Gene selection
- Factor regression
- Support vector machines (machine learning)
- Informative priors (Bayesian statistics)
- ...
What are the ideas?
20 Gene selection: Why does it help?
When considering all possible linear planes for separating the patient groups, we always find one that fits perfectly, without any biological reason for this. When considering only planes that depend on at most 20 genes, it is not guaranteed that we find a well fitting signature. If it exists in spite of this, chances are better that it reflects a transcriptional disorder. If we require in addition that the genes are all good classifiers themselves, i.e. we find them by screening with the t-score, finding a separating signature is even more exceptional. A sketch of this screening-then-fitting idea follows below.
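A minimal sketch in Python of t-score screening followed by a linear fit on the selected genes. The toy data, the choice of 20 genes, and the use of scikit-learn's LogisticRegression are illustrative assumptions, not the exact setup of the original slides.

# Screening: score each gene on its own, keep the top 20, fit a linear model on them.
import numpy as np
from scipy import stats
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 30000))           # toy data: 40 patients, 30000 genes
y = np.array([0] * 20 + [1] * 20)          # two entities A (0) and B (1)

# Single gene association: t-score of each gene between the two groups.
t_scores, _ = stats.ttest_ind(X[y == 0], X[y == 1], axis=0)
top20 = np.argsort(-np.abs(t_scores))[:20]

# Fit a linear signature that depends only on these 20 screened genes.
model = LogisticRegression().fit(X[:, top20], y)
print("training accuracy on 20 screened genes:", model.score(X[:, top20], y))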
21 Gene selection: What is it?
- Choose a small number of genes, say 20, and then fit a model using only these genes.
- How to pick genes:
- Screening (e.g. t-score): single gene association
- Wrapping: multiple gene association
- 1. Choose the best classifying single gene g1.
- 2. Choose the optimal complementing gene g2, so that g1 and g2 together are optimal for classification.
- 3. etc. A greedy wrapping sketch follows below.
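A minimal sketch of the wrapping idea as greedy forward selection. The scoring by 5-fold cross-validated accuracy and the use of LogisticRegression are illustrative assumptions; in practice you would run this on a modest pre-screened candidate set, not on all 30000 genes.

# Wrapping: repeatedly add the gene that best complements the genes chosen so far.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def wrap_select(X, y, n_genes=5):
    selected = []
    remaining = list(range(X.shape[1]))
    for _ in range(n_genes):
        best_gene, best_score = None, -np.inf
        for g in remaining:
            cols = selected + [g]
            score = cross_val_score(LogisticRegression(), X[:, cols], y, cv=5).mean()
            if score > best_score:
                best_gene, best_score = g, score
        selected.append(best_gene)
        remaining.remove(best_gene)
    return selected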
22 Tradeoffs
How many genes:
- all linear planes
- all linear planes depending on at most 20 genes
- all linear planes depending on a given set of 20 genes
How you find them:
- wrapping (multiple gene association)
- screening (single gene association)
Moving down each list, you go from a high probability of finding a fitting signature but a low probability that the signature is meaningful, to a low probability of finding a fitting signature but a high probability that the signature is meaningful.
23 More generally
A flexible model gives a high probability of finding a fitting signature but a low probability that the signature is meaningful; a smooth model gives a low probability of finding a fitting signature but a high probability that the signature is meaningful.
24 Factor regression
For example PCA, SVD ...
Use only the first n (3-4) factors ... this does not work well with expression data, at least not with a small number of samples.
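A minimal sketch of factor (principal component) regression: project the expression matrix onto a few components and classify in that low dimensional space. The pipeline components and the choice of 3 factors are illustrative assumptions.

# Principal component regression: PCA to 3 "super-genes", then a linear classifier.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(1)
X = rng.normal(size=(40, 30000))
y = np.array([0] * 20 + [1] * 20)

pcr = make_pipeline(PCA(n_components=3), LogisticRegression())
pcr.fit(X, y)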
25 Support vector machines: large margin classifiers
Fat planes: with an infinitely thin plane the data can always be separated correctly, but not necessarily with a fat one. Again, if a large margin separation exists, chances are good that we have found something relevant.
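A minimal sketch of a linear, large margin classifier. The value of C and the toy data shapes are illustrative assumptions; smaller C asks for a wider ("fatter") margin at the cost of allowing some misclassification.

# Linear SVM: the separating plane with the largest margin to the data points.
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(2)
X = rng.normal(size=(40, 30000))
y = np.array([0] * 20 + [1] * 20)

svm = LinearSVC(C=0.1).fit(X, y)
margin_width = 2.0 / np.linalg.norm(svm.coef_)   # width of the separating slab
print("margin width:", margin_width)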
26 Informative priors
[Figure: likelihood, prior, posterior]
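The relation between these three quantities is Bayes' rule (a standard identity, not specific to these slides); writing w for the signature parameters and D for the observed profiles:

p(w \mid D) \;=\; \frac{p(D \mid w)\, p(w)}{p(D)} \;\propto\; p(D \mid w)\, p(w)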
27 First: singular value decomposition
Data = Loadings x Singular values x Expression levels of super-genes (orthogonal matrix), i.e. X = U D V^T.
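A minimal sketch of this decomposition with NumPy; the data shapes are illustrative assumptions.

# SVD of the expression matrix: rows of Vt are the "super-genes",
# U * d gives each sample's expression levels of these super-genes.
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(40, 30000))                 # samples x genes
U, d, Vt = np.linalg.svd(X, full_matrices=False)
super_gene_expression = U * d                    # (40, 40): samples x super-genes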
28 Keep all super-genes: the prior needs to be designed in n dimensions
- n = number of samples
- Shape?
- Center?
- Orientation?
- Not too narrow ... not too wide
29 Shape
Multidimensional normal, for simplicity.
30 Center
Assumptions on the model correspond to assumptions on the diagnosis.
31 Orientation
Orthogonal super-genes!
32 Not too narrow ... not too wide
Auto adjusting model: scales are hyperparameters with their own priors.
33 What are the additional assumptions that came in through the prior?
- The model cannot be dominated by only a few super-genes (genes!)
- The diagnosis is based on global changes in the expression profiles, influenced by many genes
- The assumptions are neutral with respect to the individual diagnosis
34 A common idea behind all models ...
All models confine the set of possible signatures a priori; however, they do it in different ways. Gene selection aims for few genes in the signature. SVMs go for large margins between data points and the separating hyperplane. PC regression confines the signature to 3-4 independent factors. The Bayesian model prefers models with small weights (a la ridge regression or weight decay). A small ridge sketch follows below.
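A minimal sketch of the "small weights" idea: an L2 penalty (ridge / weight decay) keeps any single gene from dominating the signature. The model choice and the penalty strength alpha are illustrative assumptions.

# Ridge classifier: larger alpha shrinks the gene weights toward zero.
import numpy as np
from sklearn.linear_model import RidgeClassifier

rng = np.random.default_rng(4)
X = rng.normal(size=(40, 30000))
y = np.array([0] * 20 + [1] * 20)

ridge = RidgeClassifier(alpha=10.0).fit(X, y)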
35 ... and a common problem of all models: the bias-variance trade-off
- Model complexity
- maximum number of genes
- minimal margin
- width of the prior
- etc.
36 How come?
37 Population mean
Genes have a certain mean expression and correlation in the population.
38 Sample mean
We observe average expression and empirical correlation.
39 Fitted model
40 Regularization
41 How much regularization do we need?
- The Bayesian answer: what you do not know is a random variable ... regularization becomes part of the model
- Or model selection by evaluation ...
42 Model selection with separate data
[Figure: samples split 100 / 50 / 50 into training, test, and selection sets]
Split off some samples for model selection. Train the model on the training data with different choices for the regularization parameter. Apply it to the selection data and optimize this parameter (model selection). Test how well you are doing on the test data (model assessment). A sketch of this split follows below.
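A minimal sketch of the train / selection / test procedure. The 100/50/50 sizes follow the slide; the model (RidgeClassifier) and the alpha grid are illustrative assumptions.

# Tune the regularization parameter on the selection set, assess once on the test set.
import numpy as np
from sklearn.linear_model import RidgeClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(5)
X = rng.normal(size=(200, 30000))
y = np.array([0] * 100 + [1] * 100)

X_train, X_rest, y_train, y_rest = train_test_split(X, y, train_size=100, random_state=0)
X_sel, X_test, y_sel, y_test = train_test_split(X_rest, y_rest, train_size=50, random_state=0)

# Model selection: pick the alpha that does best on the selection set.
best_alpha = max([0.1, 1.0, 10.0, 100.0],
                 key=lambda a: RidgeClassifier(alpha=a).fit(X_train, y_train).score(X_sel, y_sel))

# Model assessment: look at the test set exactly once.
final = RidgeClassifier(alpha=best_alpha).fit(X_train, y_train)
print("test accuracy:", final.score(X_test, y_test))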
43 10-fold cross-validation
[Figure: training data split into 10 folds; in each round 9 folds are used for training and 1 for selection]
Chop up the training data (don't touch the test data) into 10 sets. Train on 9 of them and predict the other. Iterate, leaving every set out once. Select a model according to the prediction error (deviance). A sketch follows below.
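A minimal sketch of choosing the regularization strength by 10-fold cross-validation on the training data only. The model and the alpha grid are illustrative assumptions.

# 10-fold CV: average prediction accuracy over the 10 left-out folds for each alpha.
import numpy as np
from sklearn.linear_model import RidgeClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(6)
X_train = rng.normal(size=(100, 30000))
y_train = np.array([0] * 50 + [1] * 50)

best_alpha = max([0.1, 1.0, 10.0, 100.0],
                 key=lambda a: cross_val_score(RidgeClassifier(alpha=a),
                                               X_train, y_train, cv=10).mean())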
44 Leave-one-out cross-validation
[Figure: in each round a single sample is left out for selection and the rest are used for training]
Essentially the same, but you only leave one sample out at a time and predict it using the others. Good for small training sets.
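A minimal sketch of the leave-one-out variant, with as many folds as training samples; model and toy data are illustrative assumptions.

# Leave-one-out CV: each sample is predicted from a model trained on all the others.
import numpy as np
from sklearn.linear_model import RidgeClassifier
from sklearn.model_selection import LeaveOneOut, cross_val_score

rng = np.random.default_rng(7)
X_train = rng.normal(size=(20, 30000))     # small training set
y_train = np.array([0] * 10 + [1] * 10)

scores = cross_val_score(RidgeClassifier(alpha=1.0), X_train, y_train, cv=LeaveOneOut())
print("leave-one-out accuracy:", scores.mean())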
45 Model assessment
How well did I do? Can I use my signature for clinical diagnosis? How well will it perform? How does it compare to traditional methods?
46 The most important thing: don't fool yourself! (... and others)
This guy (and others) thought for some time he could predict the nodal status of a breast tumor from a profile taken from the primary tumor!
... there are significant differences, but not good enough for prediction (West et al., PNAS 2001).
47 DOs AND DON'Ts
1. Decide on your diagnosis model (PAM, SVM, etc. ...) and don't change your mind later on.
2. Split your profiles randomly into a training set and a test set.
3. Put the data in the test set away.
4. Train your model only using the data in the training set (select genes, define centroids, calculate normal vectors for large margin separators, perform model selection ...). Don't even think of touching the test data at this time.
5. Apply the model to the test data ... don't even think of changing the model at this time.
6. Do steps 1-5 only once and accept the result ... don't even think of optimizing this procedure.
48 The selection bias
- You cannot select 20 genes using all your data, then split test and training data with these 20 genes and evaluate your method.
- There is a difference between a model that restricts signatures to depend on only 20 genes and a data set that only contains 20 genes.
- Your model assessment will look much better than it should.
A sketch of the correct (nested) procedure follows below.
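A minimal sketch of how to avoid the selection bias: gene selection is made part of the model (here via a scikit-learn Pipeline), so it sees only the training data. The components and split sizes are illustrative assumptions.

# Correct procedure: select the 20 genes inside the training data only.
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(8)
X = rng.normal(size=(100, 30000))
y = np.array([0] * 50 + [1] * 50)          # labels unrelated to the data

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=0)

model = make_pipeline(SelectKBest(f_classif, k=20), LogisticRegression())
model.fit(X_train, y_train)                 # screening happens here, on training data only
print("honest test accuracy:", model.score(X_test, y_test))   # ~0.5 for unrelated labels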