Title: Computational Diagnosis
1 Sample classification using Microarray Data
2 We have two sample entities, A and B
- malignant vs. benign tumor
- patient responding to a drug vs. patient resistant to the drug
- etc.
3 From a tissue database we get biopsies for both entities, A and B.
4 We do expression profiling ...
[Figure: tissue sample -> DNA chip -> expression profile]
5 ... and the situation looks like this
What about differences between the profiles of A and B? Do they exist? What do they mean?
6 What characteristics does this data have?
7 This is a statistical question ...
... and the answer comes from biology.
8 The data describes a huge and complex interaction network ... that we do not know.
9 Expression profiles describe states of this network.
10 Diseases can (should) be understood (defined) as characteristic states of this network.
11 This data
- is very high dimensional
- consists of highly dependent variables (genes)
12 Back to the statistical classification problem (entities A and B)
What is the problem with high dimensional, dependent data?
13 A two gene scenario, where everything works out fine
[Figure: samples from A and B in the plane of two genes, plus a new patient to be classified]
14 And here things go wrong
Problem 1: No separating line
Problem 2: Too many separating lines
15 And in 30000 dimensional spaces ... (genes 1, 2, 3, ..., 30000)
16
- Problem 1 never exists!
- Problem 2 exists almost always!
Spend a minute thinking about this in three dimensions: there are three genes, two patients with known diagnosis, one patient of unknown diagnosis, and separating planes instead of lines.
OK, if all points fall onto one line it does not always work. However, for measured values this is very unlikely and never happens in practice.
17 In summary: there is always a linear signature separating the entities ... a biological reason for this is not needed. Hence, if you find a separating signature, it does not (yet) mean that you have a nice publication ... in most cases it means nothing.
18 In general, a separating linear signature is of no significance at all.
Take it easy! There are meaningful differences in gene expression ... we have built our science on them ... and they should be reflected in the profiles.
There are good signatures and bad ones!
19 Strategies for finding good signatures ... statistical learning
- Gene selection
- Factor regression
- Support vector machines (machine learning)
- Informative priors (Bayesian statistics)
- ...
What are the ideas?
20 Gene selection: Why does it help?
When considering all possible linear planes for separating the patient groups, we always find one that fits perfectly, without any biological reason for this. When considering only planes that depend on at most 20 genes, it is not guaranteed that we find a well fitting signature. If it exists in spite of this, chances are better that it reflects a transcriptional disorder. If we require in addition that the genes are all good classifiers themselves, i.e. we find them by screening with the t-score, finding a separating signature is even more exceptional. A sketch of this screening-then-fitting idea follows below.
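A minimal sketch in Python of t-score screening followed by a linear fit on the selected genes. The toy data, the choice of 20 genes, and the use of scikit-learn's LogisticRegression are illustrative assumptions, not the exact setup of the original slides.

# Screening: score each gene on its own, keep the top 20, fit a linear model on them.
import numpy as np
from scipy import stats
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 30000))           # toy data: 40 patients, 30000 genes
y = np.array([0] * 20 + [1] * 20)          # two entities A (0) and B (1)

# Single gene association: t-score of each gene between the two groups.
t_scores, _ = stats.ttest_ind(X[y == 0], X[y == 1], axis=0)
top20 = np.argsort(-np.abs(t_scores))[:20]

# Fit a linear signature that depends only on these 20 screened genes.
model = LogisticRegression().fit(X[:, top20], y)
print("training accuracy on 20 screened genes:", model.score(X[:, top20], y))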
21 Gene selection: What is it?
- Choose a small number of genes, say 20, and then fit a model using only these genes.
- How to pick genes:
- Screening (e.g. t-score): single gene association
- Wrapping: multiple gene association
- 1. Choose the best classifying single gene g1.
- 2. Choose the optimal complementing gene g2, so that g1 and g2 together are optimal for classification.
- 3. etc. A greedy wrapping sketch follows below.
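A minimal sketch of the wrapping idea as greedy forward selection. The scoring by 5-fold cross-validated accuracy and the use of LogisticRegression are illustrative assumptions; in practice you would run this on a modest pre-screened candidate set, not on all 30000 genes.

# Wrapping: repeatedly add the gene that best complements the genes chosen so far.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def wrap_select(X, y, n_genes=5):
    selected = []
    remaining = list(range(X.shape[1]))
    for _ in range(n_genes):
        best_gene, best_score = None, -np.inf
        for g in remaining:
            cols = selected + [g]
            score = cross_val_score(LogisticRegression(), X[:, cols], y, cv=5).mean()
            if score > best_score:
                best_gene, best_score = g, score
        selected.append(best_gene)
        remaining.remove(best_gene)
    return selected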
22 Tradeoffs
How many genes:
- all linear planes
- all linear planes depending on at most 20 genes
- all linear planes depending on a given set of 20 genes
How you find them:
- wrapping (multiple gene association)
- screening (single gene association)
Moving down each list, you go from a high probability of finding a fitting signature but a low probability that the signature is meaningful, to a low probability of finding a fitting signature but a high probability that the signature is meaningful.
23 More generally
A flexible model gives a high probability of finding a fitting signature but a low probability that the signature is meaningful; a smooth model gives a low probability of finding a fitting signature but a high probability that the signature is meaningful.
24 Factor regression
For example PCA, SVD ...
Use only the first n (3-4) factors ... this does not work well with expression data, at least not with a small number of samples.
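A minimal sketch of factor (principal component) regression: project the expression matrix onto a few components and classify in that low dimensional space. The pipeline components and the choice of 3 factors are illustrative assumptions.

# Principal component regression: PCA to 3 "super-genes", then a linear classifier.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(1)
X = rng.normal(size=(40, 30000))
y = np.array([0] * 20 + [1] * 20)

pcr = make_pipeline(PCA(n_components=3), LogisticRegression())
pcr.fit(X, y)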
25 Support vector machines: large margin classifiers
Fat planes: with an infinitely thin plane the data can always be separated correctly, but not necessarily with a fat one. Again, if a large margin separation exists, chances are good that we have found something relevant.
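A minimal sketch of a linear, large margin classifier. The value of C and the toy data shapes are illustrative assumptions; smaller C asks for a wider ("fatter") margin at the cost of allowing some misclassification.

# Linear SVM: the separating plane with the largest margin to the data points.
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(2)
X = rng.normal(size=(40, 30000))
y = np.array([0] * 20 + [1] * 20)

svm = LinearSVC(C=0.1).fit(X, y)
margin_width = 2.0 / np.linalg.norm(svm.coef_)   # width of the separating slab
print("margin width:", margin_width)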
26 Informative priors
[Figure: likelihood, prior, posterior]
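The relation between these three quantities is Bayes' rule (a standard identity, not specific to these slides); writing w for the signature parameters and D for the observed profiles:

p(w \mid D) \;=\; \frac{p(D \mid w)\, p(w)}{p(D)} \;\propto\; p(D \mid w)\, p(w)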
27 First: singular value decomposition
Data = Loadings x Singular values x Expression levels of super-genes (orthogonal matrix), i.e. X = U D V^T.
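A minimal sketch of this decomposition with NumPy; the data shapes are illustrative assumptions.

# SVD of the expression matrix: rows of Vt are the "super-genes",
# U * d gives each sample's expression levels of these super-genes.
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(40, 30000))                 # samples x genes
U, d, Vt = np.linalg.svd(X, full_matrices=False)
super_gene_expression = U * d                    # (40, 40): samples x super-genes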
28 Keep all super-genes: the prior needs to be designed in n dimensions
- n = number of samples
- Shape?
- Center?
- Orientation?
- Not too narrow ... not too wide
29 Shape
Multidimensional normal, for simplicity.
30 Center
Assumptions on the model correspond to assumptions on the diagnosis.
31 Orientation
Orthogonal super-genes!
32 Not too narrow ... not too wide
Auto adjusting model: scales are hyperparameters with their own priors.
33 What are the additional assumptions that came in through the prior?
- The model cannot be dominated by only a few super-genes (genes!)
- The diagnosis is based on global changes in the expression profiles, influenced by many genes
- The assumptions are neutral with respect to the individual diagnosis
34 A common idea behind all models ...
All models confine the set of possible signatures a priori; however, they do it in different ways. Gene selection aims for few genes in the signature. SVMs go for large margins between data points and the separating hyperplane. PC regression confines the signature to 3-4 independent factors. The Bayesian model prefers models with small weights (a la ridge regression or weight decay). A small ridge sketch follows below.
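A minimal sketch of the "small weights" idea: an L2 penalty (ridge / weight decay) keeps any single gene from dominating the signature. The model choice and the penalty strength alpha are illustrative assumptions.

# Ridge classifier: larger alpha shrinks the gene weights toward zero.
import numpy as np
from sklearn.linear_model import RidgeClassifier

rng = np.random.default_rng(4)
X = rng.normal(size=(40, 30000))
y = np.array([0] * 20 + [1] * 20)

ridge = RidgeClassifier(alpha=10.0).fit(X, y)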
35 ... and a common problem of all models: the bias-variance trade-off
- Model complexity
- maximum number of genes
- minimal margin
- width of the prior
- etc.
36 How come?
37 Population mean
Genes have a certain mean expression and correlation in the population.
38 Sample mean
We observe average expression and empirical correlation.
39 Fitted model
40 Regularization
41 How much regularization do we need?
- The Bayesian answer: what you do not know is a random variable ... regularization becomes part of the model
- Or model selection by evaluation ...
42 Model selection with separate data
[Figure: samples split 100 / 50 / 50 into training, test, and selection sets]
Split off some samples for model selection. Train the model on the training data with different choices for the regularization parameter. Apply it to the selection data and optimize this parameter (model selection). Test how well you are doing on the test data (model assessment). A sketch of this split follows below.
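A minimal sketch of the train / selection / test procedure. The 100/50/50 sizes follow the slide; the model (RidgeClassifier) and the alpha grid are illustrative assumptions.

# Tune the regularization parameter on the selection set, assess once on the test set.
import numpy as np
from sklearn.linear_model import RidgeClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(5)
X = rng.normal(size=(200, 30000))
y = np.array([0] * 100 + [1] * 100)

X_train, X_rest, y_train, y_rest = train_test_split(X, y, train_size=100, random_state=0)
X_sel, X_test, y_sel, y_test = train_test_split(X_rest, y_rest, train_size=50, random_state=0)

# Model selection: pick the alpha that does best on the selection set.
best_alpha = max([0.1, 1.0, 10.0, 100.0],
                 key=lambda a: RidgeClassifier(alpha=a).fit(X_train, y_train).score(X_sel, y_sel))

# Model assessment: look at the test set exactly once.
final = RidgeClassifier(alpha=best_alpha).fit(X_train, y_train)
print("test accuracy:", final.score(X_test, y_test))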
43 10-fold cross-validation
[Figure: training data split into 10 folds; in each round 9 folds are used for training and 1 for selection]
Chop up the training data (don't touch the test data) into 10 sets. Train on 9 of them and predict the other. Iterate, leaving every set out once. Select a model according to the prediction error (deviance). A sketch follows below.
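A minimal sketch of choosing the regularization strength by 10-fold cross-validation on the training data only. The model and the alpha grid are illustrative assumptions.

# 10-fold CV: average prediction accuracy over the 10 left-out folds for each alpha.
import numpy as np
from sklearn.linear_model import RidgeClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(6)
X_train = rng.normal(size=(100, 30000))
y_train = np.array([0] * 50 + [1] * 50)

best_alpha = max([0.1, 1.0, 10.0, 100.0],
                 key=lambda a: cross_val_score(RidgeClassifier(alpha=a),
                                               X_train, y_train, cv=10).mean())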
44 Leave-one-out cross-validation
[Figure: in each round a single sample is left out for selection and the rest are used for training]
Essentially the same, but you only leave one sample out at a time and predict it using the others. Good for small training sets.
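A minimal sketch of the leave-one-out variant, with as many folds as training samples; model and toy data are illustrative assumptions.

# Leave-one-out CV: each sample is predicted from a model trained on all the others.
import numpy as np
from sklearn.linear_model import RidgeClassifier
from sklearn.model_selection import LeaveOneOut, cross_val_score

rng = np.random.default_rng(7)
X_train = rng.normal(size=(20, 30000))     # small training set
y_train = np.array([0] * 10 + [1] * 10)

scores = cross_val_score(RidgeClassifier(alpha=1.0), X_train, y_train, cv=LeaveOneOut())
print("leave-one-out accuracy:", scores.mean())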
45 Model assessment
How well did I do? Can I use my signature for clinical diagnosis? How well will it perform? How does it compare to traditional methods?
46 The most important thing: don't fool yourself! (... and others)
This guy (and others) thought for some time he could predict the nodal status of a breast tumor from a profile taken from the primary tumor!
... there are significant differences, but not good enough for prediction (West et al., PNAS 2001).
47 DOs AND DON'Ts
1. Decide on your diagnosis model (PAM, SVM, etc. ...) and don't change your mind later on.
2. Split your profiles randomly into a training set and a test set.
3. Put the data in the test set away.
4. Train your model only using the data in the training set (select genes, define centroids, calculate normal vectors for large margin separators, perform model selection ...). Don't even think of touching the test data at this time.
5. Apply the model to the test data ... don't even think of changing the model at this time.
6. Do steps 1-5 only once and accept the result ... don't even think of optimizing this procedure.
48 The selection bias
- You cannot select 20 genes using all your data, then split test and training data with these 20 genes and evaluate your method.
- There is a difference between a model that restricts signatures to depend on only 20 genes and a data set that only contains 20 genes.
- Your model assessment will look much better than it should.
A sketch of the correct (nested) procedure follows below.
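A minimal sketch of how to avoid the selection bias: gene selection is made part of the model (here via a scikit-learn Pipeline), so it sees only the training data. The components and split sizes are illustrative assumptions.

# Correct procedure: select the 20 genes inside the training data only.
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(8)
X = rng.normal(size=(100, 30000))
y = np.array([0] * 50 + [1] * 50)          # labels unrelated to the data

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=0)

model = make_pipeline(SelectKBest(f_classif, k=20), LogisticRegression())
model.fit(X_train, y_train)                 # screening happens here, on training data only
print("honest test accuracy:", model.score(X_test, y_test))   # ~0.5 for unrelated labels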