Title: Differentially Expressed Genes, Class Discovery
1. Differentially Expressed Genes, Class Discovery and Classification
2. Finding Differentially Expressed Genes
- Two types of motivation
- Direct
- Relate the genes to known biological functions, pathways, etc. Infer their role, the mechanisms governing the process, etc.
- Indirect: use as a pruning stage for tools that perform learning tasks
- Infer regulatory mechanisms and relations
- Classification (disease vs. normal, disease subtypes)
3. Example: Tumor vs. Normal Tissues
[Figure: normal samples vs. tumor samples, with the under-expressed genes marked. Data: non-small cell lung carcinomas; Sheba Medical Center, U. of Colorado Medical Center]
- Identify differentially expressed genes
- Diagnostic markers
- Therapeutic targets
- Understanding the disease process
4. What We Need
- Score the genes, hopefully in a meaningful way.
- Attach a measure of statistical significance to the score so we can:
- Choose a subset of genes wisely
- Have a measure of how strong our signal is
5. Simplest Score: Fold Change
6. Fold Change: Problems
- Not reliable at the low end of the scale
- (0/0 effects, large variance)
- Sensitive to outliers
- Variant: pairwise fold change
- Compute the fold change over all possible sample pairs
- If in, e.g., 75% of the pairs the change is > D, call the gene significant
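A minimal sketch of the pairwise variant, under stated assumptions: "all possible sample pairs" is read here as every (tumor, normal) pair, expression values are strictly positive, and the names `pairwise_fold_change_fraction`, `is_significant` and the default `d=2.0` are hypothetical. Only the 75% cutoff comes from the slide.

```python
from itertools import product

def pairwise_fold_change_fraction(group_a, group_b, d):
    """Fraction of (a, b) sample pairs whose fold change a/b exceeds d.

    Assumes strictly positive expression values (hypothetical input format).
    """
    pairs = list(product(group_a, group_b))
    hits = sum(1 for a, b in pairs if a / b > d)
    return hits / len(pairs)

def is_significant(group_a, group_b, d=2.0, min_fraction=0.75):
    """Call the gene significant if enough pairs show a fold change above d."""
    return pairwise_fold_change_fraction(group_a, group_b, d) >= min_fraction

# Toy data (made up): the last tumor sample is an outlier.
tumor = [8.0, 9.0, 10.0, 1.0]
normal = [2.0, 2.5, 3.0]
print(pairwise_fold_change_fraction(tumor, normal, d=2.0))
```

Because every pair votes separately, a single outlier sample can only spoil the pairs it takes part in, which is the point of the variant.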
7. Relevance Scores: TNoM
- Beyond fold change
- Both genes have >15-fold change
- TNoM (Total Number of Misclassifications) score
- Find the threshold that best separates tumors from normals,
- count the number of errors committed there.
[Figure: expression of two example genes, samples labeled tumor / normal]
8. Scoring Informative Genes
Expression pattern of a gene: a. Pathological diagnosis information (annotation): L. Together they give v(a,L), a vector of +s and -s, ordered by the a values.
[Figure: the +/- label vector over the sorted values a1 a2 ... a15]
9. TNoM Score
Find the threshold that best separates tumors from normals, and count the number of errors committed there.
[Figure: Ex. 1, a +/- label vector]
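The definition above can be sketched as follows. This is an illustrative implementation, not the ScoreGene code: it assumes distinct expression values, labels given as '+' / '-', and that the threshold rule may point in either direction (tumors high or tumors low). The function name `tnom` is hypothetical.

```python
def tnom(values, labels):
    """Total Number of Misclassifications for one gene.

    values: expression values (assumed distinct); labels: '+' or '-' per sample.
    Tries every threshold position and both orientations of the rule.
    """
    ordered = [lab for _, lab in sorted(zip(values, labels))]
    n_plus = ordered.count('+')
    n_minus = len(ordered) - n_plus
    # Threshold below all samples: everything gets one label.
    best = min(n_plus, n_minus)
    plus_left = minus_left = 0
    for lab in ordered:
        if lab == '+':
            plus_left += 1
        else:
            minus_left += 1
        # Rule A: predict '+' left of the threshold, '-' right of it.
        errors_a = minus_left + (n_plus - plus_left)
        # Rule B: predict '-' left of the threshold, '+' right of it.
        errors_b = plus_left + (n_minus - minus_left)
        best = min(best, errors_a, errors_b)
    return best

print(tnom([1, 2, 3, 4], ['+', '+', '-', '-']))  # perfectly separated gene
```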
10. TNoM vs. Fold Change
11. TNoM
- Cons
- One-sided vs. two-sided errors
- Absolute values ignored
- For any given level s, we can efficiently compute p-Val(s) = Prob( TNoM(V) ≤ s ), where V is uniformly drawn over the appropriate space.
- (H0: the gene expression values are independent of the labels)
- Computed using DP
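To make p-Val(s) concrete, here is a brute-force sketch. It is not the DP mentioned on the slide: it simply enumerates every label vector, which is only feasible for small n. It assumes "the appropriate space" means all vectors with the same number k of positives among n samples; the function names are hypothetical.

```python
from itertools import combinations
from math import comb

def tnom_of_labels(ordered):
    """TNoM of a 0/1 label vector already sorted by expression value."""
    n_pos = sum(ordered)
    n_neg = len(ordered) - n_pos
    best = min(n_pos, n_neg)
    pos_left = 0
    for i, lab in enumerate(ordered, start=1):
        pos_left += lab
        neg_left = i - pos_left
        best = min(best,
                   neg_left + (n_pos - pos_left),   # positives predicted on the left
                   pos_left + (n_neg - neg_left))   # negatives predicted on the left
    return best

def tnom_p_value(s, n, k):
    """Prob(TNoM(V) <= s) for V uniform over vectors with k positives out of n."""
    count = 0
    for positions in combinations(range(n), k):
        labels = [0] * n
        for pos in positions:
            labels[pos] = 1
        if tnom_of_labels(labels) <= s:
            count += 1
    return count / comb(n, k)

print(tnom_p_value(0, 4, 2))  # chance of perfect separation with 2 vs. 2 samples
```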
12. Wilcoxon Rank Test
- Another gene score, which, similarly to TNoM:
- Ignores absolute values
- Takes into account only the order of measurements
- Sort the expression values of both groups
- Labels over the sorted values a1 a2 ... a15: + + - - + + + - - + - - + + -
- W(g) = sum of the ranks of the positive examples
- W(g) = 1 + 2 + 5 + 6 + 7 + 10 + 13 + 14 = 58
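The rank sum can be checked with a few lines of code. This sketch ignores ties, and the label vector is the one implied by the ranks listed above (positives at ranks 1, 2, 5, 6, 7, 10, 13, 14 out of 15); the expression values are placeholders and `wilcoxon_rank_sum` is a hypothetical name.

```python
def wilcoxon_rank_sum(values, labels, positive='+'):
    """Sum of the 1-based ranks of the positive samples (no tie handling)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    return sum(rank for rank, i in enumerate(order, start=1)
               if labels[i] == positive)

labels = list("++--+++--+--++-")
values = list(range(1, 16))   # placeholder values with a1 < a2 < ... < a15
print(wilcoxon_rank_sum(values, labels))  # 58
```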
13. Wilcoxon Rank Test
- A common test in statistics
- Again, we can compute p-values given the null hypothesis H0
- P(W(g) > s | n, k): the probability of getting a score > s given a total of n samples, out of which k are labeled as (+).
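One way to obtain this probability exactly is to count, by dynamic programming, how many k-element subsets of the ranks 1..n have each possible sum. The sketch below assumes that under H0 every choice of k positive positions is equally likely and that there are no ties; `wilcoxon_p_value` is a hypothetical name.

```python
from math import comb

def wilcoxon_p_value(s, n, k):
    """P(W > s | n, k): rank sum of k positives among ranks 1..n, under H0."""
    max_sum = n * (n + 1) // 2
    # ways[j][t] = number of j-element subsets of the ranks seen so far with sum t
    ways = [[0] * (max_sum + 1) for _ in range(k + 1)]
    ways[0][0] = 1
    for rank in range(1, n + 1):
        # Descending j so that each rank is used at most once per subset.
        for j in range(min(k, rank), 0, -1):
            for t in range(rank, max_sum + 1):
                ways[j][t] += ways[j - 1][t - rank]
    favorable = sum(ways[k][t] for t in range(max_sum + 1) if t > s)
    return favorable / comb(n, k)

print(wilcoxon_p_value(58, 15, 8))  # tail probability for the example gene
```

For large n a normal approximation is the usual shortcut; the exact count is kept here because sample sizes in these data sets are small.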
14. SAM (Tusher et al., PNAS '01)
- Where a = (1/n1 + 1/n2)/(n1 + n2 - 2)
- d(i) is exactly the two-sample (pooled-variance) t-statistic
- Tests the assumption: are the means of the two processes the same?
- Underlying assumption: two normal distributions
- A known p-value: the t-distribution
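A sketch of the statistic, assuming the form given in Tusher et al.: d(i) = (mean1 - mean2) / (s(i) + s0), with s(i) the square root of a times the summed squared deviations of both groups from their own means. The constant `s0` is the small positive term from that paper; with `s0 = 0` the value reduces to the pooled-variance two-sample t-statistic. The function name `sam_d` is hypothetical.

```python
from math import sqrt

def sam_d(group1, group2, s0=0.0):
    """d = (mean1 - mean2) / (s + s0), with s built from the factor a above."""
    n1, n2 = len(group1), len(group2)
    mean1 = sum(group1) / n1
    mean2 = sum(group2) / n2
    a = (1 / n1 + 1 / n2) / (n1 + n2 - 2)
    squared_dev = (sum((x - mean1) ** 2 for x in group1)
                   + sum((x - mean2) ** 2 for x in group2))
    s = sqrt(a * squared_dev)
    return (mean1 - mean2) / (s + s0)

print(sam_d([1.0, 2.0, 3.0], [4.0, 5.0, 6.0]))          # plain t-statistic
print(sam_d([1.0, 2.0, 3.0], [4.0, 5.0, 6.0], s0=1.0))  # shrunk towards zero
```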
15. SAM: Alternative to P-Value
- P-value relies on t-test assumptions: problematic
- Can we assess the significance of d(i) without parametric assumptions?
- Define a balanced permutation: a division of the samples into 2 groups, where in each group the number of + and - is balanced
- Perform all possible balanced permutations p on the data and compute d(i) for each permutation
16. False Discovery Rate for SAM
- Genes with D above a given threshold: significant
- FDR (false discovery rate): the % of genes passing as significant which are expected to be false positives
- Each threshold on D(i) can be given an FDR value:
- compute the avg. number of FP crossing this threshold in the permuted sets
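The FDR estimate can be sketched as below. This is a simplification, not the SAM procedure itself: it draws random label shuffles instead of enumerating all balanced permutations, thresholds the absolute value of the statistic, and uses a hypothetical fudge constant `s0 = 0.1`. All function and parameter names are assumptions.

```python
import random
from math import sqrt

def t_like(values, labels, s0=0.1):
    """Pooled-variance two-sample statistic for one gene (labels are 0/1)."""
    g1 = [v for v, lab in zip(values, labels) if lab == 1]
    g2 = [v for v, lab in zip(values, labels) if lab == 0]
    n1, n2 = len(g1), len(g2)
    m1, m2 = sum(g1) / n1, sum(g2) / n2
    ss = sum((x - m1) ** 2 for x in g1) + sum((x - m2) ** 2 for x in g2)
    s = sqrt((1 / n1 + 1 / n2) / (n1 + n2 - 2) * ss)
    return (m1 - m2) / (s + s0)

def estimate_fdr(genes, labels, threshold, n_perm=200, seed=0):
    """Avg. number of genes crossing the threshold in permuted data,
    divided by the number crossing it in the real data."""
    rng = random.Random(seed)
    called = sum(1 for g in genes if abs(t_like(g, labels)) > threshold)
    if called == 0:
        return 0.0
    total_fp = 0
    for _ in range(n_perm):
        shuffled = labels[:]
        rng.shuffle(shuffled)   # random relabeling stands in for a balanced permutation
        total_fp += sum(1 for g in genes if abs(t_like(g, shuffled)) > threshold)
    return min(1.0, (total_fp / n_perm) / called)

# Toy data (made up): one clearly separated gene and one flat gene.
labels = [1, 1, 1, 1, 0, 0, 0, 0]
genes = [[10, 11, 12, 13, 1, 2, 3, 4], [1, 2, 1, 2, 1, 2, 1, 2]]
print(estimate_fdr(genes, labels, threshold=3.0))
```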
17. Different Scores
- TNoM
- Info
- Wilcoxon
- t-Test
- Fold Change
Different scores and null hypotheses (parametric, non-parametric, etc.). All can be found in the ScoreGene package: http://www.cs.huji.ac.il/labs/compbio/scoregenes/
Can we assess which scoring method is the best for our case?
18. Overabundance Analysis
- Data on 30 samples from normal and tumor lung tissues.
- 7000 genes.
- Naftali Kaminski's lab, Sheba Medical Center
19. Why Test Overabundance?
- Tests how informative a set of genes is w.r.t. a given classification of the data and a scoring method.
- Can be used to compare different:
- gene scoring methods
- normalization methods
20. Comparing Normalization Methods
21. Why Test Overabundance?
- But also a method to discover new classes in the data
- Intuition: biologically meaningful partitions will have a high overabundance of informative genes
22. Overabundance Analysis in Class Discovery
- Score genes
- Count
- Compare to random
[Figure panels: AML/ALL, BRCA1/2, Melanoma]
23. Class Discovery Approach
Seek partitions with statistically significant overabundance of informative genes
- Use local search techniques, e.g.:
- Steepest ascent
- Simulated annealing
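A minimal steepest-ascent sketch over two-class partitions. The neighborhood (flip the class of one sample at a time) and the pluggable `score` function are assumptions; in the setting of these slides `score` would be an overabundance score of the partition, but a toy score is used here so the example runs on its own.

```python
def steepest_ascent(n_samples, score, start=None):
    """Local search over 0/1 labelings; `score` maps a labeling to a number.

    At each step, evaluate every single-sample flip and move to the best one;
    stop when no flip improves the score (a local optimum).
    """
    current = list(start) if start is not None else [i % 2 for i in range(n_samples)]
    current_score = score(current)
    while True:
        best_neighbor, best_score = None, current_score
        for i in range(n_samples):
            neighbor = current[:]
            neighbor[i] = 1 - neighbor[i]
            s = score(neighbor)
            if s > best_score:
                best_neighbor, best_score = neighbor, s
        if best_neighbor is None:
            return current, current_score
        current, current_score = best_neighbor, best_score

# Toy score (made up): agreement with a hidden partition.
hidden = [1, 1, 0, 0, 1, 0]
toy_score = lambda labs: sum(a == b for a, b in zip(labs, hidden))
print(steepest_ascent(6, toy_score))
```

Steepest ascent stops at the first local optimum; simulated annealing differs by sometimes accepting a worse neighbor, which helps escape such optima.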
24. Scoring a Partition
- At a given score level s, set p = p-Val(s).
- Suppose that in the data we observe n(s) genes with score ≤ s.
- The number of genes with score ≤ s we observe for uniformly and independently drawn labeling vectors is a random variable N(s) with N(s) ~ Binom(n, p), where n is the total number of genes.
- The surprise rate at s is defined as $\sigma(s) = \mathrm{Prob}(N(s) \ge n(s)) = \sum_{k=n(s)}^{n} \binom{n}{k} p^{k} (1-p)^{n-k}$.
- Finally, the max surprise score for the suggested partition is $\max_s \sigma(s)$.
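The binomial tail above is easy to compute directly. In this sketch lower scores are taken to be more informative (as for TNoM), `p_val` is any function returning p-Val(s), and "max surprise" is read as the level where the tail probability is smallest, reported as -log10 so that larger means more surprising. That reading, like all function names here, is an assumption.

```python
from math import comb, log10

def binom_tail(n, n_obs, p):
    """Prob(N >= n_obs) for N ~ Binom(n, p)."""
    return sum(comb(n, k) * p ** k * (1 - p) ** (n - k)
               for k in range(n_obs, n + 1))

def tail_by_level(gene_scores, p_val):
    """Map each observed score level s to Prob(N(s) >= n(s))."""
    n = len(gene_scores)
    result = {}
    for s in sorted(set(gene_scores)):
        n_s = sum(1 for g in gene_scores if g <= s)   # n(s): genes with score <= s
        result[s] = binom_tail(n, n_s, p_val(s))
    return result

def most_surprising_level(gene_scores, p_val):
    """Level with the smallest tail probability, and its -log10 value."""
    tails = tail_by_level(gene_scores, p_val)
    s_best = min(tails, key=tails.get)
    surprise = -log10(tails[s_best]) if tails[s_best] > 0 else float('inf')
    return s_best, surprise

# Toy input (made up): 10 genes and a small p-value table.
scores = [0, 0, 0, 3, 4, 4, 5, 5, 5, 5]
table = {0: 0.01, 3: 0.3, 4: 0.6, 5: 1.0}
print(most_surprising_level(scores, table.get))
```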
25. Overabundance: Max-Surprise
26. Example: Survival Prediction
[Figure legend: Good Prognosis Patients; All Patients]
27. Class 2
[Figure legend: Good Prognosis Patients; All Patients]
28. Class 3
[Figure legend: Good Prognosis Patients; All Patients]
29. Tissue Classification
- Given a set of labeled samples, we can try to classify a new sample
- Supervised methods: SVM, AdaBoost, Naïve Bayes
- Semi-supervised methods: Clustering
- Issues
- Evaluating the methods
- Feature selection
- Sample contamination/composition
30. Evaluating Classification
- LOOCV: leave-one-out cross-validation
- For all samples i = 1..M
- Take sample i out
- Learn from the M-1 remaining samples
- Test on sample i
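The loop above, sketched with a toy nearest-centroid classifier. The classifier is an assumption chosen only to keep the example self-contained; any of the methods named earlier (SVM, AdaBoost, Naïve Bayes) could take its place. All function names are hypothetical.

```python
def nearest_centroid_fit(samples, labels):
    """Toy classifier: store the mean expression vector of each class."""
    centroids = {}
    for lab in set(labels):
        members = [x for x, l in zip(samples, labels) if l == lab]
        centroids[lab] = [sum(col) / len(members) for col in zip(*members)]
    return centroids

def nearest_centroid_predict(centroids, x):
    """Return the class whose centroid is closest to x (squared Euclidean)."""
    def dist2(c):
        return sum((a - b) ** 2 for a, b in zip(x, c))
    return min(centroids, key=lambda lab: dist2(centroids[lab]))

def loocv_accuracy(samples, labels):
    """Leave-one-out cross-validation over samples i = 1..M."""
    correct = 0
    for i in range(len(samples)):
        train_x = samples[:i] + samples[i + 1:]           # take sample i out
        train_y = labels[:i] + labels[i + 1:]
        model = nearest_centroid_fit(train_x, train_y)    # learn from M-1 samples
        if nearest_centroid_predict(model, samples[i]) == labels[i]:   # test on i
            correct += 1
    return correct / len(samples)

# Toy data (made up): two genes, three samples per class.
samples = [[1.0, 1.2], [0.9, 1.0], [1.1, 0.8], [5.0, 5.1], [4.8, 5.3], [5.2, 4.9]]
labels = ['normal'] * 3 + ['tumor'] * 3
print(loocv_accuracy(samples, labels))
```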
31. Feature Selection
- How many of the informative genes do we choose for our classifier?
- A question of choosing a cutoff
32. Tissue Composition
[Figure panels: Small cell lung carcinoma; Lung adenocarcinoma; Serous carcinoma; Lung metastasis]
33. Tissue Composition
- The tissue is composed of many cell types (tumor, blood, muscle, ...)
- The arrayed samples are not always pure!
- Major difference: differentially expressed genes which are:
- Causes of the disease state
- Outcome of the disease state
34. Summary
- Many methods for choosing differentially expressed genes
- These can be compared, e.g. using overabundance tests
- Overabundance can also be used for new class discovery
- Expression patterns can be used to classify a tissue