Title: Analysis of Multiple Experiments TIGR Multiple Experiment Viewer MeV
1Analysis of Multiple Experiments: TIGR Multiple Experiment Viewer (MeV)
2Advanced Course Coverage
- Introduction
  - Fundamental concepts: expression vectors and distance metrics
  - Fundamental statistical concepts encountered in MeV analysis modules
- Algorithm Coverage
  - Lecture / hands-on exercises (refer to algorithm handout for order)
3Microarray Data Flow
Scheduler (Machine Scheduling)
SliTrack (Machine Control)
PCR Score
MABCOS (Barcode System)
Exp Designer
.tiff Image File
Spotfinder (Image Analysis)
MADAM (Data Manager)
Expression Data
Raw .tav File
Miner (.tav File Creator)
Raw .tav File
MIDAS (Normalization)
GenePix Converter
Normalized .tav File
Query Window
MeV (Data Analysis)
Interpretation
4The Expression Matrix is a representation of data
from multiple microarray experiments.
Each element is a log ratio, usually log2(Cy5 / Cy3).
Red indicates a positive log ratio, i.e., Cy5 > Cy3.
Green indicates a negative log ratio, i.e., Cy5 < Cy3.
Black indicates a log ratio of zero, i.e., Cy5 and Cy3 are very close in value.
Gray indicates missing data.
5Expression Vectors
- Gene expression vectors encapsulate the expression of a gene over a set of experimental conditions or sample types.
- Each element of the vector is a log ratio, log2(Cy5 / Cy3).
6Expression Vectors As Points inExpression Space
        Exp 1   Exp 2   Exp 3
G1      -0.8    -0.3    -0.7
G2      -0.8    -0.7    -0.4
G3      -0.4    -0.6    -0.8
G4       0.9     1.2     1.3
G5       1.3     0.9    -0.6
Label on the slide: Similar Expression
[Figure: the genes plotted as points on three axes, Experiment 1, Experiment 2 and Experiment 3]
7Distance and Similarity
- The ability to calculate a distance (or similarity, its inverse) between two expression vectors is fundamental to clustering algorithms.
- Distance between vectors is the basis upon which decisions are made when grouping similar patterns of expression.
- Selection of a distance metric defines the concept of distance.
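8Distance: a measure of similarity between genes.
- Some distances (MeV provides 11 metrics):
1. Euclidean: $d(A, B) = \sqrt{\sum_{i=1}^{n} (x_{iA} - x_{iB})^2}$, where n is the number of experiments
3. Pearson correlation
[Figure: two points, p0 and p1, in expression space]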
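As a minimal sketch of two of these metrics, the Python below computes the Euclidean distance and a Pearson-based distance between expression vectors. The function names, and the choice of 1 - r as the Pearson distance, are assumptions made for illustration and are not taken from MeV. The example vectors are G1, G2 and G4 from the expression-space table.

```python
import math


def euclidean_distance(a, b):
    """Square root of the summed squared differences between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def pearson_r(a, b):
    """Pearson correlation coefficient between two equal-length vectors."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    ss_a = sum((x - mean_a) ** 2 for x in a)
    ss_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(ss_a * ss_b)


def pearson_distance(a, b):
    """Assumed convention: distance = 1 - r, so r = 1 gives distance 0."""
    return 1.0 - pearson_r(a, b)


# Log ratios over Exp 1..3 for three genes from the table.
g1 = [-0.8, -0.3, -0.7]
g2 = [-0.8, -0.7, -0.4]
g4 = [0.9, 1.2, 1.3]

print("Euclidean G1-G2:", euclidean_distance(g1, g2))
print("Euclidean G1-G4:", euclidean_distance(g1, g4))
print("Pearson distance G1-G2:", pearson_distance(g1, g2))
```

The two metrics can disagree: Euclidean distance reacts to the magnitude of the log ratios, whereas the Pearson-based distance only reacts to the shape of the profile.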
9Distance is Defined by a Metric
1.4
-0.90
4.2
-1.00
10Statistical Concepts
11Probability distributions
The probability of an event is the likelihood of its occurring. It is sometimes computed as a relative frequency (rf), where
rf = (number of favorable outcomes for an event) / (total number of possible outcomes for that event).
The probability of an event can sometimes be
inferred from a theoretical probability
distribution, such as a normal distribution.
12Normal distribution
σ = std. deviation of the distribution
X = µ (mean of the distribution)
13Less than a 5% chance that the sample with mean s came from population 1, i.e., s is significantly different from mean 1 at the p < 0.05 significance level. But we cannot reject the hypothesis that the sample came from population 2.
14Many biological variables, such as height and
weight, can reasonably be assumed to approximate
the normal distribution. But expression
measurements? Probably not. Fortunately, many
statistical tests are considered to be fairly
robust to violations of the normality
assumption, and other assumptions used in these
tests. Randomization / resampling based tests
can be used to get around the violation of the
normality assumption. Even when parametric
statistical tests (the ones that make use of
normal and other distributions) are valid,
randomization tests are still useful.
15Outline of a randomization test - 1
1. Compute the value of interest (i.e., the test statistic s) from your data set.
[Figure: original data set, yielding s]
2. Make fake data sets from your original data, by taking a random sub-sample of the data, or by re-arranging the data in a random fashion.
3. Re-compute s from the fake data set.
[Figure: randomized data sets, each yielding a fake s]
16Outline of a randomization test - 2
4. Repeat steps 2 and 3 many times (often several hundred to several thousand times). Keep a record of the fake s values from step 3.
5. Draw inferences about the significance of your original s value by comparing it with the distribution of the randomized (fake) s values.
[Figure: range of randomized s values; the original s value could be significant as it exceeds most of the randomized s values]
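The five steps can be sketched in Python as follows. This is an illustration under stated assumptions, not MeV's implementation: the test statistic s is taken to be the difference between two group means, the fake data sets are made by reshuffling group membership, and the function name and defaults are hypothetical.

```python
import random


def mean(values):
    return sum(values) / len(values)


def randomization_test(group_a, group_b, n_shuffles=2000, seed=0):
    """Return (observed s, fraction of fake s values at least as extreme)."""
    rng = random.Random(seed)
    # Step 1: compute s from the original data.
    observed = mean(group_a) - mean(group_b)
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    at_least_as_extreme = 0
    # Step 4: repeat steps 2 and 3 many times.
    for _ in range(n_shuffles):
        # Step 2: make a fake data set by re-arranging the data at random.
        rng.shuffle(pooled)
        # Step 3: re-compute s from the fake data set.
        fake = mean(pooled[:n_a]) - mean(pooled[n_a:])
        if abs(fake) >= abs(observed):
            at_least_as_extreme += 1
    # Step 5: compare the original s with the distribution of fake s values.
    return observed, at_least_as_extreme / n_shuffles


s, fraction = randomization_test([2.1, 2.4, 1.9, 2.6, 2.2],
                                 [0.1, -0.2, 0.3, 0.0, 0.2])
print("original s:", s)
print("fraction of fake s values at least as extreme:", fraction)
```

A small fraction means the original s lies in the tail of the randomized distribution, which is the situation drawn on the slide.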
17Outline of a randomization test - 3
Rationale: Ideally, we want to know the behavior of the larger population from which the sample is drawn, in order to make statistical inferences. Here, we don't know that the larger population behaves like a normal distribution, or some other idealized distribution. All we have to work with are the data in hand. Our fake data sets are our best guess about this behavior (i.e., if we had been pulling data at random from an infinitely large population, we might expect to get a distribution similar to what we get by pulling random sub-samples, or by reshuffling the order of the data in our sample).
18The problem of multiple testing
(adapted from a presentation by Anja von Heydebreck, Max Planck Institute for Molecular Genetics, Dept. Computational Molecular Biology, Berlin, Germany: http://www.bioconductor.org/workshops/Heidelberg02/mult.pdf)
- Let's imagine there are 10,000 genes on a chip, AND
- None of them is differentially expressed.
- Suppose we use a statistical test for differential expression, where we consider a gene to be differentially expressed if it meets the criterion at a p-value of p < 0.05.
19The problem of multiple testing - 2
- Let's say that applying this test to gene G1 yields a p-value of p = 0.01.
- Remember that a p-value of 0.01 means that, if the gene were not differentially expressed, there would be only a 1% chance of obtaining a result at least this extreme, i.e.,
- Even though we conclude that the gene is differentially expressed (because p < 0.05), we accept a small (about 1%) risk that this conclusion is a false positive.
- We might be willing to live with such a low probability of being wrong
- BUT .....
20The problem of multiple testing - 3
- We are testing 10,000 genes, not just one!!!
- Even though none of the genes is differentially expressed, about 5% of the genes (i.e., 500 genes) will be erroneously concluded to be differentially expressed, because we have decided to live with a p-value of 0.05.
- If only one gene were being studied, a 5% margin of error might not be a big deal, but 500 false conclusions in one study? That doesn't sound too good.
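The arithmetic can be checked with a small simulation. The sketch below assumes that, when no gene is differentially expressed, each gene's p-value behaves like a uniform random number between 0 and 1; the function name and the seed are hypothetical.

```python
import random


def count_false_positives(n_genes=10_000, alpha=0.05, seed=1):
    """Count genes called significant when none is truly differentially expressed."""
    rng = random.Random(seed)
    # Assumption: under the null hypothesis, p-values are uniform on [0, 1).
    p_values = [rng.random() for _ in range(n_genes)]
    return sum(p < alpha for p in p_values)


expected = 10_000 * 0.05  # 500 genes expected to pass by chance alone
observed = count_false_positives()
print("expected false positives:", expected)
print("simulated false positives:", observed)

# Evaluating each gene at alpha / n_genes instead leaves almost none.
print("with per-gene cutoff 0.05 / 10,000:",
      count_false_positives(alpha=0.05 / 10_000))
```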
21The problem of multiple testing - 4
- There are tricks we can use to reduce the severity of this problem.
- They all involve slashing the p-value for each test (i.e., gene), so that while the critical p-value for the entire data set might still equal 0.05, each gene will be evaluated at a lower p-value.
- We'll go into some of these techniques later.
22Don't get too hung up on p-values.
- Ultimately, what matters is biological relevance.
- P-values should help you evaluate the strength of the evidence, rather than being used as an absolute yardstick of significance. Statistical significance is not necessarily the same as biological significance.
23i.e., you don't want to belong to "that group of people whose aim in life is to be wrong 5% of the time"!!!
Kempthorne, O., and T.E. Doerfler. 1969. The behaviour of some significance tests under experimental randomization. Biometrika 56:231-248, as cited in Manly, B.F.J. 1997. Randomization, bootstrap and Monte Carlo methods in biology, pg. 1. Chapman and Hall / CRC.
24Pearson correlation coefficient r
- Indicates the degree to which a linear relationship can be approximated between two variables.
- Can range from (-1.0) to (+1.0).
- Positive r between two variables X and Y: as X increases, so does Y on the whole.
- Negative r: as X increases, Y generally decreases.
- The higher the magnitude of r (in the positive or negative direction), the more linear the relationship.
25Pearson correlation - 2
- Sometimes, a p-value is associated with the correlation coefficient r.
- This p-value is computed from a theoretical distribution of the correlation coefficient, similar to the normal distribution.
- This is the p-value for the null hypothesis that the X and Y data for our sample come from a population in which their correlation is zero, i.e., the null hypothesis is that there is no linear relationship between X and Y. If p is sufficiently small (often p < 0.05), we can reject the null hypothesis, i.e., we conclude that there is indeed a linear relationship between X and Y.
26Pearson correlation - 3
The square of the Pearson correlation, r^2, also known as the coefficient of determination, is a measure of the strength of the linear relationship between X and Y. It is the proportion of the total variation in one variable that is explained by its linear relationship with the other.
27Algorithms
28Hierarchical Clustering (HCL)
HCL is an agglomerative clustering method which
joins similar genes into groups. The iterative
process continues with the joining of resulting
groups based on their similarity until all groups
are connected in a hierarchical tree.
(HCL-1)
29Hierarchical Clustering
g1 is most like g8
g4 is most like g1, g8
(HCL-2)
30Hierarchical Clustering
g5 is most like g7
g5,g7 is most like g1, g4, g8
(HCL-3)
31Hierarchical Tree
(HCL-4)
32Hierarchical Clustering
During construction of the hierarchy, decisions
must be made to determine which clusters should
be joined. The distance or similarity between
clusters must be calculated. The rules that
govern this calculation are linkage methods.
(HCL-5)
33Agglomerative Linkage Methods
- Linkage methods are rules or metrics that return a value that can be used to determine which elements (clusters) should be linked.
- Three linkage methods that are commonly used are:
  - Single Linkage
  - Average Linkage
  - Complete Linkage
(HCL-6)
34Single Linkage
Cluster-to-cluster distance is defined as the minimum distance between members of one cluster and members of another cluster. Single linkage tends to create elongated clusters with individual genes chained onto clusters.
$D_{AB} = \min\left( d(u_i, v_j) \right)$, where $u \in A$ and $v \in B$, for all $i = 1$ to $N_A$ and $j = 1$ to $N_B$
(HCL-7)
35Average Linkage
Cluster-to-cluster distance is defined as the average distance between all members of one cluster and all members of another cluster. Average linkage has a slight tendency to produce clusters of similar variance.
$D_{AB} = \frac{1}{N_A N_B} \sum_{i=1}^{N_A} \sum_{j=1}^{N_B} d(u_i, v_j)$, where $u \in A$ and $v \in B$, for all $i = 1$ to $N_A$ and $j = 1$ to $N_B$
(HCL-8)
36Complete Linkage
Cluster-to-cluster distance is defined as the maximum distance between members of one cluster and members of another cluster. Complete linkage tends to create clusters of similar size and variability.
$D_{AB} = \max\left( d(u_i, v_j) \right)$, where $u \in A$ and $v \in B$, for all $i = 1$ to $N_A$ and $j = 1$ to $N_B$
(HCL-9)
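The three linkage rules differ only in how they summarize the set of pairwise distances between two clusters. A minimal Python sketch, assuming Euclidean distance as d and using hypothetical function names:

```python
import math


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def pairwise_distances(cluster_a, cluster_b, d=euclidean):
    """All distances d(u_i, v_j) with u in A and v in B."""
    return [d(u, v) for u in cluster_a for v in cluster_b]


def single_linkage(cluster_a, cluster_b, d=euclidean):
    return min(pairwise_distances(cluster_a, cluster_b, d))


def average_linkage(cluster_a, cluster_b, d=euclidean):
    distances = pairwise_distances(cluster_a, cluster_b, d)
    return sum(distances) / (len(cluster_a) * len(cluster_b))


def complete_linkage(cluster_a, cluster_b, d=euclidean):
    return max(pairwise_distances(cluster_a, cluster_b, d))


# Two small clusters of two-experiment expression vectors (made-up values).
cluster_a = [[0.0, 0.0], [0.0, 1.0]]
cluster_b = [[3.0, 0.0], [4.0, 0.0]]
print("single:  ", single_linkage(cluster_a, cluster_b))
print("average: ", average_linkage(cluster_a, cluster_b))
print("complete:", complete_linkage(cluster_a, cluster_b))
```

For any pair of clusters the single-linkage value is never larger than the average, and the average is never larger than the complete-linkage value.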
37Comparison of Linkage Methods
(HCL-10)
38Bootstrapping (ST)
Bootstrapping: resampling with replacement
[Figure: original expression matrix (Gene 1 to Gene 6) and various bootstrapped matrices (by experiments)]
39Jackknifing (ST)
Jackknifing: resampling without replacement
[Figure: original expression matrix and various jackknifed matrices (by experiments)]
40Analysis of Bootstrapped and Jackknifed Support
Trees
- Bootstrapped or jackknifed expression matrices are created many times by randomly resampling the original expression matrix, using either the bootstrap or jackknife procedure.
- Each time, hierarchical trees are created from the resampled matrices.
- The trees are compared to the tree obtained from the original data set.
- The more frequently a given cluster from the original tree is found in the resampled trees, the stronger the support for the cluster.
- As each resampled matrix lacks some of the original data, high support for a cluster means that the clustering is not biased by a small subset of the data.
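The resampling step, applied to experiments (columns), might look like the sketch below. It assumes the expression matrix is a list of gene rows and that a jackknifed matrix is made by leaving out one randomly chosen experiment; both the function names and that leave-one-out choice are assumptions, not a description of MeV's code.

```python
import random


def bootstrap_experiments(matrix, rng):
    """Resample experiment columns with replacement, keeping the column count."""
    n_exp = len(matrix[0])
    chosen = [rng.randrange(n_exp) for _ in range(n_exp)]
    return [[row[c] for c in chosen] for row in matrix]


def jackknife_experiments(matrix, rng):
    """Resample without replacement by leaving out one random experiment."""
    n_exp = len(matrix[0])
    left_out = rng.randrange(n_exp)
    return [[v for c, v in enumerate(row) if c != left_out] for row in matrix]


rng = random.Random(0)
# Rows are genes, columns are experiments (made-up values).
expression = [
    [-0.8, -0.3, -0.7, 0.1],
    [0.9, 1.2, 1.3, 1.0],
]
print(bootstrap_experiments(expression, rng))
print(jackknife_experiments(expression, rng))
```

Each resampled matrix would then be clustered, and the resulting tree compared with the tree from the original matrix.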
41K-Means / K-Medians Clustering (KMC) 1
1. Specify number of clusters, e.g., 5.
2. Randomly assign genes to clusters.
42K-Means Clustering 2
3. Calculate mean / median expression profile of
each cluster.
4. Shuffle genes among clusters such that each gene is now in the cluster whose mean / median expression profile (calculated in step 3) is the closest to that gene's expression profile.
5. Repeat steps 3 and 4 until genes cannot be
shuffled around any more, OR a user-specified
number of iterations has been reached.
K-Means / K-Medians is most useful when the user
has an a-priori hypothesis about the number of
clusters the genes should group into.
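Steps 1 to 5 can be sketched in Python as below, for the K-Means variant with Euclidean distance. The function names, the default iteration limit and the handling of empty clusters (they are simply skipped) are assumptions for illustration.

```python
import math
import random


def mean_profile(vectors):
    """Mean expression profile of a list of gene vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def k_means(genes, k, max_iterations=50, seed=0):
    """Return one cluster label per gene."""
    rng = random.Random(seed)
    # Step 2: randomly assign genes to clusters.
    labels = [rng.randrange(k) for _ in genes]
    for _ in range(max_iterations):
        # Step 3: mean profile of each non-empty cluster.
        centroids = {}
        for c in range(k):
            members = [g for g, lab in zip(genes, labels) if lab == c]
            if members:
                centroids[c] = mean_profile(members)
        # Step 4: move each gene to the cluster with the closest mean profile.
        new_labels = [
            min(centroids, key=lambda c: euclidean(g, centroids[c]))
            for g in genes
        ]
        # Step 5: stop when no gene changes cluster.
        if new_labels == labels:
            break
        labels = new_labels
    return labels


genes = [
    [-0.8, -0.3, -0.7],
    [-0.8, -0.7, -0.4],
    [-0.4, -0.6, -0.8],
    [0.9, 1.2, 1.3],
    [1.3, 0.9, -0.6],
]
print(k_means(genes, k=2))
```

Because the initial assignment is random, different seeds can give different final clusters, and a run can end with fewer than k non-empty clusters.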
43Principal Components (PCAG and PCAE) 1
1. PCA simplifies the views of the data.
2. Suppose we have measurements for each gene on multiple experiments.
3. Suppose some of the experiments are correlated.
4. PCA will ignore the redundant experiments, and will take a weighted average of some of the experiments, thus possibly making the trends in the data more interpretable.
5. The components can be thought of as axes in n-dimensional space, where n is the number of components. Each axis represents a different trend in the data.
44PCAG and PCAE - 2
In this example, the x-axis could mean a continuum from over- to under-expression (blue and green genes over-expressed, yellow genes under-expressed). The y-axis could mean that gray genes are over-expressed in the first five expts and under-expressed in the remaining expts, while brown genes are under-expressed in the first five expts and over-expressed in the remaining expts. The z-axis might represent different cyclic patterns, e.g., red genes might be over-expressed in odd-numbered expts and under-expressed in even-numbered ones, whereas the opposite is true for purple genes. Interpretation of components is somewhat subjective.
45Cluster Affinity Search Technique (CAST)
- Uses an iterative approach to segregate elements with high affinity into a cluster.
- The process iterates through two phases:
  - addition of high-affinity elements to the cluster being created
  - removal or clean-up of low-affinity elements from the cluster being created
46Clustering Affinity Search Technique (CAST)-1
Affinity: a measure of similarity between a gene and all the genes in a cluster.
Threshold affinity: user-specified criterion for retaining a gene in a cluster, defined as a percentage of the maximum affinity at that point.
1. Create a new empty cluster C1.
2. Set the initial affinity of all genes to zero.
3. Move the two most similar genes into the new cluster.
4. Update the affinities of all the genes (new affinity of a gene = its previous affinity + its similarity to the gene(s) newly added to the cluster C1).
ADD GENES
5. While there exists an unassigned gene whose affinity to the cluster C1 exceeds the user-specified threshold affinity, pick the unassigned gene whose affinity is the highest, and add it to cluster C1. Update the affinities of all the genes accordingly.
47CAST 2
REMOVE GENES
6. When there are no more unassigned high-affinity genes, check to see if cluster C1 contains any elements whose affinity is lower than the current threshold. If so, remove the lowest-affinity gene from C1. Update the affinities of all genes by subtracting from each gene's affinity its similarity to the removed gene.
7. Repeat step 6 while C1 contains a low-affinity gene.
[Figure: the current cluster C1 and the unassigned genes, drawn from G1 to G15]
8. Repeat steps 5-7 as long as changes occur to the cluster C1.
9. Form a new cluster with the genes that were not assigned to cluster C1, repeating steps 1-8.
10. Keep forming new clusters following steps 1-9, until all genes have been assigned to a cluster.
48QT-Clust (from Heyer et al. 1999) (HJC) - 1
1. Compute a jackknifed distance between all pairs of genes.
(Jackknifed distance: The data from one experiment are excluded from both genes, and the distance is calculated. Each experiment is thus excluded in turn, and the maximum distance between the two genes (over all exclusions) is the jackknifed distance. This is a conservative estimate of distance that accounts for bias that might be introduced by single outlier experiments.)
2. Choose a gene as the seed for a new cluster.
Add the gene which increases cluster diameter
the least. Continue adding genes until
additional genes will exceed the specified
cluster diameter limit.
3. Repeat step 2 for every gene, so that each
gene has the chance to be the seed of a new
cluster. All clusters are provisional at this
point.
49QT-Clust 2
4. Choose the largest cluster obtained from steps
2 and 3. In case of a tie, pick one of the
largest clusters at random.
[Figure: provisional clusters grown from different seed genes (labels shown: G4, G9, G3); the largest cluster is picked]
5. All genes that are not in the cluster selected
above are treated as currently unassigned.
Repeat steps 2-4 on these unassigned genes.
6. Stop when the last cluster thus formed has
fewer genes than a user-specified number. All
genes that are not in a cluster at this point are
treated as unassigned.
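The jackknifed distance of step 1 can be sketched as follows. The base distance is assumed here to be 1 - Pearson r, which the slides do not specify, and the function names are hypothetical. The example shows how a single outlier experiment can make two otherwise opposite profiles look similar, and how the jackknife removes that effect.

```python
import math


def pearson_distance(a, b):
    """Assumed base distance: 1 - Pearson correlation."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    ss_a = sum((x - mean_a) ** 2 for x in a)
    ss_b = sum((y - mean_b) ** 2 for y in b)
    return 1.0 - cov / math.sqrt(ss_a * ss_b)


def jackknifed_distance(a, b, d=pearson_distance):
    """Exclude each experiment in turn; keep the maximum distance found."""
    distances = []
    for skip in range(len(a)):
        a_rest = a[:skip] + a[skip + 1:]
        b_rest = b[:skip] + b[skip + 1:]
        distances.append(d(a_rest, b_rest))
    return max(distances)


# The fourth experiment is an outlier shared by both genes (made-up values).
gene_a = [1.0, 2.0, 3.0, 10.0]
gene_b = [3.0, 2.0, 1.0, 10.0]
print("plain distance:     ", pearson_distance(gene_a, gene_b))
print("jackknifed distance:", jackknifed_distance(gene_a, gene_b))
```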
50Self Organizing Tree Algorithm
SOTA - 1
- Dopazo, J., and J.M. Carazo. Phylogenetic reconstruction using an unsupervised growing neural network that adopts the topology of a phylogenetic tree. J. Mol. Evol. 44:226-233, 1997.
- Herrero, J., A. Valencia, and J. Dopazo. A hierarchical unsupervised growing neural network for clustering gene expression patterns. Bioinformatics, 17(2):126-136, 2001.
51SOTA Characteristics
SOTA - 2
- Divisive clustering, allowing high-level hierarchical structure to be revealed without having to completely partition the data set down to single gene vectors
- Data set is reduced to clusters arranged in a binary tree topology
- The number of resulting clusters is not fixed before clustering
- Neural network approach which has advantages similar to SOMs, such as handling large data sets that have large amounts of noise
52SOTA Topology
SOTA - 3
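[Figure: a parent node with two offspring cells, the winning cell and the sister cell; each cell has a centroid vector and members]
α = migration factor (α_s < α_p < α_w, for the sister cell, parent node and winning cell respectively)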
53Adaptation Overview
SOTA - 4
- Each gene vector associated with the parent is compared to the centroid vector of its offspring cells.
- The most similar cell's centroid and its neighboring cells are adapted using the appropriate migration weights.
54SOTA - 5
- Following the presentation of all genes to the system, a measure of system diversity is used to determine if training has found an optimal position for the offspring.
- If the system diversity improves (decreases), then another training epoch is started; otherwise training ends and a new cycle starts with a cell division.
55SOTA - 6
The most diverse cell is selected for division
at the start of the next training cycle.
56Growth Termination
SOTA - 7
Expansion stops when the most diverse cell's diversity falls below a threshold.
57SOTA - 8
Each training cycle ends when the overall tree
diversity stabilizes. This triggers a cell
division and possibly a new training cycle.
58Self-organizing maps (SOMs) 1
1. Specify the number of nodes (clusters)
desired, and also specify a 2-D geometry for the
nodes, e.g., rectangular or hexagonal
[Figure: N = nodes, G = genes]
59SOMs 2
2. Choose a random gene, e.g., G9
3. Move the nodes in the direction of G9. The
node closest to G9 (N2) is moved the most, and
the other nodes are moved by smaller varying
amounts. The further away the node is from N2,
the less it is moved.
60SOM Neighborhood Options
- Bubble neighborhood: some nodes move (those within the radius); alpha is constant.
- Gaussian neighborhood: all nodes move; alpha is scaled.
[Figure: nodes N1 to N6 and genes G7 to G11 shown under each neighborhood option]
61SOMs 3
4. Steps 2 and 3 (i.e., choosing a random gene
and moving the nodes towards it) are repeated
many (usually several thousand) times. However,
with each iteration, the amount that the nodes
are allowed to move is decreased.
5. Finally, each node will nestle among a
cluster of genes, and a gene will be considered
to be in the cluster if its distance to the node
in that cluster is less than its distance to any
other node
[Figure: nodes N1 to N6, each nestled among its cluster of genes]
62Template Matching
- Template matching allows one to find expression vectors which match a provided template.
- A template can be derived from:
  - a gene known to be central to the area of study
  - a sample or set of samples of a particular type
  - a cluster with a mean pattern of interest
  - a pattern constructed to reveal trends based on knowledge of the experimental design
63PTM-2
-Sometimes it is useful to identify elements that
have complementary patterns by selecting to use
the absolute value of r.
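A minimal sketch of template matching with the Pearson correlation r is shown below. The cutoff on r, the function names and the gene names are all hypothetical; the slides do not state how the match threshold is chosen.

```python
import math


def pearson_r(a, b):
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    ss_a = sum((x - mean_a) ** 2 for x in a)
    ss_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(ss_a * ss_b)


def match_template(genes, template, threshold=0.9, use_absolute=False):
    """Return names of genes whose r (or |r|) with the template meets the threshold."""
    matches = []
    for name, vector in genes.items():
        r = pearson_r(vector, template)
        score = abs(r) if use_absolute else r
        if score >= threshold:
            matches.append(name)
    return matches


template = [1.0, 2.0, 3.0]  # steadily increasing expression
genes = {
    "rising": [2.0, 4.0, 6.5],
    "falling": [3.0, 2.0, 1.0],
    "peaked": [1.0, 3.0, 1.0],
}
print(match_template(genes, template))
print(match_template(genes, template, use_absolute=True))
```

With the absolute value of r, the falling gene is reported alongside the rising one, because its pattern is the complement of the template.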
64K-Means / K-Medians Support (KMS)
- Because of the random initialization of K-Means / K-Medians, clustering results may vary somewhat between successive runs on the same dataset. KMS helps us validate the clustering results obtained from K-Means / K-Medians.
1. Run K-Means / K-Medians multiple times.
2. The KMS module generates clusters in which the member genes frequently group together in the same clusters (consensus clusters) across multiple runs of K-Means / K-Medians.
3. The consensus clusters consist of genes that clustered together in at least x% of the K-Means / K-Medians runs, where x is the threshold percentage input by the user.
65Gene Shaving
Results in a series of nested clusters
Choose cluster of appropriate size as determined
by gap statistic calculation
Repeat until only one gene remains
Orthogonalize expression matrix with respect to
the average gene in the cluster and repeat
shaving procedure
66Gene Shaving
Gap statistic calculation (choosing cluster size)
Quality measure for clusters
between variance of mean gene across experiments
within variance of each gene about the cluster
average
Large R2 implies a tight cluster of coherent genes
The final cluster contains a set of genes that
are greatly affected by the experimental
conditions in a similar way.
Create random permutations of the expression
matrix and calculate R2 for each
Compare R2 of each cluster to that of the entire
expression matrix
Choose the cluster whose R2 is furthest from the
average R2 of the permuted expression matrices.
67Relevance Networks
Set of genes whose expression profiles are
predictive of one another.
Can be used to identify negative correlations
between genes
Genes with low entropy (least variable across
experiments) are excluded from analysis.
68Relevance Networks
Tmin 0.50
The expression pattern of each gene compared to
that of every other gene.
The remaining relationships between genes define
the subnets
Tmax 0.90
Correlation coefficients outside the boundaries
defined by the minimum and maximum thresholds are
eliminated.
The ability of each gene to predict the
expression of each other gene is assigned a
correlation coefficient
69T-Tests (TTEST): Between subjects (or unpaired) - 1
1. Assign experiments to two groups, e.g., in the expression matrix below, assign Experiments 1, 2 and 5 to group A, and experiments 3, 4 and 6 to group B.
2. Question: Is the mean expression level of a gene in group A significantly different from the mean expression level in group B?
70TTEST Between subjects - 2
3. Calculate the t-statistic for each gene.
4. Calculate the probability value of the t-statistic for each gene, either from:
A. the theoretical t-distribution, OR
B. permutation tests.
71TTEST - Between subjects - 3
Permutation tests
i) For each gene, compute t-statistic
ii) Randomly shuffle the values of the gene
between groups A and B, such that the reshuffled
groups A and B respectively have the same number
of elements as the original groups A and B.
Original grouping
Randomized grouping
72TTEST - Between subjects - 4
Permutation tests - continued
iii) Compute the t-statistic for the randomized gene.
iv) Repeat steps i-iii n times (where n is specified by the user).
v) Let x = the number of times the absolute value of the original t-statistic exceeds the absolute values of the randomized t-statistic over n randomizations.
vi) Then, the p-value associated with the gene = 1 - (x/n).
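Steps i to vi for a single gene can be sketched in Python as below. The t-statistic used is Welch's unequal-variance form, which is an assumption because the slides do not name the variant; the function names and defaults are likewise hypothetical.

```python
import math
import random


def t_statistic(a, b):
    """Welch's t-statistic (assumed variant): mean difference over its standard error."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    var_a = sum((x - mean_a) ** 2 for x in a) / (len(a) - 1)
    var_b = sum((x - mean_b) ** 2 for x in b) / (len(b) - 1)
    return (mean_a - mean_b) / math.sqrt(var_a / len(a) + var_b / len(b))


def permutation_p_value(group_a, group_b, n=1000, seed=0):
    rng = random.Random(seed)
    # i) t-statistic of the original grouping.
    t_original = abs(t_statistic(group_a, group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    x = 0
    # iv) repeat n times.
    for _ in range(n):
        # ii) reshuffle the values between groups A and B, keeping group sizes.
        rng.shuffle(pooled)
        # iii) t-statistic of the randomized gene.
        t_random = abs(t_statistic(pooled[:n_a], pooled[n_a:]))
        # v) count how often the original |t| exceeds the randomized |t|.
        if t_original > t_random:
            x += 1
    # vi) p-value = 1 - (x / n).
    return 1 - x / n


group_a = [2.1, 2.4, 1.9, 2.6, 2.2]
group_b = [0.1, -0.2, 0.3, 0.0, 0.2]
print("permutation p-value:", permutation_p_value(group_a, group_b))
```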
73TTEST - Between subjects - 5
5. Determine whether a gene's expression levels are significantly different between the two groups by one of three methods:
A) Just alpha: If the calculated p-value for a gene is less than or equal to the user-input alpha (critical p-value), the gene is considered significant.
OR
Use Bonferroni corrections to reduce the probability of erroneously classifying non-significant genes as significant.
B) Standard Bonferroni correction: The user-input alpha is divided by the total number of genes to give a critical p-value that is used as above.
74TTEST - Between subjects - 6
5C) Adjusted Bonferroni:
i) The t-values for all the genes are ranked in descending order.
ii) For the gene with the highest t-value, the critical p-value becomes (alpha / N), where N is the total number of genes; for the gene with the second-highest t-value, the critical p-value will be (alpha / (N-1)), and so on.
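The standard and adjusted Bonferroni rules can be compared with a short Python sketch. It follows the slide's wording literally: each gene gets its own critical p-value according to its rank. Ranking by the absolute t-value is an assumption, and the function names and example values are hypothetical.

```python
def standard_bonferroni(p_values, alpha=0.05):
    """Every gene is tested at alpha / N."""
    critical = alpha / len(p_values)
    return [p <= critical for p in p_values]


def adjusted_bonferroni(t_values, p_values, alpha=0.05):
    """Critical p-value is alpha / N for the top-ranked gene, alpha / (N-1) for the next, ..."""
    n = len(t_values)
    # Rank genes by descending |t| (assumption: magnitude, not signed value).
    order = sorted(range(n), key=lambda i: abs(t_values[i]), reverse=True)
    significant = [False] * n
    for rank, gene in enumerate(order):
        critical = alpha / (n - rank)
        significant[gene] = p_values[gene] <= critical
    return significant


t_values = [5.0, 3.0, 1.0, 0.5]
p_values = [0.01, 0.015, 0.2, 0.6]
print(standard_bonferroni(p_values))            # [True, False, False, False]
print(adjusted_bonferroni(t_values, p_values))  # [True, True, False, False]
```

The second gene passes only under the adjusted rule, because its critical p-value is 0.05 / 3 rather than 0.05 / 4.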
75TTEST 1-class (or One-sample t-test) - 1
1. Used to test if the mean expression of a gene over all experiments is different from a hypothesized mean.
[Figure: expression matrix with experiments Exp 1 to Exp 6 as columns; each row (Vector 1 = Gene 1, Vector 2 = Gene 2, Vector 3 = Gene 3) is a gene vector]
2. Question: Is the mean of the values of a given gene vector significantly different from a hypothesized mean?
76TTEST - 1 Class - 2
3. Often, the hypothesized mean in gene expression studies is zero, meaning that we are looking for genes whose mean log2 ratio across all experiments is significantly different from zero.
4. Using 1-sample t-tests, we can select genes which, on average, show differential expression across all experiments (since genes with no differential expression should have a mean log2 ratio of zero across all expts).
5. Calculate the t-value, where
$t = \frac{\text{Observed mean of gene vector} - \text{Hypothesized mean of gene vector}}{\text{Standard error of the mean of the gene vector}}$
77TTEST 1 class - 3
6. Calculate the p-value from a theoretical t-distribution, OR
7. By permutation:
7a. Randomly pick some elements of the gene vector, and change their values, such that the new value of the changed element is [original value - 2 x (original value - hypothesized mean)] (i.e., flip the element's deviation around the hypothesized mean).
Thus, if the hypothesized mean is zero, the randomly chosen elements are flipped around zero, i.e., their signs are reversed.
[Figure: original and randomized gene values; the flipped elements were randomly chosen]
78TTEST 1 class - 4
7b. Calculate the t-value from the randomized gene.
7c. Repeat 7a and 7b as many times as desired. If all permutations are chosen, then every possible combination of elements in the gene vector is chosen for flipping.
7d. The p-value = 1 - (the proportion of times that the original absolute t-value exceeds the randomized absolute t-value over all the permutations conducted).
8. If a gene's p-value is less than or equal to the user-specified critical p-value, the gene's mean expression over all experiments is significantly different from the hypothesized mean.
9. Bonferroni and adjusted Bonferroni corrections may be applied just as in the two-sample t-test.
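The one-sample test with the flipping permutation (steps 5 and 7a to 7d) can be sketched as follows. Choosing each element for flipping with probability 0.5 is an assumption, as are the function names and defaults.

```python
import math
import random


def one_sample_t(values, hypothesized_mean=0.0):
    """(observed mean - hypothesized mean) / standard error of the mean."""
    n = len(values)
    mean = sum(values) / n
    variance = sum((v - mean) ** 2 for v in values) / (n - 1)
    return (mean - hypothesized_mean) / math.sqrt(variance / n)


def flip_random_elements(values, hypothesized_mean, rng):
    """7a: flip randomly chosen elements around the hypothesized mean."""
    return [
        v - 2 * (v - hypothesized_mean) if rng.random() < 0.5 else v
        for v in values
    ]


def one_sample_permutation_p(values, hypothesized_mean=0.0,
                             n_permutations=1000, seed=0):
    rng = random.Random(seed)
    t_original = abs(one_sample_t(values, hypothesized_mean))
    exceeded = 0
    for _ in range(n_permutations):
        randomized = flip_random_elements(values, hypothesized_mean, rng)
        # 7b: t-value of the randomized gene.
        t_random = abs(one_sample_t(randomized, hypothesized_mean))
        if t_original > t_random:
            exceeded += 1
    # 7d: p-value = 1 - proportion of times the original |t| exceeds the randomized |t|.
    return 1 - exceeded / n_permutations


gene = [0.9, 1.2, 1.3, 1.1, 0.8, 1.0]  # log2 ratios over six experiments
print("t-value:", one_sample_t(gene))
print("permutation p-value:", one_sample_permutation_p(gene))
```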
79One Way Analysis of Variance (ANOVA)
1. Assign experiments to > 2 groups.
[Figure: experiments assigned to groups (labels shown: Group 2, Group 3)]
2. Question: Is the mean expression level of a gene the same across all groups?
80ANOVA - 2
3. Calculate an F-ratio for each gene, where
$F = \frac{\text{Mean square (groups)}}{\text{Mean square (error)}}$, which is a measure of $\frac{\text{Between-groups variability}}{\text{Within-groups variability}}$
The larger the value of F, the greater the difference among the group means relative to the sampling error variability (which is the within-groups variability), i.e., the larger the value of F, the more likely it is that the differences among the group means reflect real differences among the means of the populations they are drawn from, rather than being due to random sampling error.
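For one gene, the F-ratio can be computed as in the sketch below, which uses the usual one-way ANOVA degrees of freedom (k - 1 for groups, N - k for error). The function name and the example values are hypothetical.

```python
def f_ratio(groups):
    """One-way ANOVA F = mean square (groups) / mean square (error) for one gene."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    group_means = [sum(g) / len(g) for g in groups]
    # Between-groups sum of squares: group means around the grand mean.
    ss_groups = sum(len(g) * (m - grand_mean) ** 2
                    for g, m in zip(groups, group_means))
    # Within-groups (error) sum of squares: values around their own group mean.
    ss_error = sum((v - m) ** 2
                   for g, m in zip(groups, group_means) for v in g)
    ms_groups = ss_groups / (k - 1)
    ms_error = ss_error / (n_total - k)
    return ms_groups / ms_error


# One gene measured in three groups of three experiments each.
groups = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [3.0, 4.0, 5.0]]
print("F =", f_ratio(groups))  # F = 3.0
```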
81ANOVA - 3
4. The p-value associated with an F-value is the probability that an F-value at least that large would be obtained if there were no differences among group means (i.e., given the null hypothesis). Therefore, the smaller the p-value, the less likely it is that the null hypothesis is valid, i.e., the differences among group means are more likely to reflect real population differences as p-values decrease in magnitude.
82ANOVA - 4
5. P-values can be obtained for the F-values from a theoretical F-distribution, assuming that the populations from which the data are obtained
- are normally distributed, and
- have homogeneous variances.
The test is considered robust to violations of
these assumptions, provided sample sizes are
relatively large and similar across groups.
83ANOVA - 5
6. P-values can be obtained from permutation tests (just like in t-tests), if one does not want to rely on the assumptions needed for using the F-distribution. P-values can also be corrected for multiple comparisons (using Bonferroni or other procedures). These features will soon be implemented in MeV.
84Two-factor ANOVA (TFA)
- Can be used to find genes whose expression is significantly different over two factors (e.g., sex and strain), as well as to look for genes with a significant interaction for these two factors.
[Figure: two-factor design crossing strain (Strain A, Strain B, Strain C) with sex (Male, Female)]
85TFA - 2
86TFA - 3
- Ideally, the design should be balanced, i.e., equal numbers of samples in each factor A x factor B combination.
- If unbalanced, the analysis can still be conducted, but F-tests will be somewhat biased. May need to use smaller p-values.
- Can have balanced designs with no replication (see below). In this case, interaction cannot be tested.
87Significance analysis of microarrays (SAM)
- SAM can be used to pick out significant genes based on differential expression between sets of samples.
- Currently implemented for the following designs:
  - two-class unpaired
  - two-class paired
  - multi-class
  - censored survival
  - one-class
88SAM -2
- SAM gives estimates of the False Discovery Rate (FDR), which is the proportion of genes likely to have been wrongly identified by chance as being significant.
- It is a very interactive algorithm: it allows users to dynamically change thresholds for significance (through the tuning parameter delta) after looking at the distribution of the test statistic.
- The ability to dynamically alter the input parameters based on immediate visual feedback, even before completing the analysis, should make the data-mining process more sensitive.
89SAM designs
- Two-class unpaired: to pick out genes whose mean expression level is significantly different between two groups of samples (analogous to between subjects t-test).
- Two-class paired: samples are split into two groups, and there is a 1-to-1 correspondence between a sample in group A and one in group B (analogous to paired t-test).
90SAM designs - 2
- Multi-class: picks up genes whose mean expression is different across > 2 groups of samples (analogous to one-way ANOVA).
- Censored survival: picks up genes whose expression levels are correlated with duration of survival.
- One-class: picks up genes whose mean expression across experiments is different from a user-specified mean.
91SAM Two-Class Unpaired
1. Assign experiments to two groups, e.g., in the expression matrix below, assign Experiments 1, 2 and 5 to group A, and experiments 3, 4 and 6 to group B.
2. Question: Is the mean expression level of a gene in group A significantly different from the mean expression level in group B?
92SAM Two-Class Unpaired - 2
Permutation tests
i) For each gene, compute the d-value (analogous to the t-statistic). This is the observed d-value for that gene.
ii) Rank the genes in ascending order of their d-values.
iii) Randomly shuffle the values of the genes between groups A and B, such that the reshuffled groups A and B respectively have the same number of elements as the original groups A and B. Compute the d-value for each randomized gene.
[Figure: original grouping vs. randomized grouping]
93SAM Two-Class Unpaired - 3
iv) Rank the permuted d-values of the genes in
ascending order.
v) Repeat steps iii) and iv) many times, so that
each gene has many randomized d-values
corresponding to the rank of its observed
(unpermuted) d-value. Take the average of the
randomized d-values for each gene. This is the
expected d-value of that gene.
vi) Plot the observed d-values vs. the expected
d-values.
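Steps i) through v) can be sketched in a few lines of Python. This is only an illustrative sketch, not the MeV implementation: the function names, the pooled standard error in the denominator, the small constant s0, and the synthetic data are all assumptions made for the example. Shuffling the group labels of the experiments is used here as an equivalent way of shuffling values between groups A and B while keeping the group sizes fixed.

```python
import numpy as np

def d_values(data, in_b, s0=0.0):
    """d-value per gene (row): difference of group means divided by a
    pooled standard error plus a constant s0 (s0 is an assumed parameter)."""
    a, b = data[:, ~in_b], data[:, in_b]
    n_a, n_b = a.shape[1], b.shape[1]
    ss_a = ((a - a.mean(axis=1, keepdims=True)) ** 2).sum(axis=1)
    ss_b = ((b - b.mean(axis=1, keepdims=True)) ** 2).sum(axis=1)
    pooled_var = (ss_a + ss_b) / (n_a + n_b - 2)
    s = np.sqrt(pooled_var * (1.0 / n_a + 1.0 / n_b))
    return (b.mean(axis=1) - a.mean(axis=1)) / (s + s0)

def expected_d_values(data, in_b, n_perm=200, s0=0.0, seed=0):
    """Return (observed, expected) d-values, both in ascending rank order."""
    rng = np.random.default_rng(seed)
    observed = np.sort(d_values(data, in_b, s0))         # steps i) and ii)
    permuted = np.empty((n_perm, data.shape[0]))
    for p in range(n_perm):
        shuffled = rng.permutation(in_b)                 # step iii): sizes kept
        permuted[p] = np.sort(d_values(data, shuffled, s0))  # step iv)
    expected = permuted.mean(axis=0)                     # step v): mean per rank
    return observed, expected

# Experiments 1, 2 and 5 in group A; 3, 4 and 6 in group B (as on the slide).
in_group_b = np.array([False, False, True, True, False, True])
# Synthetic log ratios for 100 genes, for illustration only.
demo = np.random.default_rng(42).normal(size=(100, 6))
observed, expected = expected_d_values(demo, in_group_b, n_perm=50)
```

Plotting `observed` against `expected` gives the plot described in step vi). A gene with zero variance in both groups would need s0 > 0 to avoid division by zero.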
94SAM Two-Class Unpaired 4
95SAM Two-Class Unpaired 5
- For each permutation of the data, compute the
number of positive and negative significant genes
for a given delta as explained in the previous
slide. The median number of significant genes
from these permutations is the median number of
falsely significant genes, on which the median
False Discovery Rate is based.
- The rationale behind this is that any genes
designated as significant from the randomized
data are being picked up purely by chance (i.e.,
falsely discovered). Therefore, the median
number picked up over many randomizations is a
good estimate of the number of false discoveries.
96SAM Two-Class Paired
- Samples fall into two groups.
- Each member of group A is associated with a
member of group B in a 1-to-1 relationship.
(Figure: A-B pair)
97SAM Two-Class Paired - 2
- E.g., groups A and B could respectively represent
before and after a drug treatment, and each
A-B pair of samples could come from the same
patient before and after the treatment.
- Or, groups A and B could represent two strains
for which samples were collected at several
time points over a time course study. A sample
collected from each of strains A and B at the same
time point could form an A-B pair.
- The rest of the analysis is similar to two-class
unpaired SAM. Positive significant genes are
those for which Mean(Group B) is significantly
larger than Mean(Group A), and the reverse is true
for negative significant genes.
98SAM Multi-Class
- Extension of SAM two-class unpaired to more
than 2 groups.
- Experiments belong to one of at least three
groups.
- Analogous to one-way between-subjects ANOVA.
(Figure labels: Group 2, Group 3)
99SAM Multi-Class - 2
- This analysis yields only positive significant
genes.
- These are genes whose means are significantly
different across some combination of the groups
of experiments.
100SAM Censored Survival
- Each experiment (sample) is associated with an
observation time, and a state at the time of
observation.
- The state is either dead or censored.
- Censored means that the subject survived
beyond the time point at which the sample was
taken.
- A positive score means that a higher expression
level for that gene implies shorter survival
(i.e., higher risk), whereas a negative score
means that higher expression implies longer
survival.
101SAM One-Class
- used to pick up genes whose mean expression
across experiments is different from a
user-specified mean.
- analogous to a one-sample t-test
- positive genes are those whose means are greater
than the specified mean, while negative genes
have means smaller than the specified mean
102Support Vector Machines (SVM)
- supervised learning technique
- uses supplied information such as presumptive
biological relationships between a set of
elements, and the expression profiles of elements
to produce a binary classification of elements.
103Supervised Learning
- begins with the definition of a class which
specifies in advance which elements should
cluster together.
- i.e., genes for enzymes in a common pathway or
part of a regulatory system, or samples may be a
tissue type or from a particular strain.
- this information is used to train the
SVM to discriminate members from non-members
104SVM Process Overview
SVM Training
SVM Classification
Elements In Classification
Elements Out of Classification
105SVM Classification
- SVM attempts to find an optimal separating
hyperplane between members of the two initial
classifications.
Separating hyperplane
106Separation Problem
- an optimal hyperplane partitions the initial
classification correctly and maximizes distance
from the plane to elements on either side,
positive and negative examples.
- when the training examples (initial
classification) consist of very diverse
expression patterns, finding an optimal
hyperplane can be impossible
107SVM Kernel Construction
- The expression data can be transformed to a
higher dimensional space (feature space) by
applying a kernel function. This transformation
can have the effect of allowing a separating
hyperplane to be found.
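To make the idea of a kernel concrete, the sketch below uses a polynomial kernel of the form (x . y + c)^d as an assumed example (the slides do not state which kernel form is meant, and the function names are hypothetical). For degree 2 with c = 0 in two dimensions, the kernel value equals an ordinary dot product in a three-dimensional feature space, which is the "transformation to a higher dimensional space" evaluated implicitly.

```python
import numpy as np

def polynomial_kernel(x, y, degree=2, constant=0.0):
    """Polynomial kernel (x . y + constant) ** degree (assumed form)."""
    return (np.dot(x, y) + constant) ** degree

def feature_map_degree2(x):
    """Explicit feature map matching the degree-2 kernel with constant 0
    for a 2-dimensional input: (x1^2, sqrt(2)*x1*x2, x2^2)."""
    x1, x2 = x
    return np.array([x1 * x1, np.sqrt(2.0) * x1 * x2, x2 * x2])

# Two short expression vectors (values taken from G1 and G4, Exp 1 and Exp 2).
g1 = np.array([-0.8, -0.3])
g4 = np.array([0.9, 1.2])

kernel_value = polynomial_kernel(g1, g4)
explicit_value = np.dot(feature_map_degree2(g1), feature_map_degree2(g4))
# Both routes give the same number; the kernel never builds the 3-D vectors.
```

Raising `degree` enlarges the implicit feature space, which is why a high degree can separate almost any training set, including by chance.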
108Practical SVM Issues
- Results depend heavily on the input parameters.
- Using a high degree kernel function risks
artificial separation of the data.
- An iterative approach to increasing the kernel
power is advisable.
109SVM Results
- Two classes are produced:
  - Positive Class: contains elements with expression
  patterns similar to those in the positive
  examples in the training set.
  - Negative Class: contains all other members of the
  input set.
- Each of these classes has elements that fall in
two groups:
  - Those initially in the class (true positives and
  true negatives)
  - Those recruited into the class (false positives
  and false negatives)
110K-Nearest Neighbor Classification KNNC - 1
- supervised classification scheme
- user specifies the number of expected classes
- a training set of vectors is provided as input
- user specifies classes of training vectors
- training set should contain examples of each
class
111KNNC 2 pre-classification filters
- Prior to classification, variance filtering can
optionally be applied to all vectors (training
set plus vectors to be classified). This will
filter out genes with low variance across
experiments. Note that this might filter out some
genes in the training set as well.
- Correlation filtering can also be applied on the
vectors to be classified. This would filter out
those vectors in the set to be classified that
are not significantly correlated with any gene in
the training set.
- Significance for correlation filtering is
determined by a permutation test.
112KNNC 3 - correlation filtering randomization
test
1. The Pearson correlation coefficient r is
computed between a given vector to be classified
and each member of the training set.
2. The maximum such r is called the rmax for that
vector.
3. The vector is randomized a user-specified
number of times, and each time an rmax is
calculated using the randomized vector (call it
rmax*), just as in steps 1 and 2.
4. The proportion of times rmax* exceeds rmax
over all randomizations is the p-value for that
vector.
5. If the p-value for a vector < the
user-specified p-value, that vector is retained
for further analysis.
6. Steps 1-5 are repeated for every vector in the
set to be classified.
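The randomization test for a single vector can be sketched as follows. This is a minimal illustration rather than MeV's code: the function names, the default number of randomizations, and the use of a fixed random seed are assumptions.

```python
import numpy as np

def r_max(vector, training):
    """Steps 1-2: largest Pearson r between vector and any training vector."""
    return max(np.corrcoef(vector, t)[0, 1] for t in training)

def correlation_p_value(vector, training, n_random=1000, seed=0):
    """Steps 3-4: proportion of randomizations whose rmax* exceeds rmax."""
    rng = np.random.default_rng(seed)
    observed = r_max(vector, training)
    exceed = sum(
        r_max(rng.permutation(vector), training) > observed
        for _ in range(n_random)
    )
    return exceed / n_random

def passes_filter(vector, training, p_cutoff=0.05, **kwargs):
    """Step 5: keep the vector if its p-value is below the user cutoff."""
    return correlation_p_value(vector, training, **kwargs) < p_cutoff

# Toy training set with one vector; the test vector is perfectly correlated.
training_set = np.array([[1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]])
candidate = 2.0 * training_set[0]
p = correlation_p_value(candidate, training_set, n_random=200)
```

Step 6 amounts to calling `passes_filter` once for every vector in the set to be classified.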
113KNNC 4 - Classification parameters
- Let v be a vector that needs to be classified,
and T = {t1, t2, ..., t10} be the set of training
vectors.
- The user specifies the classes of each element
of T. Say there are 4 classes.
- The user also specifies the number of neighbors
k. Say k = 5.
114KNNC 5 - Classification
- Suppose v's 5 nearest neighbors in set T (by
Euclidean distance) are t1, t4, t8, t2, and t5.
- Since class 1 is most frequently represented in
v's nearest neighbors, v is assigned to class 1.
- If there is a tie in frequency of classes
represented among nearest neighbors, the vector
remains unassigned.
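The voting rule above can be sketched as a short function. The function name and the toy training data are hypothetical; the logic follows the slide: find the k nearest training vectors by Euclidean distance, assign the most frequent class, and leave the vector unassigned on a tie.

```python
from collections import Counter
import numpy as np

def knn_classify(v, training, labels, k=5):
    """Return the majority class among the k nearest training vectors,
    or None if the most frequent classes are tied."""
    distances = np.linalg.norm(np.asarray(training) - np.asarray(v), axis=1)
    nearest = np.argsort(distances)[:k]
    counts = Counter(labels[i] for i in nearest).most_common()
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return None  # tie in class frequency: vector remains unassigned
    return counts[0][0]

# Toy training set: three vectors of class 1 and three of class 2.
train = [[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]]
classes = [1, 1, 1, 2, 2, 2]

assigned = knn_classify([0.2, 0.2], train, classes, k=3)       # class 1
unassigned = knn_classify([1, 0], [[0, 0], [2, 0]], [1, 2], k=2)  # tie
```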
115EASE(Expression Analysis Systematic Explorer)
EASE analysis identifies prevalent biological
themes within gene clusters. The significance of
each identified theme is determined by its
prevalence in the cluster and in the population
of genes from which the cluster was created.
116Diverse Biological Roles
Consider a population of genes representing a
diverse set of biological roles or themes shown
below as different colors.
117Many algorithms can be applied to expression data
to partition genes based on expression profiles
over multiple conditions. Many of these
techniques work solely on expression data and
disregard biological information.
118Consider a particular cluster
- What are some of the predominant biological
themes represented in the cluster, and how should
significance be assigned to a discovered
biological theme?
119Example
Population size: 40 genes
Cluster size: 12 genes
10 genes, shown in green, have a common
biological theme and 8 occur within the cluster.
120Consider the Outcome
AND
80% of the genes related to the theme in the
population ended up within the relatively small
cluster.
121Contingency Matrix
A 2x2 contingency matrix is typically used to
capture the relationship between cluster
membership and membership in a biological theme.
123Assigning Significance to the Findings
Fisher's Exact Test permits us to determine
if there are non-random associations between the
two variables, expression-based cluster
membership and membership in a particular
biological theme.
2x2 contingency matrix for the example above
(40 genes, cluster of 12, theme of 10, 8 shared):
                 Cluster
                 in     out
Theme   in        8       2
        out       4      28
p ≈ .0002
124Hypergeometric Distribution
The probability of any particular matrix
occurring by random selection, given no
association between the two variables, is
given by the hypergeometric rule.
125Probability Computation
For our matrix, we are not only interested in the
probability of getting exactly 8 annotation hits
in the cluster, but rather the probability of
having 8 or more hits. In this case the
probabilities of each of the possible matrices
are summed.
.0002207 + 7.27x10^-6 + 7.79x10^-8 ≈ .000228
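The sum above can be reproduced from the hypergeometric rule with the example's numbers (population of 40 genes, cluster of 12, theme of 10, 8 theme genes in the cluster). The function names below are hypothetical, and this sketch covers only the one-sided tail probability shown on the slide, not any further adjustment EASE may apply.

```python
from math import comb

def hypergeom_prob(hits, cluster_size, theme_size, population_size):
    """Probability of exactly `hits` theme genes in a randomly drawn cluster."""
    return (comb(theme_size, hits)
            * comb(population_size - theme_size, cluster_size - hits)
            / comb(population_size, cluster_size))

def upper_tail_p(hits, cluster_size, theme_size, population_size):
    """Probability of `hits` or more theme genes in the cluster."""
    max_hits = min(cluster_size, theme_size)
    return sum(hypergeom_prob(h, cluster_size, theme_size, population_size)
               for h in range(hits, max_hits + 1))

exactly_8 = hypergeom_prob(8, 12, 10, 40)   # about 0.0002207
p_value = upper_tail_p(8, 12, 10, 40)       # about 0.000228 (8, 9 or 10 hits)
```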
126EASE Results
- Consider all of the results
  - EASE reports all themes represented in a cluster,
  and although some themes may not meet statistical
  significance, it may still be important to note
  that particular biological roles or pathways are
  represented in the cluster.
- Independently verify roles
  - Once found, biological themes should be
  independently verified using annotation resources.
127Basic EASE Requirements
- Annotation keys: identifiers for each gene must
be loaded with the data into MeV.
- EASE file system: EASE uses a file system to link
annotation keys to biological themes.
128EASE File System
129EASE(Expression Analysis Systematic Explorer)
Hosack et al. Identifying biological themes
within lists of genes with EASE. Genome Biol.,
4:R70-R70.8, 2003.
NIAID graciously provided the foundation Java
classes upon which the MeV version was built.
130Coming Attractions
- Algorithm scripting
- Discriminant analysis
- Chromosome Viewers
- etc.