Microarrays: Common Analysis Approaches

Slides: 105
Provided by: PhilippeR4

Transcript and Presenter's Notes
1
Microarrays: Common Analysis Approaches
2
Outline
  • Missing Value Estimation
  • Differentially Expressed Genes
  • Clustering Algorithms
  • Principal Components Analysis

3
Missing Data Outline
  • Missing data problem, basic concepts and
    terminology
  • Classes of procedures
  • Case deletion
  • Single imputation
  • Filling with zeroes
  • Row averaging
  • SVD imputation
  • KNN imputation
  • Multiple imputation

4
The Missing Data Problem
  • Causes for missing data
  • Low resolution
  • Image corruption
  • Dust/scratched slides
  • Missing measurements
  • Why estimate missing values?
  • Many algorithms cannot deal with missing values
  • Distance measure-dependent algorithms(e.g.,
    clustering, similarity searches)

5
Basic concepts and terminology
Statistical overview (schematic): a population of
complete data, with parameter θ, is sampled; the
missing data mechanism turns the sample of complete
data (estimate θ̂s) into a sample of incomplete data
(estimate θ̂i). We need to estimate θ from the
incomplete data and investigate its performance over
repetitions of the sampling procedure.
6
Basic concepts
Y = sample data
f(Y | θ) = distribution of sample data
θ = parameters to be estimated
R = indicators of whether elements of Y are observed
or missing
g(R | Y) = missing data mechanism (maybe with other
params)
Y = (Yobs, Ymis)
Yobs = observed part of Y
Ymis = missing part of Y
Goal: propose methods to estimate θ from Yobs and
accurately assess its error
7
Basic concepts (cont.)
  • Classes of mechanisms (cf. Rubin, 1976,
    Biometrika)
  • Missing Completely At Random (MCAR)
  • g(R | Y) does not depend on Y
  • Missing At Random (MAR)
  • g(R | Y) may depend on Yobs but not on Ymis
  • Missing Not At Random (MNAR)
  • g(R | Y) depends on Ymis

8
Example
  • Suppose we measure age and income of a collection
    of individuals
  • MCAR
  • The dog ate the response sheets!
  • MAR
  • Probability that the income measurement is
    missing varies according to the age but not
    income
  • MNAR
  • Probability that an income is recorded varies
    according to the income level within each age group

Note: we can disprove MCAR by examining the data,
but we cannot disprove MAR or MNAR.
9
Outline
  • Missing data problem, basic concepts and
    terminology
  • Classes of procedures
  • Case deletion
  • Single imputation
  • Filling with zeroes
  • Row averaging
  • SVD imputation
  • KNN imputation
  • Multiple imputation

10
Classes of procedures Case Deletion
  • Remove subjects with missing values on any item
    needed for analysis
  • Advantages
  • Easy
  • Valid analysis under MCAR
  • OK if proportion of missing cases is small and
    they are not overly influential
  • Disadvantages
  • Can be inefficient, may discard a very high
    proportion of cases (5669 out of 6178 rows
    discarded in Spellman yeast data)
  • May introduce substantial bias, if missing data
    are not MCAR (complete cases may be
    un-representative of the population)

11
Classes of procedures Single Imputation (I)
  • Replace with zeroes
  • Fill-in all missing values with zeroes
  • Advantages
  • Easy
  • Disadvantages
  • Distorts the data disproportionately (changes
    statistical properties)
  • May introduce bias
  • Why zero?

12
Classes of procedures Single Imputation (II)
  • Row averaging
  • Replace missing values by the row average for
    that row
  • Advantages
  • Easy
  • Keeps same mean
  • Disadvantages
  • Distorts distributions and relationships between
    variables

[Scatter plot: illustrates how replacing missing
values with the row average distorts distributions
and relationships between variables]
13
Classes of procedures Single Imputation (III)
  • Hot deck imputation
  • Replace each missing value by a randomly drawn
    observed value
  • Advantages
  • Easy
  • Preserves distributions very well
  • Disadvantages
  • May distort relationships
  • Can use, e.g., similar rows to draw random
    values from (to help constrain distortion)
  • Depends on the definition of "similar"

14
Classes of procedures Single Imputation (IV)
  • Regression imputation
  • Fit regression to observed values, use it to
    obtain predictions for missing ones
  • SVD imputation
  • Fill missing entries with regressed values from a
    set of characteristic patterns, using
    coefficients determined by the proximity of the
    missing row to the patterns
  • KNN imputation (more later)
  • Isolate rows whose values are similar to those of
    the one with missing values (choosing (i)
    similarity measure, and (ii) size of this set)
  • Fill missing values with averages from this set
    of genes, with weights inversely proportional to
    similarities
  • Computationally intensive
  • May distort relationships between variables
    (could use Yimp + random residual)

15
Classes of procedures Multiple Imputation
  • Main Idea
  • Replace Ymis by M > 1 independent draws
  • Y1mis, …, YMmis ~ P(Ymis | Yobs)
  • Produce M different versions of complete data
  • Analyse each one in same fashion and combine
    results at the end, with standard error estimates
    (Rubin, 1987)
  • More difficult to implement
  • Requires (initially) more computations
  • More work involved in interpreting results

16
KNN Imputation
  • Troyanskaya et al., Bioinformatics, 2001
  • The Algorithm
  • 0. Given gene A with missing values
  • Find K other genes with values present in
    experiment 1, with expression most similar to A
    in other experiments
  • Weighted average of values in experiment 1 from
    the K closest genes is used as an estimate for
    the missing value in A
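The procedure above can be sketched in a few lines of Python (a simplified illustration, not the reference KNNimpute implementation; the matrix layout, the use of `None` for missing entries, and the inverse-distance weighting details are assumptions):

```python
import math

def knn_impute(matrix, k=10):
    """Fill each missing entry (None) with a weighted average of the
    values in the same column from the k most similar rows, weights
    inversely proportional to distance (a sketch of Troyanskaya et
    al.'s KNN imputation, not the reference implementation)."""
    def row_distance(a, b):
        shared = [(x, y) for x, y in zip(a, b)
                  if x is not None and y is not None]
        if not shared:
            return None
        # Euclidean distance over the mutually observed columns
        return math.sqrt(sum((x - y) ** 2 for x, y in shared) / len(shared))

    filled = [row[:] for row in matrix]
    for i, row in enumerate(matrix):
        for j, value in enumerate(row):
            if value is not None:
                continue
            # candidate neighbours: other rows observed in column j
            cands = []
            for i2, other in enumerate(matrix):
                if i2 != i and other[j] is not None:
                    d = row_distance(row, other)
                    if d is not None:
                        cands.append((d, other[j]))
            cands.sort(key=lambda t: t[0])
            nearest = cands[:k]
            if not nearest:
                continue  # nothing to impute from; leave as missing
            # inverse-distance weights (epsilon guards against d == 0)
            weights = [1.0 / (d + 1e-9) for d, _ in nearest]
            filled[i][j] = (sum(w * v for w, (_, v) in zip(weights, nearest))
                            / sum(weights))
    return filled
```

With k = 2 on a toy 3×3 matrix, a missing entry in a row nearly identical to another row is filled with a value close to that neighbour's entry.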

17
KNN Imputation Considerations
  • K the number of nearest neighbours
  • Method appears to be relatively insensitive to K
    within the range 10-20
  • Distance metric to be used for computing gene
    similarity
  • Troyanskaya et al.: Euclidean distance is sufficient
  • No clear comparison or reason; one would expect
    that the metric to be used depends on the type of
    experiment
  • Not recommended on matrices with less than four
    columns
  • Computationally intensive!
  • O(m²n) for m genes and n experiments
  • 3.23 minutes on a Pentium III 500 MHz for 6153
    genes, 14 experiments, with 10% of the entries
    missing

18
KNN Imputation Expression Profiler
19
Outline
  • Missing Value Estimation
  • Differentially Expressed Genes
  • Clustering Algorithms
  • Principal Components Analysis

20
Identifying Differentially Expressed Genes
Slides courtesy of John Quackenbush, TIGR
21
Two vs. Multiple conditions
  • Two conditions
    - t-test
    - Significance analysis of microarrays (SAM)
    - Volcano plots
    - ANOVA
  • Multiple conditions
    - Clustering
    - K-means
    - PCA

22
How Many Replicates??
n = 4(zα/2 + zβ)² / (δ / 1.4σ)²
where zα/2 and zβ are normal percentile values at
false positive rate α (Type I error rate) and false
negative rate β (Type II error rate), δ represents
the minimum detectable log₂ ratio and σ represents
the SD of log-ratio values. For α = 0.001 and
β = 0.05, we get zα/2 = -3.29 and zβ = -1.65.
Assume δ = 1.0 (2-fold change) and σ = 0.25,
→ n ≈ 12 samples (6 query and 6 control)
(Simon et al., Genetic Epidemiology 23 21-36,
2002)
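The sample-size calculation on this slide can be checked directly (a quick sketch; the z-values are taken from the slide rather than recomputed from the normal distribution):

```python
# z-values quoted on the slide for alpha = 0.001 (two-sided) and beta = 0.05
z_alpha2, z_beta = -3.29, -1.65
delta, sigma = 1.0, 0.25   # minimum detectable log2 ratio; SD of log ratios

# n = 4(z_alpha/2 + z_beta)^2 / (delta / 1.4*sigma)^2
n = 4 * (z_alpha2 + z_beta) ** 2 / (delta / (1.4 * sigma)) ** 2
print(round(n))  # 12 samples in total (6 query and 6 control)
```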
23
Some Concepts from Statistics
24
Probability Distributions
  • The probability of an event is the likelihood
    of its occurring.
  • It is sometimes computed as a relative frequency
    (rf), where rf = (number of occurrences of the
    event) / (total number of observations)

The probability of an event can sometimes be
inferred from a theoretical probability
distribution, such as a normal distribution.
25
Normal Distribution
26
  • Less than a 5% chance that the sample with mean
    s came from Population 1
  • s is significantly different from Mean 1 at the
    p < 0.05 significance level.
  • But we cannot reject the hypothesis that the
    sample came from Population 2

27
Probability and Expression Data
  • Many biological variables, such as height and
    weight, can reasonably be assumed to approximate
    the normal distribution.
  • But expression measurements? Probably not.
  • Fortunately, many statistical tests are
    considered to be fairly robust to violations of
    the normality assumption, and other assumptions
    used in these tests.
  • Randomization / resampling based tests can be
    used to get around the violation of the normality
    assumption.
  • Even when parametric statistical tests (the ones
    that make use of normal and other distributions)
    are valid, randomization tests are still useful.

28
Outline of a Randomisation Test
1. Compute the value of interest (i.e., the
test-statistic s) from your data set.
2. Make fake data sets from your original
data, by taking a random sub-sample of the data,
or by re-arranging the data in a random fashion.
Re-compute s from the fake data set.
29
Outline of a Randomisation Test (II)
3. Repeat step 2 many times (often several
hundred to several thousand times) and record the
fake s values from step 2.
4. Draw inferences about the significance of your
original s value by comparing it with the
distribution of the randomized (fake) s values.
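The four steps can be sketched as follows, using the absolute difference in group means as the test-statistic s (the choice of statistic, and of shuffling as the re-arrangement method, are illustrative assumptions):

```python
import random

def randomisation_p_value(data_a, data_b, n_iter=2000, seed=0):
    """Steps 1-4 of the outline for a difference-in-means statistic s:
    compare the observed s with its distribution over 'fake' data sets
    made by randomly re-arranging the pooled data."""
    rng = random.Random(seed)

    def s(a, b):  # step 1: the test-statistic of interest
        return abs(sum(a) / len(a) - sum(b) / len(b))

    observed = s(data_a, data_b)
    pooled = list(data_a) + list(data_b)
    n_extreme = 0
    for _ in range(n_iter):
        rng.shuffle(pooled)                 # step 2: make a "fake" data set
        fake_a = pooled[:len(data_a)]
        fake_b = pooled[len(data_a):]
        if s(fake_a, fake_b) >= observed:   # steps 3-4: record and compare
            n_extreme += 1
    return n_extreme / n_iter
```

A small returned value means few random re-arrangements produce a statistic as extreme as the observed one.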
30
Outline of a Randomisation Test (III)
  • Rationale
  • Ideally, we want to know the behavior of the
    larger population from which the sample is drawn,
    in order to make statistical inferences.
  • Here, we don't know that the larger population
    behaves like a normal distribution, or some
    other idealized distribution. All we have to work
    with are the data in hand.
  • Our fake data sets are our best guess about
    this behavior (i.e., if we had been pulling data
    at random from an infinitely large population, we
    might expect to get a distribution similar to
    what we get by pulling random sub-samples, or by
    reshuffling the order of the data in our sample)

31
The Problem of Multiple Testing (I)
  • Let's imagine there are 10,000 genes on a chip,
    and
  • none of them is differentially expressed.
  • Suppose we use a statistical test for
    differential expression, where we consider a gene
    to be differentially expressed if it meets the
    criterion at a p-value of p < 0.05.

32
The Problem of Multiple Testing (II)
  • Let's say that applying this test to gene G1
    yields a p-value of p = 0.01
  • Remember that a p-value of 0.01 means that there
    is a 1% chance that the gene is not
    differentially expressed, i.e.,
  • Even though we conclude that the gene is
    differentially expressed (because p < 0.05),
    there is a 1% chance that our conclusion is
    wrong.
  • We might be willing to live with such a low
    probability of being wrong
  • BUT .....

33
The Problem of Multiple Testing (III)
  • We are testing 10,000 genes, not just one!!!
  • Even though none of the genes is differentially
    expressed, about 5% of the genes (i.e., 500
    genes) will be erroneously concluded to be
    differentially expressed, because we have decided
    to live with a p-value of 0.05
  • If only one gene were being studied, a 5% margin
    of error might not be a big deal, but 500 false
    conclusions in one study? That doesn't sound too
    good.

34
The Problem of Multiple Testing (IV)
  • There are tricks we can use to reduce the
    severity of this problem.
  • They all involve slashing the p-value for each
    test (i.e., gene), so that while the critical
    p-value for the entire data set might still equal
    0.05, each gene will be evaluated at a lower
    p-value.
  • We'll go into some of these techniques later.

35
The Problem of Multiple Testing (V)
  • Don't get too hung up on p-values.
  • Ultimately, what matters is biological relevance.
  • P-values should help you evaluate the strength of
    the evidence, rather than being used as an
    absolute yardstick of significance.
  • Statistical significance is not necessarily the
    same as biological significance.

36
Finding Significant Genes
  • Assume we will compare two conditions with
    multiple replicates for each class
  • Our goal is to find genes that are significantly
    different between these classes
  • These are the genes that we will use for later
    data mining

37
Finding Significant Genes (II)
  • Average Fold Change Difference for each gene
  • suffers from being arbitrary and not taking into
    account systematic variation in the data

38
Finding Significant Genes (III)
  • t-test for each gene
  • Tests whether the means of the query and
    reference groups are the same
  • Essentially measures signal-to-noise
  • Calculate p-value (permutations or distributions)
  • May suffer from intensity-dependent effects

39
T-Tests
40
T-Tests (I)
  • Assign experiments to two groups, e.g., in the
    expression matrix below, assign Experiments 1, 2
    and 5 to group A, and experiments 3, 4 and 6 to
    group B.

2. Question: Is the mean expression level of a gene
in group A significantly different from the mean
expression level in group B?
41
T-Tests (II)
3. Calculate t-statistic for each gene
4. Calculate probability value of the
t-statistic for each gene either from A.
Theoretical t-distribution OR B.
Permutation tests.
42
T-Tests (III)
Permutation tests
i) For each gene, compute t-statistic
ii) Randomly shuffle the values of the gene
between groups A and B, such that the reshuffled
groups A and B respectively have the same number
of elements as the original groups A and B.
43
T-Tests (IV)
Permutation tests - continued
iii) Compute the t-statistic for the randomized
gene. iv) Repeat steps ii-iii n times (where n is
specified by the user). v) Let x = the number of
times the absolute value of the original
t-statistic exceeds the absolute value of the
randomized t-statistic over the n randomizations.
vi) Then, the p-value associated with the gene
= 1 - (x/n)
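Steps (i)-(vi) might look like this in Python (a sketch; the equal-variance form of the t-statistic is an assumption, since the slides do not specify which form is used):

```python
import math
import random

def t_stat(a, b):
    """Two-sample t-statistic (pooled, equal-variance form)."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    va = sum((x - ma) ** 2 for x in a) / (len(a) - 1)
    vb = sum((x - mb) ** 2 for x in b) / (len(b) - 1)
    sp = math.sqrt(((len(a) - 1) * va + (len(b) - 1) * vb)
                   / (len(a) + len(b) - 2))
    return (ma - mb) / (sp * math.sqrt(1 / len(a) + 1 / len(b)))

def permutation_p(a, b, n=1000, seed=0):
    """p = 1 - (x/n), where x counts randomizations in which the
    original |t| exceeds the randomized |t| (steps ii-vi)."""
    rng = random.Random(seed)
    observed = abs(t_stat(a, b))
    pooled = list(a) + list(b)
    x = 0
    for _ in range(n):
        rng.shuffle(pooled)  # step ii: reshuffle values between groups
        if observed > abs(t_stat(pooled[:len(a)], pooled[len(a):])):
            x += 1
    return 1 - x / n
```

Well-separated groups give a small p-value; identical groups give p = 1.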
44
T-Tests (V)
  • 5. Determine whether a gene's expression levels
    are significantly different between the two
    groups by one of three methods
  • A) Just alpha (α, a significance level): If the
    calculated p-value for a gene is less than or
    equal to the user-input α (critical p-value), the
    gene is considered significant.
  • OR
  • Use Bonferroni corrections to reduce the
    probability of erroneously classifying
    non-significant genes as significant.
  • B) Standard Bonferroni correction: The user-input
    alpha is divided by the total number of genes to
    give a critical p-value that is used as above:
    p-critical = α/N.

45
T-Tests (VI)
5C) Adjusted Bonferroni: i) The t-values for
all the genes are ranked in descending order.
ii) For the gene with the highest t-value, the
critical p-value becomes α/N, where N is the
total number of genes; for the gene with the
second-highest t-value, the critical p-value
will be α/(N-1), and so on.
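Both corrections can be sketched as follows (ranking by p-value rather than t-value, which gives the same significance ordering; the per-rank thresholds of the slide are applied literally, without the early stopping used in Holm's step-down procedure):

```python
def bonferroni(p_values, alpha=0.05):
    """Standard Bonferroni: gene i is significant if p_i <= alpha / N."""
    n = len(p_values)
    return [p <= alpha / n for p in p_values]

def adjusted_bonferroni(p_values, alpha=0.05):
    """Adjusted Bonferroni of the slides: the most significant gene is
    tested against alpha/N, the next against alpha/(N-1), and so on."""
    n = len(p_values)
    order = sorted(range(n), key=lambda i: p_values[i])  # most significant first
    significant = [False] * n
    for k, i in enumerate(order):
        significant[i] = p_values[i] <= alpha / (n - k)
    return significant
```

For p-values [0.001, 0.02, 0.5], the standard correction (threshold 0.05/3) accepts only the first gene, while the adjusted version also accepts the second (0.02 ≤ 0.05/2).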
46
Finding Significant Genes (IV)
  • Significance Analysis of Microarrays (SAM)
    - Uses a modified t-test by estimating and adding
      a small positive constant to the denominator
    - Significant genes are those which exceed the
      expected values from permutation analysis.

47
SAM
  • SAM can be used to select significant genes based
    on differential expression between sets of
    conditions
  • Currently implemented for two-class unpaired
    design i.e., we can select genes whose mean
    expression level is significantly different
    between two groups of samples (analogous to
    t-test).
  • Stanford University, Rob Tibshirani
    http://www-stat.stanford.edu/~tibs/SAM/index.html

48
SAM
  • SAM gives estimates of the False Discovery Rate
    (FDR), which is the proportion of genes likely to
    have been wrongly identified by chance as being
    significant.
  • It is a very interactive algorithm that allows
    users to dynamically change thresholds for
    significance (through the tuning parameter delta)
    after looking at the distribution of the test
    statistic.
  • The ability to dynamically alter the input
    parameters based on immediate visual feedback,
    even before completing the analysis, should make
    the data-mining process more sensitive.

49
SAM Two-class
  • Assign experiments to two groups - in the
    expression matrix below, Experiments 1, 2 and 5
    to group A; Experiments 3, 4 and 6 to group B

2. Question: Is the mean expression level of a gene
in group A significantly different from the mean
expression level in group B?
50
SAM Two-class
Permutation tests
i) For each gene, compute d-value (analogous to
t-statistic). This is the observed d-value for
that gene.
ii) Randomly shuffle the values of the gene
between groups A and B, such that the
reshuffled groups A and B have the same
number of elements as the original groups A and
B. Compute the d-value for each randomized
gene
51
SAM Two-class
  • Repeat step (ii) many times, so that each gene
    has many randomized d-values. Take the average of
    the randomized d-values for each gene. This is
    the expected d-value of that gene.
  • Plot the observed d-values vs. the expected
    d-values
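A sketch of the d-value and its permutation-based expectation (the exact form of the denominator and the constant s0 = 0.1 are illustrative assumptions; SAM estimates s0 from the data):

```python
import math
import random

def d_value(a, b, s0=0.1):
    """SAM-style d-statistic: like a t-statistic, but with a small
    positive constant s0 added to the denominator."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    va = sum((x - ma) ** 2 for x in a) / (len(a) - 1)
    vb = sum((x - mb) ** 2 for x in b) / (len(b) - 1)
    s = math.sqrt(va / len(a) + vb / len(b))
    return (ma - mb) / (s + s0)

def expected_d(a, b, n_perm=200, seed=0, s0=0.1):
    """Average d over random reshufflings of the values between groups
    (the 'expected' d-value plotted on the x-axis)."""
    rng = random.Random(seed)
    pooled = list(a) + list(b)
    total = 0.0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        total += d_value(pooled[:len(a)], pooled[len(a):], s0)
    return total / n_perm
```

For a clearly differential gene, the observed d is large while the expected d sits near zero, so the point falls far from the observed = expected line.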

52
SAM Two-class
The more a gene deviates from the observed =
expected line, the more likely it is to be
significant. Any gene beyond the first gene in
the +ve or -ve direction on the x-axis (including
the first gene), whose observed d-value exceeds
the expected by at least delta, is considered
significant.
53
SAM Two-class
  • For each permutation of the data, compute the
    number of positive and negative significant genes
    for a given delta. The median number of
    significant genes from these permutations is the
    median False Discovery Rate.
  • The rationale: any gene designated as significant
    from the randomized data is being picked up
    purely by chance (i.e., falsely discovered).
    Therefore, the median number picked up over many
    randomisations is a good estimate of the false
    discovery rate.

54
Finding Significant Genes (V)
Volcano Plots
  • Effect vs. Significance
  • Selections of items that have both a large effect
    and are highly significant can be identified
    easily.

55
Volcano Plots
Using log10 for Y axis

Using log2 for X axis
56
Volcano Plots (II)
Using log10 for Y axis

Using log2 for X axis
57
Finding Significant Genes (VI)
  • Analysis of Variance (ANOVA)
    - Which genes are most significant for separating
      classes of samples?
    - Calculate p-value (permutations or distributions)
    - Reduces to a t-test for 2 samples
    - May suffer from intensity-dependent effects

58
Multiple Conditions/Experiments
  • Goal is to identify genes (or conditions) which
    have similar patterns of expression
  • This is a problem in data mining
  • Clustering Algorithms are most widely used
  • All depend on how one measures distance

59
Pattern analysis
60
Expression Vectors
  • Each gene is represented by a vector whose
    coordinates are its log(ratio) values in each
    experiment
  • - x = log(ratio) in exp 1
  • - y = log(ratio) in exp 2
  • - z = log(ratio) in exp 3
  • - etc.

61
Expression Vectors
  • Each gene is represented by a vector whose
    coordinates are its log(ratio) values in each
    experiment
  • - x = log(ratio) in exp 1
  • - y = log(ratio) in exp 2
  • - z = log(ratio) in exp 3
  • - etc.
  • For example, if we do six experiments,
  • - Gene1 = (-1.2, -0.5, 0, 0.25, 0.75, 1.4)
  • - Gene2 = (0.2, -0.5, 1.2, -0.25, -1.0, 1.5)
  • - Gene3 = (1.2, 0.5, 0, -0.25, -0.75, -1.4)
  • - etc.

62
Expression Matrix
Exp 1 Exp 2 Exp 3 Exp 4 Exp 5 Exp 6
  • These gene expression vectors of log(ratio)
    values can be used to construct an expression
    matrix
  • Gene1 -1.2 -0.5 0 0.25 0.75 1.4
  • Gene2 0.2 -0.5 1.2 -0.25 -1.0 1.5
  • Gene3 1.2 0.5 0 -0.25 -0.75 -1.4
  • This is often represented as a red/green colored
    matrix

63
Expression Matrix
The Expression Matrix is a representation of data
from multiple microarray experiments.
Each element is a log ratio, usually log 2
(Cy5/Cy3)
Black indicates a log ratio of zero ( Cy5 = Cy3 )
Green indicates a negative log ratio ( Cy5 < Cy3 )
Gray indicates missing data
Red indicates a positive log ratio ( Cy5 > Cy3 )
64
Expression Vectors as points inExpression Space
65
Distance measures
  • Distances are measured between expression
    vectors
  • Distance measures define the way we measure
    distances
  • Many different ways to measure distance
  • - Euclidean distance- Manhattan distance
  • - Pearson correlation- Spearman correlation
  • - etc.
  • Each has different properties and can reveal
    different features of the data

66
Euclidean distance
  • Measures the 'as-the-crow-flies' distance
  • Deriving the Euclidean distance between two data
    points involves computing the square root of
    the sum of the squares of the differences
    between corresponding values ( Pythagoras
    theorem )

67
Manhattan distance
  • Computes the distance that would be traveled to
    get from one data point to the other if a
    grid-like path is followed
  • Manhattan distance between two items is the sum
    of the absolute differences of their
    corresponding components

68
Pearson and Pearson squared
  • Pearson Correlation measures the similarity in
    shape between two profiles
  • Pearson Squared distance measures the similarity
    in shape between two profiles, but can also
    capture inverse relationships

69
Spearman Rank Correlation
  • Spearman Rank Correlation measures the
    correlation between two sequences of values.
  • The two sequences are ranked separately and the
    differences in rank are calculated at each
    position i.
  • Use Spearman Correlation to cluster together
    genes whose expression profiles have similar
    shapes or show similar general trends, but
    whose expression levels may be very different

rs = 1 - 6 Σ di² / (n(n² - 1)), where di is the
difference between the ranks of Xi and Yi, the ith
values of sequences X and Y respectively
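The four distance measures above can be sketched as follows (ties are given arbitrary rank order in this simplified Spearman; a full implementation would average tied ranks):

```python
import math

def euclidean(x, y):
    # square root of the sum of squared differences (Pythagoras)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def manhattan(x, y):
    # sum of absolute differences of corresponding components
    return sum(abs(a - b) for a, b in zip(x, y))

def pearson(x, y):
    # similarity in shape: covariance scaled by both standard deviations
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def spearman(x, y):
    # rank each sequence separately, then correlate the ranks
    def ranks(v):
        order = sorted(range(len(v)), key=lambda i: v[i])
        r = [0] * len(v)
        for rank, i in enumerate(order):
            r[i] = rank + 1
        return r
    return pearson(ranks(x), ranks(y))
```

Note how Spearman gives a perfect correlation to any two monotonically related profiles, however different their magnitudes.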
70
Distance Matrix
Gene1 Gene2 Gene3 Gene4 Gene5 Gene6
  • Once a distance metric has been selected, the
    starting point for all clustering methods is a
    distance matrix
  • Gene1 0 1.5 1.2 0.25 0.75 1.4
  • Gene2 1.5 0 1.3 0.55 2.0 1.5
  • Gene3 1.2 1.3 0 1.3 0.75 0.3
  • Gene4 0.25 0.55 1.3 0 0.25 0.4
  • Gene5 0.75 2.0 0.75 0.25 0 1.2
  • Gene6 1.4 1.5 0.3 0.4 1.2 0
  • The elements of this matrix are the pair-wise
    distances. ( matrix is symmetric around the
    diagonal )

71
Hierarchical Clustering
1. Calculate the distance between all genes. Find
the smallest distance. If several pairs share the
same similarity, use a predetermined rule to
decide between alternatives.
2. Fuse the two selected clusters to produce a
new cluster that now contains at least two
objects. Calculate the distance between the new
cluster and all other clusters.
3. Repeat steps 1 and 2 until only a single
cluster remains.
4. Draw a tree representing the results.
72
Hierarchical Clustering
73
Hierarchical Clustering
74
Hierarchical Tree
75
Agglomerative Linkage Methods
  • Linkage methods are rules that determine which
    elements (clusters) should be linked.
  • Three linkage methods that are commonly used
    - Single Linkage - Average Linkage -
    Complete Linkage

76
Single Linkage
Cluster-to-cluster distance is defined as the
minimum distance between members of one cluster
and members of another cluster. Single linkage
tends to create elongated clusters with
individual genes chained onto clusters.
DAB = min( d(ui, vj) ) where u ∈ A and v ∈ B,
for all i = 1 to NA and j = 1 to NB
77
Average Linkage
Cluster-to-cluster distance is defined as the
average distance between all members of one
cluster and all members of another cluster.
Average linkage has a slight tendency to produce
clusters of similar variance.
DAB = 1/(NA·NB) Σ Σ ( d(ui, vj) ) where u ∈ A and
v ∈ B, for all i = 1 to NA and j = 1 to NB
78
Complete Linkage
Cluster-to-cluster distance is defined as the
maximum distance between members of one cluster
and members of another cluster. Complete
linkage tends to create clusters of similar size
and variability.
DAB = max( d(ui, vj) ) where u ∈ A and v ∈ B,
for all i = 1 to NA and j = 1 to NB
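The three linkage rules can be written directly from the formulas (here d is any pairwise distance function and clusters are lists of items):

```python
def single_linkage(d, A, B):
    """D_AB = min over u in A, v in B of d(u, v)."""
    return min(d(u, v) for u in A for v in B)

def average_linkage(d, A, B):
    """D_AB = (1 / (N_A * N_B)) * sum of d(u, v) over all pairs."""
    return sum(d(u, v) for u in A for v in B) / (len(A) * len(B))

def complete_linkage(d, A, B):
    """D_AB = max over u in A, v in B of d(u, v)."""
    return max(d(u, v) for u in A for v in B)
```

For 1-D clusters A = [0, 1] and B = [4, 6] with d(u, v) = |u - v|, single linkage gives 3, average linkage 4.5, and complete linkage 6.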
79
Comparison of Linkage Methods
Average
Single
Complete
80
K-Means/Medians Clustering
81
K-Means/Medians Clustering
3. Calculate mean/median expression profile of
each cluster
4. Shuffle genes among clusters such that each
gene is now in the cluster whose mean expression
profile (calculated in step 3) is the closest to
that gene's expression profile
5. Repeat steps 3 and 4 until genes cannot be
shuffled around any more, OR a user-specified
number of iterations has been reached
K-Means is most useful when the user has an a
priori hypothesis about the number of clusters
the genes should group into.
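Steps 3-5 can be sketched as follows (the random initial partition, the use of means rather than medians, and squared Euclidean distance are assumptions; steps 1-2 of the full algorithm are not shown on this slide):

```python
import random

def k_means(vectors, k, n_iter=100, seed=0):
    """Steps 3-5 of the slides, starting from a random partition."""
    rng = random.Random(seed)
    assignment = [rng.randrange(k) for _ in vectors]
    for _ in range(n_iter):
        # step 3: mean expression profile of each cluster
        centroids = []
        for c in range(k):
            members = [v for v, a in zip(vectors, assignment) if a == c]
            if not members:                      # re-seed an empty cluster
                members = [rng.choice(vectors)]
            centroids.append([sum(col) / len(members)
                              for col in zip(*members)])
        # step 4: move each gene to the cluster with the closest mean profile
        def closest(v):
            return min(range(k),
                       key=lambda c: sum((x - m) ** 2
                                         for x, m in zip(v, centroids[c])))
        new_assignment = [closest(v) for v in vectors]
        # step 5: stop when no gene can be shuffled any more
        if new_assignment == assignment:
            break
        assignment = new_assignment
    return assignment
```

On two well-separated blobs of points, the returned assignment puts each blob in its own cluster.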
82
Clustering Comparison
  • MOTIVATION Using different clustering methods
    often produces different results. How do these
    clustering results relate to each other?
  • → A clustering comparison method that finds a
    many-to-many correspondence between two different
    clustering results.
  • comparison of two flat clusterings
  • comparison of a flat and a hierarchical
    clustering.

83
Comparison of flat clusterings
C1 = {A1, A2, A3, A4}
84
Indices to measure the overlapping
  • Intersection size
  • Simpson's index
  • Jaccard index
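The three overlap indices, under their standard definitions (the slides do not give formulas, so these are assumptions): intersection size |A∩B|, Simpson's |A∩B| / min(|A|, |B|), and Jaccard |A∩B| / |A∪B|:

```python
def intersection_size(a, b):
    # number of items the two clusters share
    return len(set(a) & set(b))

def simpson(a, b):
    # overlap relative to the smaller cluster
    return intersection_size(a, b) / min(len(set(a)), len(set(b)))

def jaccard(a, b):
    # overlap relative to the union of both clusters
    return intersection_size(a, b) / len(set(a) | set(b))
```

For clusters {1, 2, 3, 4} and {3, 4, 5}, the intersection size is 2, Simpson's index is 2/3 and the Jaccard index is 2/5.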

85
Comparison of flat and hierarchical clusterings
Selecting a point to cut the dendrogram leads to s
disjoint groups.
86
Results
  • ARTIFICIAL DATA Four data sets with four
    clusters, constructed with the same four seeds
    and different levels of noise.
  • 1000 genes, 10 conditions
  • d = 20 initial partitions

87
(No Transcript)
88
(No Transcript)
89
Visualisation in Expression Profiler
90
Outline
  • Missing Value Estimation
  • Differentially Expressed Genes
  • Clustering Algorithms
  • Principal Components Analysis

91
PCA (Dimensionality Reduction Methods)
92
Outline
  • Dimensionality Problem
  • Techniques Methods
  • Multidimensional Scaling
  • Eigenanalysis-based ordination methods
  • Principal Component Analysis (PCA)
  • Correspondence Analysis (CA)

93
Dimensionality problem
  • Problem?
  • Curse of dimensionality
  • Convergence of any estimator to the true value of
    a smooth function on a space of high dimension is
    very slow
  • In other words, need many observations to obtain
    a good estimate of gene function
  • Blessing? Very few things really matter
  • Solutions
  • Statistical techniques (corrections, etc.)
  • Reduce dimensionality
  • Ignore non-variable genes
  • Feature subset selection
  • Eliminate coordinates that are less relevant

94
Multidimensional Scaling
Idea place data in a low-dimensional space so
that similar objects are close to each other.
  • The Algorithm (roughly)
  • Assign points to arbitrary coordinates in
    p-dimensional space.
  • Compute all-against-all distances in this space,
    to form a matrix D′.
  • Compare D′ with the input matrix D by evaluating
    the stress function. The smaller the value, the
    greater the correspondence between the two.
  • Adjust coordinates of each point in the direction
    that best minimizes stress.
  • Repeat steps 2 through 4 until stress won't get
    any lower.
  • However
  • Computationally intensive
  • Axes are meaningless, orientation of the MDS map
    is arbitrary
  • Difficult to interpret
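Step 3's stress function can be sketched as Kruskal's stress-1 (one common choice; the slides do not specify which stress formulation is meant):

```python
import math

def stress(D_input, coords):
    """Kruskal's stress-1 between the input distance matrix D and the
    distances realised by the current low-dimensional coordinates.
    Zero means the configuration reproduces D exactly."""
    def dist(p, q):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

    num, den = 0.0, 0.0
    n = len(coords)
    for i in range(n):
        for j in range(i + 1, n):
            d_hat = dist(coords[i], coords[j])   # distance in the MDS map
            num += (D_input[i][j] - d_hat) ** 2  # disagreement with input D
            den += D_input[i][j] ** 2
    return math.sqrt(num / den)
```

A configuration that realises the input distances exactly scores 0; step 4 of the algorithm would nudge coordinates to push this value down.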

95
Eigenanalysis Background
Basic Concepts: An eigenvalue and eigenvector of a
square matrix A are a scalar λ and a nonzero
vector x such that Ax = λx
Q: What is a matrix? A: A linear
transformation. Q: What are eigenvectors? A:
Directions in which the transformation takes
place the most. Exploratory example:
EigenExplorer
96
Eigenanalysis Background
Finding eigenvalues: Ax = λx ⇒ (A - λI)x = 0
  • Interpreting eigenvalues
  • Eigenvectors of a matrix provide a rigid rotation
    towards the directions of highest variance
  • Can pick the N largest eigenvalues, capture a
    large proportion of the variance and represent
    every value in the original matrix as a linear
    combination of these values, e.g., xi = a1λ1 +
    . . . + aNλN
  • Call this collection {aj} the eigengene/eigenarray
    (depending on which way we compute these)

97
PCA
  • 1. PCA simplifies the views of the data.
  • 2. Suppose we have measurements for each gene on
    multiple experiments.
  • 3. Suppose some of the experiments are correlated.
  • 4. PCA will ignore the redundant experiments, and
    will take a weighted average of some of the
    experiments, thus possibly making the trends in
    the data more interpretable.
  • 5. The components can be thought of as axes in
    n-dimensional space, where n is the number of
    components. Each axis represents a different
    trend in the data.
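A minimal sketch of finding the first such axis (power iteration on the covariance matrix is one of several ways to do this, and not necessarily how any particular package implements PCA):

```python
def first_principal_component(data, n_iter=100):
    """Mean-centre the points (rows) and extract the leading eigenvector
    of their covariance matrix by power iteration; projecting each point
    onto it gives its score on the first component."""
    n, p = len(data), len(data[0])
    means = [sum(row[j] for row in data) / n for j in range(p)]
    centred = [[row[j] - means[j] for j in range(p)] for row in data]
    # p x p covariance matrix of the centred data
    cov = [[sum(r[a] * r[b] for r in centred) / (n - 1)
            for b in range(p)] for a in range(p)]
    v = [1.0] * p                       # arbitrary starting vector
    for _ in range(n_iter):
        w = [sum(cov[a][b] * v[b] for b in range(p)) for a in range(p)]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]       # Av/|Av| converges to the top eigenvector
    scores = [sum(c * vi for c, vi in zip(row, v)) for row in centred]
    return v, scores
```

For points lying along the line y = 2x, the recovered axis points in the (1, 2) direction and the scores order the points along that trend.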

98
PCA
Data points resolved along 3 principal component
axes.
In this example:
x-axis could mean a continuum from over- to
under-expression;
y-axis could mean that "blue" genes are
over-expressed in the first five expts and
under-expressed in the remaining expts, while
"brown" genes are under-expressed in the first
five expts, and over-expressed in the remaining
expts;
z-axis might represent different cyclic patterns,
e.g., "red" genes might be over-expressed in
odd-numbered expts and under-expressed in
even-numbered ones, whereas the opposite is true
for "purple" genes.
Interpretation of components is somewhat
subjective.
99
(No Transcript)
100
(No Transcript)
101
Projecting the data into a lower dimensional
space can help visualize relationships
102
Projecting the data into a lower dimensional
space can help visualize relationships
103
PCA in Expression Profiler
104
Further Reading
  • MDS
  • http://www.analytictech.com/borgatti/mds.htm
  • PCA, SVD
  • http://www.statsoftinc.com/textbook/stfacan.html
  • http://linneus20.ethz.ch:8080/2_2_1.html
  • Alter et al., Singular value decomposition for
    genome-wide expression data processing and
    modelling, PNAS, 2000
  • COA
  • Fellenberg et al., Correspondence analysis
    applied to microarray data, PNAS, 2001
  • General ordination
  • http://www.okstate.edu/artsci/botany/ordinate/
  • Legendre P. and Legendre L., Numerical Ecology,
    1998