Title: Microarrays: Common Analysis Approaches
1. Microarrays: Common Analysis Approaches
2. Outline
- Missing Value Estimation
- Differentially Expressed Genes
- Clustering Algorithms
- Principal Components Analysis
3. Missing Data Outline
- Missing data problem, basic concepts and terminology
- Classes of procedures
  - Case deletion
  - Single imputation
    - Filling with zeroes
    - Row averaging
    - SVD imputation
    - KNN imputation
  - Multiple imputation
4. The Missing Data Problem
- Causes of missing data
  - Low resolution
  - Image corruption
  - Dust/scratched slides
  - Missing measurements
- Why estimate missing values?
  - Many algorithms cannot deal with missing values
  - Distance measure-dependent algorithms (e.g., clustering, similarity searches)
5. Basic concepts and terminology
Statistical overview (missing data mechanism):
- Population of complete data: parameter θ
- Sample of complete data: estimate of θ
- Sample of incomplete data: estimate of θ
We need to estimate θ from the incomplete data and investigate the estimator's performance over repetitions of the sampling procedure.
6. Basic concepts
- Y: sample data
- f(Y|θ): distribution of the sample data
- θ: parameters to be estimated
- R: indicators of whether elements of Y are observed or missing
- g(R|Y): missing data mechanism (maybe with other parameters)
- Y = (Yobs, Ymis), where Yobs is the observed part of Y and Ymis the missing part
Goal: propose methods to estimate θ from Yobs and accurately assess the error of the estimate.
7. Basic concepts (cont.)
- Classes of mechanisms (cf. Rubin, 1976, Biometrika)
  - Missing Completely At Random (MCAR): g(R|Y) does not depend on Y
  - Missing At Random (MAR): g(R|Y) may depend on Yobs but not on Ymis
  - Missing Not At Random (MNAR): g(R|Y) depends on Ymis
8. Example
- Suppose we measure the age and income of a collection of individuals
  - MCAR: the dog ate the response sheets!
  - MAR: the probability that the income measurement is missing varies according to age, but not income
  - MNAR: the probability that an income is recorded varies according to the income level within each age group
Note: we can disprove MCAR by examining the data, but we cannot disprove MAR or MNAR.
9. Outline
- Missing data problem, basic concepts and terminology
- Classes of procedures
  - Case deletion
  - Single imputation
    - Filling with zeroes
    - Row averaging
    - SVD imputation
    - KNN imputation
  - Multiple imputation
10. Classes of procedures: Case Deletion
- Remove subjects with missing values on any item needed for analysis
- Advantages
  - Easy
  - Valid analysis under MCAR
  - OK if the proportion of missing cases is small and they are not overly influential
- Disadvantages
  - Can be inefficient; may discard a very high proportion of cases (5669 out of 6178 rows discarded in the Spellman yeast data)
  - May introduce substantial bias if the missing data are not MCAR (complete cases may be unrepresentative of the population)
11. Classes of procedures: Single Imputation (I)
- Replace with zeroes
  - Fill in all missing values with zeroes
- Advantages
  - Easy
- Disadvantages
  - Distorts the data disproportionately (changes statistical properties)
  - May introduce bias
  - Why zero?
12. Classes of procedures: Single Imputation (II)
- Row averaging
  - Replace missing values by the average of the observed values in that row
- Advantages
  - Easy
  - Keeps the same row mean
- Disadvantages
  - Distorts distributions and relationships between variables
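The two single-imputation schemes above (zero fill and row averaging) can be sketched in a few lines. This is an illustrative sketch, assuming missing entries are represented as None in a plain list-of-lists expression matrix:

```python
def impute_zeros(matrix):
    """Fill every missing entry (None) with zero."""
    return [[0.0 if v is None else v for v in row] for row in matrix]

def impute_row_average(matrix):
    """Fill missing entries with the mean of the observed values in that row."""
    out = []
    for row in matrix:
        observed = [v for v in row if v is not None]
        mean = sum(observed) / len(observed) if observed else 0.0
        out.append([mean if v is None else v for v in row])
    return out
```

Note that the row mean of an imputed row equals the mean of its observed values, which is why row averaging "keeps the same mean" while still distorting the distribution.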
13. Classes of procedures: Single Imputation (III)
- Hot deck imputation
  - Replace each missing value by a randomly drawn observed value
- Advantages
  - Easy
  - Preserves distributions very well
- Disadvantages
  - May distort relationships
  - Can draw the random values from similar rows to help constrain the distortion, but this depends on the definition of "similar"
14. Classes of procedures: Single Imputation (IV)
- Regression imputation
  - Fit a regression to the observed values and use it to obtain predictions for the missing ones
- SVD imputation
  - Fill missing entries with values regressed from a set of characteristic patterns, using coefficients determined by the proximity of the row with missing values to those patterns
- KNN imputation (more later)
  - Isolate rows whose values are similar to those of the one with missing values (choosing (i) a similarity measure and (ii) the size of this set)
  - Fill missing values with averages from this set of genes, with weights inversely proportional to the distances
  - Computationally intensive
- May distort relationships between variables (could use Y_imp + random residual)
15. Classes of procedures: Multiple Imputation
- Main idea
  - Replace Ymis by M > 1 independent draws: Ymis(1), ..., Ymis(M) ~ P(Ymis | Yobs)
  - Produce M different versions of the complete data
  - Analyse each one in the same fashion and combine the results at the end, with standard error estimates (Rubin, 1987)
- More difficult to implement
- Requires (initially) more computations
- More work involved in interpreting the results
16. KNN Imputation
- Troyanskaya et al., Bioinformatics, 2001
- The algorithm, given gene A with a missing value in experiment 1:
  1. Find the K other genes with values present in experiment 1 whose expression is most similar to A in the other experiments
  2. Use a weighted average of the values in experiment 1 from the K closest genes as the estimate for the missing value in A
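The two steps above can be sketched as follows. This is a simplified illustration of the KNN-imputation idea, not Troyanskaya et al.'s reference implementation: missing values are None, distances are Euclidean over the experiments observed in both genes, and weights are taken inversely proportional to distance:

```python
import math

def knn_impute(matrix, k=10):
    """Fill each missing entry with a weighted average of the values, in the
    same column, of the k most similar rows (weight = 1/distance)."""
    def dist(a, b):
        # Euclidean distance over the columns observed in both rows
        shared = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
        if not shared:
            return float("inf")
        return math.sqrt(sum((x - y) ** 2 for x, y in shared))

    imputed = [row[:] for row in matrix]
    for i, row in enumerate(matrix):
        for j, val in enumerate(row):
            if val is not None:
                continue
            # candidate neighbours: other rows with column j observed
            cands = sorted((dist(row, other), other[j])
                           for r, other in enumerate(matrix)
                           if r != i and other[j] is not None)
            nearest = cands[:k]
            weights = [1.0 / (d + 1e-9) for d, _ in nearest]
            imputed[i][j] = (sum(w * v for w, (_, v) in zip(weights, nearest))
                             / sum(weights))
    return imputed
```

The O(m²) all-against-all distance computation is what makes the method computationally intensive on genome-scale matrices.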
17. KNN Imputation: Considerations
- K, the number of nearest neighbours
  - The method appears to be relatively insensitive to K within the range 10-20
- Distance metric used for computing gene similarity
  - Troyanskaya et al.: Euclidean is sufficient
  - No clear comparison or rationale is given; one would expect the best metric to depend on the type of experiment
- Not recommended on matrices with fewer than four columns
- Computationally intensive!
  - O(m²n) for m rows (genes) and n columns (experiments)
  - 3.23 minutes on a Pentium III 500 MHz for 6153 genes and 14 experiments with 10% of the entries missing
18. KNN Imputation in Expression Profiler
19. Outline
- Missing Value Estimation
- Differentially Expressed Genes
- Clustering Algorithms
- Principal Components Analysis
20. Identifying Differentially Expressed Genes
Slides courtesy of John Quackenbush, TIGR
21. Two vs. Multiple Conditions
- Two conditions
  - t-test
  - Significance analysis of microarrays (SAM)
  - Volcano plots
  - ANOVA
- Multiple conditions
  - Clustering
  - K-means
  - PCA
22. How Many Replicates?
n = 4 (z_{α/2} + z_β)² / (δ / (1.4 σ))²
where z_{α/2} and z_β are the normal percentile values at false positive rate α (the Type I error rate) and false negative rate β (the Type II error rate), δ is the minimum detectable log2 ratio, and σ is the SD of the log ratio values. For α = 0.001 and β = 0.05 we get z_{α/2} = 3.29 and z_β = 1.65. Assuming δ = 1.0 (a 2-fold change) and σ = 0.25 gives n = 12 samples (6 query and 6 control).
(Simon et al., Genetic Epidemiology 23: 21-36, 2002)
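Plugging the quoted numbers into the formula reproduces the n = 12 figure. A quick check (the z values are taken from the slide rather than computed from the normal distribution):

```python
def replicates_needed(z_alpha2, z_beta, delta, sigma):
    """n = 4 (z_{alpha/2} + z_beta)^2 / (delta / (1.4 sigma))^2
    (total samples across both groups, per Simon et al., 2002)."""
    return 4 * (z_alpha2 + z_beta) ** 2 / (delta / (1.4 * sigma)) ** 2

# alpha = 0.001 (two-sided) -> z_{alpha/2} = 3.29; beta = 0.05 -> z_beta = 1.65
n = replicates_needed(3.29, 1.65, delta=1.0, sigma=0.25)  # about 12
```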
23. Some Concepts from Statistics
24. Probability Distributions
- The probability of an event is the likelihood of its occurring.
- It is sometimes computed as a relative frequency: the number of times the event occurred divided by the total number of trials.
- The probability of an event can sometimes be inferred from a theoretical probability distribution, such as a normal distribution.
25. Normal Distribution
26.
- There is less than a 5% chance that the sample with mean s came from Population 1
- s is significantly different from Mean 1 at the p < 0.05 significance level
- But we cannot reject the hypothesis that the sample came from Population 2
27. Probability and Expression Data
- Many biological variables, such as height and weight, can reasonably be assumed to approximate the normal distribution.
- But expression measurements? Probably not.
- Fortunately, many statistical tests are considered fairly robust to violations of the normality assumption, and of the other assumptions used in these tests.
- Randomization / resampling based tests can be used to get around the violation of the normality assumption.
- Even when parametric statistical tests (the ones that make use of normal and other distributions) are valid, randomization tests are still useful.
28. Outline of a Randomisation Test
1. Compute the value of interest (i.e., the test statistic s) from your data set.
2. Make "fake" data sets from your original data, by taking a random sub-sample of the data or by re-arranging the data in a random fashion. Re-compute s from the fake data set.
29. Outline of a Randomisation Test (II)
3. Repeat step 2 many times (often several hundred to several thousand times) and record the fake s values from step 2.
4. Draw inferences about the significance of your original s value by comparing it with the distribution of the randomized (fake) s values.
30. Outline of a Randomisation Test (III)
- Rationale
  - Ideally, we want to know the behavior of the larger population from which the sample is drawn, in order to make statistical inferences.
  - Here, we don't know that the larger population behaves like a normal distribution, or some other idealized distribution. All we have to work with are the data in hand.
  - Our "fake" data sets are our best guess about this behavior (i.e., if we had been pulling data at random from an infinitely large population, we might expect to get a distribution similar to what we get by pulling random sub-samples, or by reshuffling the order of the data in our sample).
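Steps 1-4 can be sketched for one common choice of test statistic, the absolute difference of group means (the statistic and function names here are illustrative choices, not part of the original outline):

```python
import random

def randomization_test(group_a, group_b, n_iter=2000, seed=0):
    """Estimate a p-value by label reshuffling.
    s = |mean(A) - mean(B)|; the p-value is the fraction of reshuffled
    ("fake") data sets whose s is at least as extreme as the original."""
    rng = random.Random(seed)

    def stat(a, b):
        return abs(sum(a) / len(a) - sum(b) / len(b))

    s_orig = stat(group_a, group_b)
    pooled = list(group_a) + list(group_b)
    hits = 0
    for _ in range(n_iter):
        rng.shuffle(pooled)                      # re-arrange at random
        fake_a = pooled[:len(group_a)]
        fake_b = pooled[len(group_a):]
        if stat(fake_a, fake_b) >= s_orig:
            hits += 1
    return hits / n_iter
```

Well separated groups yield a small p-value, while identical groups give p = 1 (every reshuffled statistic is at least as large as s = 0).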
31. The Problem of Multiple Testing (I)
- Let's imagine there are 10,000 genes on a chip, and none of them is differentially expressed.
- Suppose we use a statistical test for differential expression, where we consider a gene to be differentially expressed if it meets the criterion at a p-value of p < 0.05.
32. The Problem of Multiple Testing (II)
- Let's say that applying this test to gene G1 yields a p-value of p = 0.01.
- Remember that a p-value of 0.01 means there is a 1% chance that the gene is not differentially expressed, i.e.,
  - even though we conclude that the gene is differentially expressed (because p < 0.05), there is a 1% chance that our conclusion is wrong.
- We might be willing to live with such a low probability of being wrong
- BUT .....
33. The Problem of Multiple Testing (III)
- We are testing 10,000 genes, not just one!!!
- Even though none of the genes is differentially expressed, about 5% of the genes (i.e., 500 genes) will be erroneously concluded to be differentially expressed, because we have decided to live with a p-value of 0.05.
- If only one gene were being studied, a 5% margin of error might not be a big deal, but 500 false conclusions in one study? That doesn't sound too good.
34. The Problem of Multiple Testing (IV)
- There are tricks we can use to reduce the severity of this problem.
- They all involve slashing the p-value for each test (i.e., gene), so that while the critical p-value for the entire data set might still equal 0.05, each gene will be evaluated at a lower p-value.
- We'll go into some of these techniques later.
35. The Problem of Multiple Testing (V)
- Don't get too hung up on p-values.
- Ultimately, what matters is biological relevance.
- P-values should help you evaluate the strength of the evidence, rather than being used as an absolute yardstick of significance.
- Statistical significance is not necessarily the same as biological significance.
36. Finding Significant Genes
- Assume we will compare two conditions with multiple replicates for each class
- Our goal is to find genes that are significantly different between these classes
- These are the genes that we will use for later data mining
37. Finding Significant Genes (II)
- Average fold change difference for each gene
  - Suffers from being arbitrary and from not taking into account the systematic variation in the data
38. Finding Significant Genes (III)
- t-test for each gene
  - Tests whether the means of the query and reference groups are the same
  - Essentially measures signal-to-noise
  - Calculate the p-value (permutations or distributions)
  - May suffer from intensity-dependent effects
39. T-Tests
40. T-Tests (I)
1. Assign experiments to two groups, e.g., in the expression matrix below, assign Experiments 1, 2 and 5 to group A, and Experiments 3, 4 and 6 to group B.
2. Question: Is the mean expression level of a gene in group A significantly different from its mean expression level in group B?
41. T-Tests (II)
3. Calculate the t-statistic for each gene.
4. Calculate the probability value of the t-statistic for each gene, either from
   A. the theoretical t-distribution, OR
   B. permutation tests.
42. T-Tests (III)
Permutation tests:
i) For each gene, compute the t-statistic.
ii) Randomly shuffle the values of the gene between groups A and B, such that the reshuffled groups A and B respectively have the same number of elements as the original groups A and B.
43. T-Tests (IV)
Permutation tests, continued:
iii) Compute the t-statistic for the randomized gene.
iv) Repeat steps i-iii n times (where n is specified by the user).
v) Let x = the number of times the absolute value of the original t-statistic exceeds the absolute values of the randomized t-statistics over the n randomizations.
vi) Then the p-value associated with the gene = 1 - (x/n).
44. T-Tests (V)
5. Determine whether a gene's expression levels are significantly different between the two groups by one of three methods:
A) Just alpha (a significance level): if the calculated p-value for a gene is less than or equal to the user-input α (the critical p-value), the gene is considered significant.
OR use Bonferroni corrections to reduce the probability of erroneously classifying non-significant genes as significant:
B) Standard Bonferroni correction: the user-input alpha is divided by the total number of genes to give a critical p-value that is used as above, i.e., p_critical = α/N.
45. T-Tests (VI)
C) Adjusted Bonferroni:
i) The t-values for all the genes are ranked in descending order.
ii) For the gene with the highest t-value, the critical p-value is α/N, where N is the total number of genes; for the gene with the second-highest t-value, the critical p-value is α/(N-1), and so on.
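The two corrections can be written down directly; a small sketch (the function names are mine, not part of the slides):

```python
def bonferroni_critical(alpha, n_genes):
    """Standard Bonferroni: one critical p-value shared by all genes."""
    return alpha / n_genes

def adjusted_bonferroni_criticals(alpha, n_genes):
    """Adjusted Bonferroni: with genes ranked by descending t-value, the
    r-th ranked gene (r = 1, 2, ...) is tested at alpha / (N - r + 1)."""
    return [alpha / (n_genes - r + 1) for r in range(1, n_genes + 1)]
```

With α = 0.05 and 10,000 genes, the standard correction tests every gene at 5e-6; the adjusted version relaxes the threshold gene by gene, ending at 0.05 for the lowest-ranked gene.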
46. Finding Significant Genes (IV)
- Significance Analysis of Microarrays (SAM)
  - Uses a modified t-test, estimating and adding a small positive constant to the denominator
  - Significant genes are those which exceed the expected values from the permutation analysis
47. SAM
- SAM can be used to select significant genes based on differential expression between sets of conditions
- Currently implemented for the two-class unpaired design, i.e., we can select genes whose mean expression level is significantly different between two groups of samples (analogous to a t-test).
- Stanford University, Rob Tibshirani: http://www-stat.stanford.edu/~tibs/SAM/index.html
48. SAM
- SAM gives estimates of the False Discovery Rate (FDR), which is the proportion of genes likely to have been wrongly identified by chance as being significant.
- It is a very interactive algorithm: it allows users to dynamically change the thresholds for significance (through the tuning parameter delta) after looking at the distribution of the test statistic.
- The ability to dynamically alter the input parameters based on immediate visual feedback, even before completing the analysis, should make the data-mining process more sensitive.
49. SAM: Two-class
1. Assign experiments to two groups, e.g., in the expression matrix below, Experiments 1, 2 and 5 to group A, and Experiments 3, 4 and 6 to group B.
2. Question: Is the mean expression level of a gene in group A significantly different from its mean expression level in group B?
50. SAM: Two-class
Permutation tests:
i) For each gene, compute the d-value (analogous to the t-statistic). This is the observed d-value for that gene.
ii) Randomly shuffle the values of the gene between groups A and B, such that the reshuffled groups A and B have the same number of elements as the original groups A and B. Compute the d-value for each randomized gene.
51. SAM: Two-class
iii) Repeat step (ii) many times, so that each gene has many randomized d-values. Take the average of the randomized d-values for each gene. This is the expected d-value of that gene.
iv) Plot the observed d-values vs. the expected d-values.
52. SAM: Two-class
The more a gene deviates from the observed = expected line, the more likely it is to be significant. Any gene beyond the first gene in the +ve or -ve direction on the x-axis (including the first gene) whose observed d-value exceeds the expected d-value by at least delta is considered significant.
53. SAM: Two-class
- For each permutation of the data, compute the number of positive and negative significant genes for a given delta. The median number of significant genes from these permutations is the median False Discovery Rate.
- The rationale: any gene designated as significant from the randomized data is being picked up purely by chance (i.e., falsely discovered). Therefore, the median number picked up over many randomisations is a good estimate of the false discovery rate.
54. Finding Significant Genes (V)
Volcano Plots
- Effect vs. significance
- Items that have both a large effect and high significance can be identified easily.
55. Volcano Plots
Using log10 for the Y axis; log2 for the X axis
56. Volcano Plots (II)
Using log10 for the Y axis; log2 for the X axis
57. Finding Significant Genes (VI)
- Analysis of Variance (ANOVA)
  - Which genes are most significant for separating classes of samples?
  - Calculate the p-value (permutations or distributions)
  - Reduces to a t-test for 2 samples
  - May suffer from intensity-dependent effects
58. Multiple Conditions/Experiments
- The goal is to identify genes (or conditions) which have similar patterns of expression
- This is a problem in data mining
- Clustering algorithms are the most widely used approach
- All depend on how one measures distance
59. Pattern analysis
60. Expression Vectors
- Each gene is represented by a vector whose coordinates are its log(ratio) values in each experiment:
  - x = log(ratio)_exp1
  - y = log(ratio)_exp2
  - z = log(ratio)_exp3
  - etc.
61. Expression Vectors
- Each gene is represented by a vector whose coordinates are its log(ratio) values in each experiment:
  - x = log(ratio)_exp1
  - y = log(ratio)_exp2
  - z = log(ratio)_exp3
  - etc.
- For example, if we do six experiments:
  - Gene1 = (-1.2, -0.5, 0, 0.25, 0.75, 1.4)
  - Gene2 = (0.2, -0.5, 1.2, -0.25, -1.0, 1.5)
  - Gene3 = (1.2, 0.5, 0, -0.25, -0.75, -1.4)
  - etc.
62. Expression Matrix
- These gene expression vectors of log(ratio) values can be used to construct an expression matrix:

          Exp 1  Exp 2  Exp 3  Exp 4  Exp 5  Exp 6
  Gene1    -1.2   -0.5    0     0.25   0.75    1.4
  Gene2     0.2   -0.5   1.2   -0.25  -1.0     1.5
  Gene3     1.2    0.5    0    -0.25  -0.75   -1.4

- This is often represented as a red/green colored matrix
63. Expression Matrix
The expression matrix is a representation of data from multiple microarray experiments. Each element is a log ratio, usually log2(Cy5/Cy3).
- Black indicates a log ratio of zero (Cy5 = Cy3)
- Green indicates a negative log ratio (Cy5 < Cy3)
- Red indicates a positive log ratio (Cy5 > Cy3)
- Gray indicates missing data
64. Expression Vectors as Points in Expression Space
65. Distance measures
- Distances are measured between expression vectors
- Distance measures define the way we measure distances
- There are many different ways to measure distance:
  - Euclidean distance
  - Manhattan distance
  - Pearson correlation
  - Spearman correlation
  - etc.
- Each has different properties and can reveal different features of the data
66. Euclidean distance
- Measures the "as-the-crow-flies" distance
- The Euclidean distance between two data points is the square root of the sum of the squares of the differences between corresponding values (Pythagoras' theorem)
67. Manhattan distance
- Computes the distance that would be traveled to get from one data point to the other if a grid-like path is followed
- The Manhattan distance between two items is the sum of the absolute differences of their corresponding components
68. Pearson and Pearson squared
- Pearson correlation measures the similarity in shape between two profiles
- Pearson squared distance measures the similarity in shape between two profiles, but can also capture inverse relationships
69. Spearman Rank Correlation
- Spearman rank correlation measures the correlation between two sequences of values.
- The two sequences are ranked separately and the differences in rank are calculated at each position i:
  ρ = 1 - 6 Σ_i d_i² / (n(n² - 1)),
  where d_i is the difference between the ranks of X_i and Y_i, and X_i and Y_i are the i-th values of sequences X and Y respectively.
- Use Spearman correlation to cluster together genes whose expression profiles have similar shapes or show similar general trends, but whose expression levels may be very different.
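The four distance/similarity measures above can be written compactly. A sketch in pure Python, with no tie handling in the Spearman ranks:

```python
import math

def euclidean(x, y):
    """'As-the-crow-flies' distance (Pythagoras)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def manhattan(x, y):
    """Sum of absolute coordinate differences (grid-like path)."""
    return sum(abs(a - b) for a, b in zip(x, y))

def pearson(x, y):
    """Pearson correlation: similarity in shape between two profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    def ranks(v):
        order = sorted(range(len(v)), key=lambda i: v[i])
        r = [0] * len(v)
        for rank, idx in enumerate(order, start=1):
            r[idx] = rank
        return r
    return pearson(ranks(x), ranks(y))
```

Spearman compares general trends: two profiles with the same ranking get correlation 1 even if their absolute expression levels differ greatly.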
70. Distance Matrix
- Once a distance metric has been selected, the starting point for all clustering methods is a distance matrix:

          Gene1  Gene2  Gene3  Gene4  Gene5  Gene6
  Gene1     0     1.5    1.2    0.25   0.75   1.4
  Gene2    1.5     0     1.3    0.55   2.0    1.5
  Gene3    1.2    1.3     0     1.3    0.75   0.3
  Gene4    0.25   0.55   1.3     0     0.25   0.4
  Gene5    0.75   2.0    0.75   0.25    0     1.2
  Gene6    1.4    1.5    0.3    0.4    1.2     0

- The elements of this matrix are the pair-wise distances (the matrix is symmetric around the diagonal)
71. Hierarchical Clustering
1. Calculate the distance between all genes. Find
the smallest distance. If several pairs share the
same similarity, use a predetermined rule to
decide between alternatives.
2. Fuse the two selected clusters to produce a
new cluster that now contains at least two
objects. Calculate the distance between the new
cluster and all other clusters.
3. Repeat steps 1 and 2 until only a single
cluster remains.
4. Draw a tree representing the results.
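Steps 1-3 above can be sketched as an agglomerative loop over a precomputed distance matrix. This illustrative version takes the linkage rule as a parameter (min for single linkage, max for complete linkage; see the following slides) and returns the tree as nested tuples rather than a drawing:

```python
def hierarchical_cluster(dist, linkage=min):
    """Agglomerative clustering over a precomputed distance matrix.
    linkage=min gives single linkage, linkage=max complete linkage.
    Returns the merge tree as nested tuples of row indices."""
    clusters = {i: (i,) for i in range(len(dist))}   # cluster id -> members
    trees = {i: i for i in range(len(dist))}         # cluster id -> subtree
    next_id = len(dist)

    def cdist(a, b):
        return linkage(dist[i][j] for i in clusters[a] for j in clusters[b])

    while len(clusters) > 1:
        ids = list(clusters)
        # step 1: find the closest pair of clusters (first pair wins ties)
        _, a, b = min(((cdist(p, q), p, q)
                       for n, p in enumerate(ids) for q in ids[n + 1:]),
                      key=lambda t: t[0])
        # step 2: fuse them into a new cluster
        clusters[next_id] = clusters.pop(a) + clusters.pop(b)
        trees[next_id] = (trees.pop(a), trees.pop(b))
        next_id += 1
    return trees[next_id - 1]
```

Taking the first pair on ties plays the role of the "predetermined rule" mentioned in step 1.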
72. Hierarchical Clustering
73. Hierarchical Clustering
74. Hierarchical Tree
75. Agglomerative Linkage Methods
- Linkage methods are rules that determine which elements (clusters) should be linked.
- Three linkage methods are commonly used:
  - Single linkage
  - Average linkage
  - Complete linkage
76. Single Linkage
Cluster-to-cluster distance is defined as the minimum distance between members of one cluster and members of another cluster. Single linkage tends to create elongated clusters with individual genes chained onto clusters.
D_AB = min(d(u_i, v_j)) where u ∈ A and v ∈ B, for all i = 1 to N_A and j = 1 to N_B
77. Average Linkage
Cluster-to-cluster distance is defined as the average distance between all members of one cluster and all members of another cluster. Average linkage has a slight tendency to produce clusters of similar variance.
D_AB = 1/(N_A N_B) Σ_i Σ_j d(u_i, v_j) where u ∈ A and v ∈ B, for all i = 1 to N_A and j = 1 to N_B
78. Complete Linkage
Cluster-to-cluster distance is defined as the maximum distance between members of one cluster and members of the other cluster. Complete linkage tends to create clusters of similar size and variability.
D_AB = max(d(u_i, v_j)) where u ∈ A and v ∈ B, for all i = 1 to N_A and j = 1 to N_B
79. Comparison of Linkage Methods
(Panels: Average, Single, Complete linkage)
80. K-Means/Medians Clustering
1. Specify the number of clusters, K.
2. Randomly assign each gene to one of the K clusters.
81. K-Means/Medians Clustering
3. Calculate the mean/median expression profile of each cluster.
4. Shuffle genes among clusters such that each gene is now in the cluster whose mean expression profile (calculated in step 3) is the closest to that gene's expression profile.
5. Repeat steps 3 and 4 until genes cannot be shuffled around any more, OR a user-specified number of iterations has been reached.
K-means is most useful when the user has an a priori hypothesis about the number of clusters the genes should group into.
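The steps above map directly onto a short K-means sketch (Euclidean distance to the cluster mean; empty clusters are re-seeded with a random data point, a detail the slides do not specify):

```python
import random

def kmeans(vectors, k, n_iter=100, seed=0):
    """K-means: random initial assignment, then alternate between computing
    cluster means (step 3) and re-assigning genes to the nearest mean (step 4)
    until nothing moves or n_iter is reached (step 5)."""
    rng = random.Random(seed)
    assign = [rng.randrange(k) for _ in vectors]          # steps 1-2
    for _ in range(n_iter):
        means = []
        for c in range(k):
            members = [v for v, a in zip(vectors, assign) if a == c]
            if members:
                means.append([sum(col) / len(members) for col in zip(*members)])
            else:
                means.append(list(rng.choice(vectors)))   # re-seed empty cluster

        def closest(v):
            return min(range(k),
                       key=lambda c: sum((x - m) ** 2
                                         for x, m in zip(v, means[c])))

        new_assign = [closest(v) for v in vectors]
        if new_assign == assign:                          # converged
            break
        assign = new_assign
    return assign
```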
82. Clustering Comparison
- MOTIVATION: Using different clustering methods often produces different results. How do these clustering results relate to each other?
- A clustering-comparison method finds a many-to-many correspondence between two different clustering results:
  - comparison of two flat clusterings
  - comparison of a flat and a hierarchical clustering
83. Comparison of flat clusterings
C1 = {A1, A2, A3, A4}
84. Indices to measure the overlap
For two clusters A and B:
- Intersection size: |A ∩ B|
- Simpson's index: |A ∩ B| / min(|A|, |B|)
- Jaccard index: |A ∩ B| / |A ∪ B|
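These three overlap indices, written as simple set operations (a sketch; cluster memberships are plain Python sets):

```python
def intersection_size(a, b):
    """Number of items the two clusters share."""
    return len(set(a) & set(b))

def simpson(a, b):
    """Simpson's index: |A intersect B| / min(|A|, |B|)."""
    return intersection_size(a, b) / min(len(set(a)), len(set(b)))

def jaccard(a, b):
    """Jaccard index: |A intersect B| / |A union B|."""
    return intersection_size(a, b) / len(set(a) | set(b))
```

Simpson's index reaches 1 whenever the smaller cluster is fully contained in the larger one, while Jaccard reaches 1 only for identical clusters.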
85. Comparison of flat and hierarchical clusterings
Selecting a point at which to cut the dendrogram leads to s disjoint groups.
86. Results
- ARTIFICIAL DATA: four data sets with four clusters, constructed with the same four seeds and different levels of noise
- 1000 genes, 10 conditions
- d = 20 initial partitions
89. Visualisation in Expression Profiler
90. Outline
- Missing Value Estimation
- Differentially Expressed Genes
- Clustering Algorithms
- Principal Components Analysis
91. PCA (Dimensionality Reduction Methods)
92. Outline
- Dimensionality problem
- Techniques and methods
  - Multidimensional Scaling
  - Eigenanalysis-based ordination methods
    - Principal Component Analysis (PCA)
    - Correspondence Analysis (CA)
93. Dimensionality problem
- Problem: the "curse of dimensionality"
  - Convergence of any estimator to the true value of a smooth function on a space of high dimension is very slow
  - In other words, we need many observations to obtain a good estimate of gene function
- Blessing: very few things really matter
- Solutions
  - Statistical techniques (corrections, etc.)
  - Reduce dimensionality
    - Ignore non-variable genes
    - Feature subset selection
    - Eliminate coordinates that are less relevant
94. Multidimensional Scaling
Idea: place the data in a low-dimensional space so that similar objects are close to each other.
- The algorithm (roughly):
  1. Assign points to arbitrary coordinates in p-dimensional space.
  2. Compute all-against-all distances, to form a matrix D'.
  3. Compare D' with the input matrix D by evaluating the stress function. The smaller the value, the greater the correspondence between the two.
  4. Adjust the coordinates of each point in the direction that best minimizes the stress.
  5. Repeat steps 2 through 4 until the stress won't get any lower.
- However:
  - Computationally intensive
  - The axes are meaningless and the orientation of the MDS map is arbitrary
  - Difficult to interpret
95. Eigenanalysis Background
Basic concepts: an eigenvalue and eigenvector of a square matrix A are a scalar λ and a nonzero vector x such that Ax = λx.
Q: What is a matrix? A: A linear transformation.
Q: What are eigenvectors? A: The directions in which the transformation acts most strongly.
Exploratory example: EigenExplorer
96. Eigenanalysis Background
Finding eigenvalues: Ax = λx  ⇒  (A - λI)x = 0
- Interpreting eigenvalues
  - The eigenvectors of a matrix provide a rigid rotation onto the directions of highest variance
  - We can pick the N largest eigenvalues, capture a large proportion of the variance, and represent every row of the original matrix as a linear combination of the corresponding eigenvectors, e.g., x_i = a_1 v_1 + ... + a_N v_N
  - The collection a_j is called the "eigengene" or "eigenarray" (depending on which way we compute these)
97. PCA
- PCA simplifies the views of the data.
- Suppose we have measurements for each gene on multiple experiments.
- Suppose some of the experiments are correlated.
- PCA will ignore the redundant experiments, and will take a weighted average of some of the experiments, thus possibly making the trends in the data more interpretable.
- The components can be thought of as axes in n-dimensional space, where n is the number of components. Each axis represents a different trend in the data.
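One standard way to compute the components is an eigendecomposition of the covariance matrix, tying PCA back to the eigenanalysis slides above. A sketch using NumPy, assuming a genes-by-experiments layout:

```python
import numpy as np

def pca_scores(data, n_components=2):
    """Project rows (genes) onto the top principal components.
    Columns (experiments) are centred; the components are the eigenvectors
    of the covariance matrix with the largest eigenvalues."""
    x = np.asarray(data, dtype=float)
    x = x - x.mean(axis=0)                       # centre each experiment
    cov = np.cov(x, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)       # eigh: covariance is symmetric
    order = np.argsort(eigvals)[::-1]            # highest variance first
    return x @ eigvecs[:, order[:n_components]]
```

Note that the sign of each component is arbitrary (an eigenvector remains an eigenvector when negated), which is one reason interpreting the axes is somewhat subjective.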
98. PCA
Data points resolved along 3 principal component axes.
In this example:
- the x-axis could mean a continuum from over- to under-expression
- the y-axis could mean that "blue" genes are over-expressed in the first five experiments and under-expressed in the remaining experiments, while "brown" genes are under-expressed in the first five experiments and over-expressed in the remaining ones
- the z-axis might represent different cyclic patterns, e.g., "red" genes might be over-expressed in odd-numbered experiments and under-expressed in even-numbered ones, whereas the opposite is true for "purple" genes
Interpretation of the components is somewhat subjective.
101. Projecting the data into a lower dimensional space can help visualize relationships
102. Projecting the data into a lower dimensional space can help visualize relationships
103. PCA in Expression Profiler
104. Further Reading
- MDS
  - http://www.analytictech.com/borgatti/mds.htm
- PCA, SVD
  - http://www.statsoftinc.com/textbook/stfacan.html
  - http://linneus20.ethz.ch:8080/2_2_1.html
  - Alter et al., Singular value decomposition for genome-wide expression data processing and modelling, PNAS, 2000
- COA
  - Fellenberg et al., Correspondence analysis applied to microarray data, PNAS, 2001
- General ordination
  - http://www.okstate.edu/artsci/botany/ordinate/
  - Legendre P. and Legendre L., Numerical Ecology, 1998