Microarrays: Common Analysis Approaches

Slides: 105
Provided by: PhilippeR4

Transcript and Presenter's Notes
1
Microarrays: Common Analysis Approaches
2
Outline
  • Missing Value Estimation
  • Differentially Expressed Genes
  • Clustering Algorithms
  • Principal Components Analysis

3
Missing Data Outline
  • Missing data problem, basic concepts and
    terminology
  • Classes of procedures
  • Case deletion
  • Single imputation
  • Filling with zeroes
  • Row averaging
  • SVD imputation
  • KNN imputation
  • Multiple imputation

4
The Missing Data Problem
  • Causes for missing data
  • Low resolution
  • Image corruption
  • Dust/scratched slides
  • Missing measurements
  • Why estimate missing values?
  • Many algorithms cannot deal with missing values
  • Distance measure-dependent algorithms(e.g.,
    clustering, similarity searches)

5
Basic concepts and terminology
Statistical overview (schematic): a population of
complete data, with parameter θ, is sampled; the
missing data mechanism turns the sample of complete
data (estimate θ̂s) into a sample of incomplete data
(estimate θ̂i). We need to estimate θ from the
incomplete data and investigate its performance over
repetitions of the sampling procedure.
6
Basic concepts
Y = sample data
f(Y | θ) = distribution of sample data
θ = parameters to be estimated
R = indicators of whether elements of Y are observed
or missing
g(R | Y) = missing data mechanism (maybe with other
params)
Y = (Yobs, Ymis)
Yobs = observed part of Y
Ymis = missing part of Y
Goal: propose methods to estimate θ from Yobs and
accurately assess its error
7
Basic concepts (cont.)
  • Classes of mechanisms (cf. Rubin, 1976,
    Biometrika)
  • Missing Completely At Random (MCAR)
  • g(R | Y) does not depend on Y
  • Missing At Random (MAR)
  • g(R | Y) may depend on Yobs but not on Ymis
  • Missing Not At Random (MNAR)
  • g(R | Y) depends on Ymis

8
Example
  • Suppose we measure age and income of a collection
    of individuals
  • MCAR
  • The dog ate the response sheets!
  • MAR
  • Probability that the income measurement is
    missing varies according to the age but not
    income
  • MNAR
  • Probability that an income is recorded varies
    according to the income level within each age group

Note: we can disprove MCAR by examining the data,
but we cannot disprove MAR or MNAR.
9
Outline
  • Missing data problem, basic concepts and
    terminology
  • Classes of procedures
  • Case deletion
  • Single imputation
  • Filling with zeroes
  • Row averaging
  • SVD imputation
  • KNN imputation
  • Multiple imputation

10
Classes of procedures Case Deletion
  • Remove subjects with missing values on any item
    needed for analysis
  • Advantages
  • Easy
  • Valid analysis under MCAR
  • OK if proportion of missing cases is small and
    they are not overly influential
  • Disadvantages
  • Can be inefficient, may discard a very high
    proportion of cases (5669 out of 6178 rows
    discarded in Spellman yeast data)
  • May introduce substantial bias, if missing data
    are not MCAR (complete cases may be
    un-representative of the population)

11
Classes of procedures Single Imputation (I)
  • Replace with zeroes
  • Fill-in all missing values with zeroes
  • Advantages
  • Easy
  • Disadvantages
  • Distorts the data disproportionately (changes
    statistical properties)
  • May introduce bias
  • Why zero?

12
Classes of procedures Single Imputation (II)
  • Row averaging
  • Replace missing values by the row average for
    that row
  • Advantages
  • Easy
  • Keeps same mean
  • Disadvantages
  • Distorts distributions and relationships between
    variables

[Scatter plot: illustrates how replacing missing
values with the row average distorts distributions
and relationships between variables]
13
Classes of procedures Single Imputation (III)
  • Hot deck imputation
  • Replace each missing value by a randomly drawn
    observed value
  • Advantages
  • Easy
  • Preserves distributions very well
  • Disadvantages
  • May distort relationships
  • Can use, e.g., similar rows to draw random
    values from (to help constrain distortion)
  • Depends on the definition of "similar"

14
Classes of procedures Single Imputation (IV)
  • Regression imputation
  • Fit regression to observed values, use it to
    obtain predictions for missing ones
  • SVD imputation
  • Fill missing entries with regressed values from a
    set of characteristic patterns, using
    coefficients determined by the proximity of the
    missing row to the patterns
  • KNN imputation (more later)
  • Isolate rows whose values are similar to those of
    the one with missing values (choosing (i)
    similarity measure, and (ii) size of this set)
  • Fill missing values with averages from this set
    of genes, with weights inversely proportional to
    similarities
  • Computationally intensive
  • May distort relationships between variables
    (could use Yimp + random residual)

15
Classes of procedures Multiple Imputation
  • Main Idea
  • Replace Ymis by M > 1 independent draws
  • Y1mis, …, YMmis ~ P(Ymis | Yobs)
  • Produce M different versions of complete data
  • Analyse each one in same fashion and combine
    results at the end, with standard error estimates
    (Rubin, 1987)
  • More difficult to implement
  • Requires (initially) more computations
  • More work involved in interpreting results

16
KNN Imputation
  • Troyanskaya et al., Bioinformatics, 2001
  • The Algorithm
  • 0. Given gene A with missing values
  • Find K other genes with values present in
    experiment 1, with expression most similar to A
    in other experiments
  • Weighted average of values in experiment 1 from
    the K closest genes is used as an estimate for
    the missing value in A
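The procedure above can be sketched in a few lines of Python (a simplified illustration, not the reference KNNimpute implementation; the matrix layout, the use of `None` for missing entries, and the inverse-distance weighting details are assumptions):

```python
import math

def knn_impute(matrix, k=10):
    """Fill each missing entry (None) with a weighted average of the
    values in the same column from the k most similar rows, weights
    inversely proportional to distance (a sketch of Troyanskaya et
    al.'s KNN imputation, not the reference implementation)."""
    def row_distance(a, b):
        shared = [(x, y) for x, y in zip(a, b)
                  if x is not None and y is not None]
        if not shared:
            return None
        # Euclidean distance over the mutually observed columns
        return math.sqrt(sum((x - y) ** 2 for x, y in shared) / len(shared))

    filled = [row[:] for row in matrix]
    for i, row in enumerate(matrix):
        for j, value in enumerate(row):
            if value is not None:
                continue
            # candidate neighbours: other rows observed in column j
            cands = []
            for i2, other in enumerate(matrix):
                if i2 != i and other[j] is not None:
                    d = row_distance(row, other)
                    if d is not None:
                        cands.append((d, other[j]))
            cands.sort(key=lambda t: t[0])
            nearest = cands[:k]
            if not nearest:
                continue  # nothing to impute from; leave as missing
            # inverse-distance weights (epsilon guards against d == 0)
            weights = [1.0 / (d + 1e-9) for d, _ in nearest]
            filled[i][j] = (sum(w * v for w, (_, v) in zip(weights, nearest))
                            / sum(weights))
    return filled
```

With k = 2 on a toy 3×3 matrix, a missing entry in a row nearly identical to another row is filled with a value close to that neighbour's entry.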

17
KNN Imputation Considerations
  • K the number of nearest neighbours
  • Method appears to be relatively insensitive to K
    within the range 10-20
  • Distance metric to be used for computing gene
    similarity
  • Troyanskaya et al.: Euclidean distance is sufficient
  • No clear comparison or reason; one would expect
    that the metric to be used depends on the type of
    experiment
  • Not recommended on matrices with less than four
    columns
  • Computationally intensive!
  • O(m²n) for m genes and n experiments
  • 3.23 minutes on a Pentium III 500 MHz for 6153
    genes, 14 experiments, with 10% of the entries
    missing

18
KNN Imputation Expression Profiler
19
Outline
  • Missing Value Estimation
  • Differentially Expressed Genes
  • Clustering Algorithms
  • Principal Components Analysis

20
Identifying Differentially Expressed Genes
Slides courtesy of John Quackenbush, TIGR
21
Two vs. Multiple conditions
  • Two conditions
    - t-test
    - Significance analysis of microarrays (SAM)
    - Volcano plots
    - ANOVA
  • Multiple conditions
    - Clustering
    - K-means
    - PCA

22
How Many Replicates??
n = 4(zα/2 + zβ)² / (δ / 1.4σ)²
where zα/2 and zβ are normal percentile values at
false positive rate α (Type I error rate) and false
negative rate β (Type II error rate), δ represents
the minimum detectable log₂ ratio and σ represents
the SD of log-ratio values. For α = 0.001 and
β = 0.05, we get zα/2 = -3.29 and zβ = -1.65.
Assume δ = 1.0 (2-fold change) and σ = 0.25,
→ n ≈ 12 samples (6 query and 6 control)
(Simon et al., Genetic Epidemiology 23 21-36,
2002)
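The sample-size calculation on this slide can be checked directly (a quick sketch; the z-values are taken from the slide rather than recomputed from the normal distribution):

```python
# z-values quoted on the slide for alpha = 0.001 (two-sided) and beta = 0.05
z_alpha2, z_beta = -3.29, -1.65
delta, sigma = 1.0, 0.25   # minimum detectable log2 ratio; SD of log ratios

# n = 4(z_alpha/2 + z_beta)^2 / (delta / 1.4*sigma)^2
n = 4 * (z_alpha2 + z_beta) ** 2 / (delta / (1.4 * sigma)) ** 2
print(round(n))  # 12 samples in total (6 query and 6 control)
```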
23
Some Concepts from Statistics
24
Probability Distributions
  • The probability of an event is the likelihood
    of its occurring.
  • It is sometimes computed as a relative frequency
    (rf), where rf = (number of occurrences of the
    event) / (total number of observations)

The probability of an event can sometimes be
inferred from a theoretical probability
distribution, such as a normal distribution.
25
Normal Distribution
26
  • Less than a 5% chance that the sample with mean
    s came from Population 1
  • s is significantly different from Mean 1 at the
    p < 0.05 significance level.
  • But we cannot reject the hypothesis that the
    sample came from Population 2

27
Probability and Expression Data
  • Many biological variables, such as height and
    weight, can reasonably be assumed to approximate
    the normal distribution.
  • But expression measurements? Probably not.
  • Fortunately, many statistical tests are
    considered to be fairly robust to violations of
    the normality assumption, and other assumptions
    used in these tests.
  • Randomization / resampling based tests can be
    used to get around the violation of the normality
    assumption.
  • Even when parametric statistical tests (the ones
    that make use of normal and other distributions)
    are valid, randomization tests are still useful.

28
Outline of a Randomisation Test
1. Compute the value of interest (i.e., the
test-statistic s) from your data set.
2. Make fake data sets from your original
data, by taking a random sub-sample of the data,
or by re-arranging the data in a random fashion.
Re-compute s from the fake data set.
29
Outline of a Randomisation Test (II)
3. Repeat step 2 many times (often several
hundred to several thousand times) and record the
fake s values from step 2.
4. Draw inferences about the significance of your
original s value by comparing it with the
distribution of the randomized (fake) s values.
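The four steps can be sketched as follows, using the absolute difference in group means as the test-statistic s (the choice of statistic, and of shuffling as the re-arrangement method, are illustrative assumptions):

```python
import random

def randomisation_p_value(data_a, data_b, n_iter=2000, seed=0):
    """Steps 1-4 of the outline for a difference-in-means statistic s:
    compare the observed s with its distribution over 'fake' data sets
    made by randomly re-arranging the pooled data."""
    rng = random.Random(seed)

    def s(a, b):  # step 1: the test-statistic of interest
        return abs(sum(a) / len(a) - sum(b) / len(b))

    observed = s(data_a, data_b)
    pooled = list(data_a) + list(data_b)
    n_extreme = 0
    for _ in range(n_iter):
        rng.shuffle(pooled)                 # step 2: make a "fake" data set
        fake_a = pooled[:len(data_a)]
        fake_b = pooled[len(data_a):]
        if s(fake_a, fake_b) >= observed:   # steps 3-4: record and compare
            n_extreme += 1
    return n_extreme / n_iter
```

A small returned value means few random re-arrangements produce a statistic as extreme as the observed one.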
30
Outline of a Randomisation Test (III)
  • Rationale
  • Ideally, we want to know the behavior of the
    larger population from which the sample is drawn,
    in order to make statistical inferences.
  • Here, we don't know that the larger population
    behaves like a normal distribution, or some
    other idealized distribution. All we have to work
    with are the data in hand.
  • Our fake data sets are our best guess about
    this behavior (i.e., if we had been pulling data
    at random from an infinitely large population, we
    might expect to get a distribution similar to
    what we get by pulling random sub-samples, or by
    reshuffling the order of the data in our sample)

31
The Problem of Multiple Testing (I)
  • Let's imagine there are 10,000 genes on a chip,
    and
  • none of them is differentially expressed.
  • Suppose we use a statistical test for
    differential expression, where we consider a gene
    to be differentially expressed if it meets the
    criterion at a p-value of p < 0.05.

32
The Problem of Multiple Testing (II)
  • Let's say that applying this test to gene G1
    yields a p-value of p = 0.01
  • Remember that a p-value of 0.01 means that there
    is a 1% chance that the gene is not
    differentially expressed, i.e.,
  • Even though we conclude that the gene is
    differentially expressed (because p < 0.05),
    there is a 1% chance that our conclusion is
    wrong.
  • We might be willing to live with such a low
    probability of being wrong
  • BUT .....

33
The Problem of Multiple Testing (III)
  • We are testing 10,000 genes, not just one!!!
  • Even though none of the genes is differentially
    expressed, about 5% of the genes (i.e., 500
    genes) will be erroneously concluded to be
    differentially expressed, because we have decided
    to live with a p-value of 0.05
  • If only one gene were being studied, a 5% margin
    of error might not be a big deal, but 500 false
    conclusions in one study? That doesn't sound too
    good.

34
The Problem of Multiple Testing (IV)
  • There are tricks we can use to reduce the
    severity of this problem.
  • They all involve slashing the p-value for each
    test (i.e., gene), so that while the critical
    p-value for the entire data set might still equal
    0.05, each gene will be evaluated at a lower
    p-value.
  • We'll go into some of these techniques later.

35
The Problem of Multiple Testing (V)
  • Don't get too hung up on p-values.
  • Ultimately, what matters is biological relevance.
  • P-values should help you evaluate the strength of
    the evidence, rather than being used as an
    absolute yardstick of significance.
  • Statistical significance is not necessarily the
    same as biological significance.

36
Finding Significant Genes
  • Assume we will compare two conditions with
    multiple replicates for each class
  • Our goal is to find genes that are significantly
    different between these classes
  • These are the genes that we will use for later
    data mining

37
Finding Significant Genes (II)
  • Average Fold Change Difference for each gene
  • suffers from being arbitrary and not taking into
    account systematic variation in the data

38
Finding Significant Genes (III)
  • t-test for each gene
  • Tests whether the means of the query and
    reference groups are the same
  • Essentially measures signal-to-noise
  • Calculate p-value (permutations or distributions)
  • May suffer from intensity-dependent effects

39
T-Tests
40
T-Tests (I)
  • Assign experiments to two groups, e.g., in the
    expression matrix below, assign Experiments 1, 2
    and 5 to group A, and experiments 3, 4 and 6 to
    group B.

2. Question: Is the mean expression level of a gene
in group A significantly different from the mean
expression level in group B?
41
T-Tests (II)
3. Calculate t-statistic for each gene
4. Calculate probability value of the
t-statistic for each gene either from A.
Theoretical t-distribution OR B.
Permutation tests.
42
T-Tests (III)
Permutation tests
i) For each gene, compute t-statistic
ii) Randomly shuffle the values of the gene
between groups A and B, such that the reshuffled
groups A and B respectively have the same number
of elements as the original groups A and B.
43
T-Tests (IV)
Permutation tests - continued
iii) Compute the t-statistic for the randomized
gene. iv) Repeat steps ii-iii n times (where n is
specified by the user). v) Let x = the number of
times the absolute value of the original
t-statistic exceeds the absolute value of the
randomized t-statistic over the n randomizations.
vi) Then, the p-value associated with the gene
= 1 - (x/n)
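Steps (i)-(vi) might look like this in Python (a sketch; the equal-variance form of the t-statistic is an assumption, since the slides do not specify which form is used):

```python
import math
import random

def t_stat(a, b):
    """Two-sample t-statistic (pooled, equal-variance form)."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    va = sum((x - ma) ** 2 for x in a) / (len(a) - 1)
    vb = sum((x - mb) ** 2 for x in b) / (len(b) - 1)
    sp = math.sqrt(((len(a) - 1) * va + (len(b) - 1) * vb)
                   / (len(a) + len(b) - 2))
    return (ma - mb) / (sp * math.sqrt(1 / len(a) + 1 / len(b)))

def permutation_p(a, b, n=1000, seed=0):
    """p = 1 - (x/n), where x counts randomizations in which the
    original |t| exceeds the randomized |t| (steps ii-vi)."""
    rng = random.Random(seed)
    observed = abs(t_stat(a, b))
    pooled = list(a) + list(b)
    x = 0
    for _ in range(n):
        rng.shuffle(pooled)  # step ii: reshuffle values between groups
        if observed > abs(t_stat(pooled[:len(a)], pooled[len(a):])):
            x += 1
    return 1 - x / n
```

Well-separated groups give a small p-value; identical groups give p = 1.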
44
T-Tests (V)
  • 5. Determine whether a gene's expression levels
    are significantly different between the two
    groups by one of three methods
  • A) Just alpha (α, a significance level): If the
    calculated p-value for a gene is less than or
    equal to the user-input α (critical p-value), the
    gene is considered significant.
  • OR
  • Use Bonferroni corrections to reduce the
    probability of erroneously classifying
    non-significant genes as significant.
  • B) Standard Bonferroni correction: The user-input
    alpha is divided by the total number of genes to
    give a critical p-value that is used as above:
    p-critical = α/N.

45
T-Tests (VI)
5C) Adjusted Bonferroni: i) The t-values for
all the genes are ranked in descending order.
ii) For the gene with the highest t-value, the
critical p-value becomes α/N, where N is the
total number of genes; for the gene with the
second-highest t-value, the critical p-value
will be α/(N-1), and so on.
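Both corrections can be sketched as follows (ranking by p-value rather than t-value, which gives the same significance ordering; the per-rank thresholds of the slide are applied literally, without the early stopping used in Holm's step-down procedure):

```python
def bonferroni(p_values, alpha=0.05):
    """Standard Bonferroni: gene i is significant if p_i <= alpha / N."""
    n = len(p_values)
    return [p <= alpha / n for p in p_values]

def adjusted_bonferroni(p_values, alpha=0.05):
    """Adjusted Bonferroni of the slides: the most significant gene is
    tested against alpha/N, the next against alpha/(N-1), and so on."""
    n = len(p_values)
    order = sorted(range(n), key=lambda i: p_values[i])  # most significant first
    significant = [False] * n
    for k, i in enumerate(order):
        significant[i] = p_values[i] <= alpha / (n - k)
    return significant
```

For p-values [0.001, 0.02, 0.5], the standard correction (threshold 0.05/3) accepts only the first gene, while the adjusted version also accepts the second (0.02 ≤ 0.05/2).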
46
Finding Significant Genes (IV)
  • Significance Analysis of Microarrays (SAM)
    - Uses a modified t-test by estimating and adding
      a small positive constant to the denominator
    - Significant genes are those which exceed the
      expected values from permutation analysis.

47
SAM
  • SAM can be used to select significant genes based
    on differential expression between sets of
    conditions
  • Currently implemented for two-class unpaired
    design i.e., we can select genes whose mean
    expression level is significantly different
    between two groups of samples (analogous to
    t-test).
  • Stanford University, Rob Tibshirani
    http://www-stat.stanford.edu/~tibs/SAM/index.html

48
SAM
  • SAM gives estimates of the False Discovery Rate
    (FDR), which is the proportion of genes likely to
    have been wrongly identified by chance as being
    significant.
  • It is a very interactive algorithm that allows
    users to dynamically change thresholds for
    significance (through the tuning parameter delta)
    after looking at the distribution of the test
    statistic.
  • The ability to dynamically alter the input
    parameters based on immediate visual feedback,
    even before completing the analysis, should make
    the data-mining process more sensitive.

49
SAM Two-class
  • Assign experiments to two groups - in the
    expression matrix below, Experiments 1, 2 and 5
    to group A; Experiments 3, 4 and 6 to group B

2. Question: Is the mean expression level of a gene
in group A significantly different from the mean
expression level in group B?
50
SAM Two-class
Permutation tests
i) For each gene, compute d-value (analogous to
t-statistic). This is the observed d-value for
that gene.
ii) Randomly shuffle the values of the gene
between groups A and B, such that the
reshuffled groups A and B have the same
number of elements as the original groups A and
B. Compute the d-value for each randomized
gene
51
SAM Two-class
  • Repeat step (ii) many times, so that each gene
    has many randomized d-values. Take the average of
    the randomized d-values for each gene. This is
    the expected d-value of that gene.
  • Plot the observed d-values vs. the expected
    d-values
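A sketch of the d-value and its permutation-based expectation (the exact form of the denominator and the constant s0 = 0.1 are illustrative assumptions; SAM estimates s0 from the data):

```python
import math
import random

def d_value(a, b, s0=0.1):
    """SAM-style d-statistic: like a t-statistic, but with a small
    positive constant s0 added to the denominator."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    va = sum((x - ma) ** 2 for x in a) / (len(a) - 1)
    vb = sum((x - mb) ** 2 for x in b) / (len(b) - 1)
    s = math.sqrt(va / len(a) + vb / len(b))
    return (ma - mb) / (s + s0)

def expected_d(a, b, n_perm=200, seed=0, s0=0.1):
    """Average d over random reshufflings of the values between groups
    (the 'expected' d-value plotted on the x-axis)."""
    rng = random.Random(seed)
    pooled = list(a) + list(b)
    total = 0.0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        total += d_value(pooled[:len(a)], pooled[len(a):], s0)
    return total / n_perm
```

For a clearly differential gene, the observed d is large while the expected d sits near zero, so the point falls far from the observed = expected line.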

52
SAM Two-class
The more a gene deviates from the observed =
expected line, the more likely it is to be
significant. Any gene beyond the first gene in
the +ve or -ve direction on the x-axis (including
the first gene), whose observed d-value exceeds
the expected by at least delta, is considered
significant.
53
SAM Two-class
  • For each permutation of the data, compute the
    number of positive and negative significant genes
    for a given delta. The median number of
    significant genes from these permutations is the
    median False Discovery Rate.
  • The rationale: any gene designated as significant
    from the randomized data is being picked up
    purely by chance (i.e., falsely discovered).
    Therefore, the median number picked up over many
    randomisations is a good estimate of the false
    discovery rate.

54
Finding Significant Genes (V)
Volcano Plots
  • Effect vs. Significance
  • Selections of items that have both a large effect
    and are highly significant can be identified
    easily.

55
Volcano Plots
Using log10 for Y axis

Using log2 for X axis
56
Volcano Plots (II)
Using log10 for Y axis

Using log2 for X axis
57
Finding Significant Genes (VI)
  • Analysis of Variance (ANOVA)
    - Which genes are most significant for separating
      classes of samples?
    - Calculate p-value (permutations or distributions)
    - Reduces to a t-test for 2 samples
    - May suffer from intensity-dependent effects

58
Multiple Conditions/Experiments
  • Goal is to identify genes (or conditions) which
    have similar patterns of expression
  • This is a problem in data mining
  • Clustering Algorithms are most widely used
  • All depend on how one measures distance

59
Pattern analysis
60
Expression Vectors
  • Each gene is represented by a vector whose
    coordinates are its log(ratio) values in each
    experiment
  • - x = log(ratio) in exp 1
  • - y = log(ratio) in exp 2
  • - z = log(ratio) in exp 3
  • - etc.

61
Expression Vectors
  • Each gene is represented by a vector whose
    coordinates are its log(ratio) values in each
    experiment
  • - x = log(ratio) in exp 1
  • - y = log(ratio) in exp 2
  • - z = log(ratio) in exp 3
  • - etc.
  • For example, if we do six experiments,
  • - Gene1 = (-1.2, -0.5, 0, 0.25, 0.75, 1.4)
  • - Gene2 = (0.2, -0.5, 1.2, -0.25, -1.0, 1.5)
  • - Gene3 = (1.2, 0.5, 0, -0.25, -0.75, -1.4)
  • - etc.

62
Expression Matrix
Exp 1 Exp 2 Exp 3 Exp 4 Exp 5 Exp 6
  • These gene expression vectors of log(ratio)
    values can be used to construct an expression
    matrix
  • Gene1 -1.2 -0.5 0 0.25 0.75 1.4
  • Gene2 0.2 -0.5 1.2 -0.25 -1.0 1.5
  • Gene3 1.2 0.5 0 -0.25 -0.75 -1.4
  • This is often represented as a red/green colored
    matrix

63
Expression Matrix
The Expression Matrix is a representation of data
from multiple microarray experiments.
Each element is a log ratio, usually log 2
(Cy5/Cy3)
Black indicates a log ratio of zero ( Cy5 = Cy3 )
Green indicates a negative log ratio ( Cy5 < Cy3 )
Gray indicates missing data
Red indicates a positive log ratio ( Cy5 > Cy3 )
64
Expression Vectors as points inExpression Space
65
Distance measures
  • Distances are measured between expression
    vectors
  • Distance measures define the way we measure
    distances
  • Many different ways to measure distance
  • - Euclidean distance- Manhattan distance
  • - Pearson correlation- Spearman correlation
  • - etc.
  • Each has different properties and can reveal
    different features of the data

66
Euclidean distance
  • Measures the 'as-the-crow-flies' distance
  • Deriving the Euclidean distance between two data
    points involves computing the square root of
    the sum of the squares of the differences
    between corresponding values ( Pythagoras
    theorem )

67
Manhattan distance
  • Computes the distance that would be traveled to
    get from one data point to the other if a
    grid-like path is followed
  • Manhattan distance between two items is the sum
    of the absolute differences of their
    corresponding components

68
Pearson and Pearson squared
  • Pearson Correlation measures the similarity in
    shape between two profiles
  • Pearson Squared distance measures the similarity
    in shape between two profiles, but can also
    capture inverse relationships

69
Spearman Rank Correlation
  • Spearman Rank Correlation measures the
    correlation between two sequences of values.
  • The two sequences are ranked separately and the
    differences in rank are calculated at each
    position i.
  • Use Spearman Correlation to cluster together
    genes whose expression profiles have similar
    shapes or show similar general trends, but
    whose expression levels may be very different

rs = 1 - 6 Σ di² / (n(n² - 1)), where di is the
difference between the ranks of Xi and Yi, the ith
values of sequences X and Y respectively
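The four distance measures above can be sketched as follows (ties are given arbitrary rank order in this simplified Spearman; a full implementation would average tied ranks):

```python
import math

def euclidean(x, y):
    # square root of the sum of squared differences (Pythagoras)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def manhattan(x, y):
    # sum of absolute differences of corresponding components
    return sum(abs(a - b) for a, b in zip(x, y))

def pearson(x, y):
    # similarity in shape: covariance scaled by both standard deviations
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def spearman(x, y):
    # rank each sequence separately, then correlate the ranks
    def ranks(v):
        order = sorted(range(len(v)), key=lambda i: v[i])
        r = [0] * len(v)
        for rank, i in enumerate(order):
            r[i] = rank + 1
        return r
    return pearson(ranks(x), ranks(y))
```

Note how Spearman gives a perfect correlation to any two monotonically related profiles, however different their magnitudes.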
70
Distance Matrix
Gene1 Gene2 Gene3 Gene4 Gene5 Gene6
  • Once a distance metric has been selected, the
    starting point for all clustering methods is a
    distance matrix
  • Gene1 0 1.5 1.2 0.25 0.75 1.4
  • Gene2 1.5 0 1.3 0.55 2.0 1.5
  • Gene3 1.2 1.3 0 1.3 0.75 0.3
  • Gene4 0.25 0.55 1.3 0 0.25 0.4
  • Gene5 0.75 2.0 0.75 0.25 0 1.2
  • Gene6 1.4 1.5 0.3 0.4 1.2 0
  • The elements of this matrix are the pair-wise
    distances. ( matrix is symmetric around the
    diagonal )

71
Hierarchical Clustering
1. Calculate the distance between all genes. Find
the smallest distance. If several pairs share the
same similarity, use a predetermined rule to
decide between alternatives.
2. Fuse the two selected clusters to produce a
new cluster that now contains at least two
objects. Calculate the distance between the new
cluster and all other clusters.
3. Repeat steps 1 and 2 until only a single
cluster remains.
4. Draw a tree representing the results.
72
Hierarchical Clustering
73
Hierarchical Clustering
74
Hierarchical Tree
75
Agglomerative Linkage Methods
  • Linkage methods are rules that determine which
    elements (clusters) should be linked.
  • Three linkage methods that are commonly used
    - Single Linkage - Average Linkage -
    Complete Linkage

76
Single Linkage
Cluster-to-cluster distance is defined as the
minimum distance between members of one cluster
and members of another cluster. Single linkage
tends to create elongated clusters with
individual genes chained onto clusters.
DAB = min( d(ui, vj) ) where u ∈ A and v ∈ B,
for all i = 1 to NA and j = 1 to NB
77
Average Linkage
Cluster-to-cluster distance is defined as the
average distance between all members of one
cluster and all members of another cluster.
Average linkage has a slight tendency to produce
clusters of similar variance.
DAB = 1/(NA·NB) Σ Σ ( d(ui, vj) ) where u ∈ A and
v ∈ B, for all i = 1 to NA and j = 1 to NB
78
Complete Linkage
Cluster-to-cluster distance is defined as the
maximum distance between members of one cluster
and members of another cluster. Complete
linkage tends to create clusters of similar size
and variability.
DAB = max( d(ui, vj) ) where u ∈ A and v ∈ B,
for all i = 1 to NA and j = 1 to NB
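The three linkage rules can be written directly from the formulas (here d is any pairwise distance function and clusters are lists of items):

```python
def single_linkage(d, A, B):
    """D_AB = min over u in A, v in B of d(u, v)."""
    return min(d(u, v) for u in A for v in B)

def average_linkage(d, A, B):
    """D_AB = (1 / (N_A * N_B)) * sum of d(u, v) over all pairs."""
    return sum(d(u, v) for u in A for v in B) / (len(A) * len(B))

def complete_linkage(d, A, B):
    """D_AB = max over u in A, v in B of d(u, v)."""
    return max(d(u, v) for u in A for v in B)
```

For 1-D clusters A = [0, 1] and B = [4, 6] with d(u, v) = |u - v|, single linkage gives 3, average linkage 4.5, and complete linkage 6.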
79
Comparison of Linkage Methods
Average
Single
Complete
80
K-Means/Medians Clustering
81
K-Means/Medians Clustering
3. Calculate mean/median expression profile of
each cluster
4. Shuffle genes among clusters such that each
gene is now in the cluster whose mean expression
profile (calculated in step 3) is the closest to
that gene's expression profile
5. Repeat steps 3 and 4 until genes cannot be
shuffled around any more, OR a user-specified
number of iterations has been reached
K-Means is most useful when the user has an a
priori hypothesis about the number of clusters
the genes should group into.
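Steps 3-5 can be sketched as follows (the random initial partition, the use of means rather than medians, and squared Euclidean distance are assumptions; steps 1-2 of the full algorithm are not shown on this slide):

```python
import random

def k_means(vectors, k, n_iter=100, seed=0):
    """Steps 3-5 of the slides, starting from a random partition."""
    rng = random.Random(seed)
    assignment = [rng.randrange(k) for _ in vectors]
    for _ in range(n_iter):
        # step 3: mean expression profile of each cluster
        centroids = []
        for c in range(k):
            members = [v for v, a in zip(vectors, assignment) if a == c]
            if not members:                      # re-seed an empty cluster
                members = [rng.choice(vectors)]
            centroids.append([sum(col) / len(members)
                              for col in zip(*members)])
        # step 4: move each gene to the cluster with the closest mean profile
        def closest(v):
            return min(range(k),
                       key=lambda c: sum((x - m) ** 2
                                         for x, m in zip(v, centroids[c])))
        new_assignment = [closest(v) for v in vectors]
        # step 5: stop when no gene can be shuffled any more
        if new_assignment == assignment:
            break
        assignment = new_assignment
    return assignment
```

On two well-separated blobs of points, the returned assignment puts each blob in its own cluster.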
82
Clustering Comparison
  • MOTIVATION Using different clustering methods
    often produces different results. How do these
    clustering results relate to each other?
  • → A clustering comparison method that finds a
    many-to-many correspondence between two different
    clustering results.
  • comparison of two flat clusterings
  • comparison of a flat and a hierarchical
    clustering.

83
Comparison of flat clusterings
C1 = {A1, A2, A3, A4}
84
Indices to measure the overlapping
  • Intersection size
  • Simpson's index
  • Jaccard index
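The three overlap indices, under their standard definitions (the slides do not give formulas, so these are assumptions): intersection size |A∩B|, Simpson's |A∩B| / min(|A|, |B|), and Jaccard |A∩B| / |A∪B|:

```python
def intersection_size(a, b):
    # number of items the two clusters share
    return len(set(a) & set(b))

def simpson(a, b):
    # overlap relative to the smaller cluster
    return intersection_size(a, b) / min(len(set(a)), len(set(b)))

def jaccard(a, b):
    # overlap relative to the union of both clusters
    return intersection_size(a, b) / len(set(a) | set(b))
```

For clusters {1, 2, 3, 4} and {3, 4, 5}, the intersection size is 2, Simpson's index is 2/3 and the Jaccard index is 2/5.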

85
Comparison of flat and hierarchical clusterings
Selecting a point to cut the dendrogram leads to s
disjoint groups.
86
Results
  • ARTIFICIAL DATA Four data sets with four
    clusters, constructed with the same four seeds
    and different levels of noise.
  • 1000 genes, 10 conditions
  • d = 20 initial partitions

87
(No Transcript)
88
(No Transcript)
89
Visualisation in Expression Profiler
90
Outline
  • Missing Value Estimation
  • Differentially Expressed Genes
  • Clustering Algorithms
  • Principal Components Analysis

91
PCA (Dimensionality Reduction Methods)
92
Outline
  • Dimensionality Problem
  • Techniques Methods
  • Multidimensional Scaling
  • Eigenanalysis-based ordination methods
  • Principal Component Analysis (PCA)
  • Correspondence Analysis (CA)

93
Dimensionality problem
  • Problem?
  • Curse of dimensionality
  • Convergence of any estimator to the true value of
    a smooth function on a space of high dimension is
    very slow
  • In other words, need many observations to obtain
    a good estimate of gene function
  • Blessing? Very few things really matter
  • Solutions
  • Statistical techniques (corrections, etc.)
  • Reduce dimensionality
  • Ignore non-variable genes
  • Feature subset selection
  • Eliminate coordinates that are less relevant

94
Multidimensional Scaling
Idea place data in a low-dimensional space so
that similar objects are close to each other.
  • The Algorithm (roughly)
  • Assign points to arbitrary coordinates in
    p-dimensional space.
  • Compute all-against-all distances in this space,
    to form a matrix D′.
  • Compare D′ with the input matrix D by evaluating
    the stress function. The smaller the value, the
    greater the correspondence between the two.
  • Adjust coordinates of each point in the direction
    that best minimizes stress.
  • Repeat steps 2 through 4 until stress won't get
    any lower.
  • However
  • Computationally intensive
  • Axes are meaningless, orientation of the MDS map
    is arbitrary
  • Difficult to interpret
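Step 3's stress function can be sketched as Kruskal's stress-1 (one common choice; the slides do not specify which stress formulation is meant):

```python
import math

def stress(D_input, coords):
    """Kruskal's stress-1 between the input distance matrix D and the
    distances realised by the current low-dimensional coordinates.
    Zero means the configuration reproduces D exactly."""
    def dist(p, q):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

    num, den = 0.0, 0.0
    n = len(coords)
    for i in range(n):
        for j in range(i + 1, n):
            d_hat = dist(coords[i], coords[j])   # distance in the MDS map
            num += (D_input[i][j] - d_hat) ** 2  # disagreement with input D
            den += D_input[i][j] ** 2
    return math.sqrt(num / den)
```

A configuration that realises the input distances exactly scores 0; step 4 of the algorithm would nudge coordinates to push this value down.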

95
Eigenanalysis Background
Basic Concepts: An eigenvalue and eigenvector of a
square matrix A are a scalar λ and a nonzero
vector x such that Ax = λx
Q: What is a matrix? A: A linear
transformation. Q: What are eigenvectors? A:
Directions in which the transformation takes
place the most. Exploratory example:
EigenExplorer
96
Eigenanalysis Background
Finding eigenvalues: Ax = λx ⇒ (A - λI)x = 0
  • Interpreting eigenvalues
  • Eigenvectors of a matrix provide a rigid rotation
    towards the directions of highest variance
  • Can pick the N largest eigenvalues, capture a
    large proportion of the variance and represent
    every value in the original matrix as a linear
    combination of these values, e.g., xi = a1λ1 +
    . . . + aNλN
  • Call this collection {aj} the eigengene/eigenarray
    (depending on which way we compute these)

97
PCA
  • 1. PCA simplifies the views of the data.
  • 2. Suppose we have measurements for each gene on
    multiple experiments.
  • 3. Suppose some of the experiments are correlated.
  • 4. PCA will ignore the redundant experiments, and
    will take a weighted average of some of the
    experiments, thus possibly making the trends in
    the data more interpretable.
  • 5. The components can be thought of as axes in
    n-dimensional space, where n is the number of
    components. Each axis represents a different
    trend in the data.
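A minimal sketch of finding the first such axis (power iteration on the covariance matrix is one of several ways to do this, and not necessarily how any particular package implements PCA):

```python
def first_principal_component(data, n_iter=100):
    """Mean-centre the points (rows) and extract the leading eigenvector
    of their covariance matrix by power iteration; projecting each point
    onto it gives its score on the first component."""
    n, p = len(data), len(data[0])
    means = [sum(row[j] for row in data) / n for j in range(p)]
    centred = [[row[j] - means[j] for j in range(p)] for row in data]
    # p x p covariance matrix of the centred data
    cov = [[sum(r[a] * r[b] for r in centred) / (n - 1)
            for b in range(p)] for a in range(p)]
    v = [1.0] * p                       # arbitrary starting vector
    for _ in range(n_iter):
        w = [sum(cov[a][b] * v[b] for b in range(p)) for a in range(p)]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]       # Av/|Av| converges to the top eigenvector
    scores = [sum(c * vi for c, vi in zip(row, v)) for row in centred]
    return v, scores
```

For points lying along the line y = 2x, the recovered axis points in the (1, 2) direction and the scores order the points along that trend.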

98
PCA
Data points resolved along 3 principal component
axes.
In this example:
x-axis could mean a continuum from over- to
under-expression;
y-axis could mean that "blue" genes are
over-expressed in the first five expts and
under-expressed in the remaining expts, while
"brown" genes are under-expressed in the first
five expts, and over-expressed in the remaining
expts;
z-axis might represent different cyclic patterns,
e.g., "red" genes might be over-expressed in
odd-numbered expts and under-expressed in
even-numbered ones, whereas the opposite is true
for "purple" genes.
Interpretation of components is somewhat
subjective.
99
(No Transcript)
100
(No Transcript)
101
Projecting the data into a lower dimensional
space can help visualize relationships
102
Projecting the data into a lower dimensional
space can help visualize relationships
103
PCA in Expression Profiler
104
Further Reading
  • MDS
  • http://www.analytictech.com/borgatti/mds.htm
  • PCA, SVD
  • http://www.statsoftinc.com/textbook/stfacan.html
  • http://linneus20.ethz.ch:8080/2_2_1.html
  • Alter et al., Singular value decomposition for
    genome-wide expression data processing and
    modelling, PNAS, 2000
  • COA
  • Fellenberg et al., Correspondence analysis
    applied to microarray data, PNAS, 2001
  • General ordination
  • http://www.okstate.edu/artsci/botany/ordinate/
  • Legendre P. and Legendre L., Numerical Ecology,
    1998