Title: Quick R Tips
1Quick R Tips
- How to find out what packages are available
- library()
- How to find out what packages are actually
installed locally - (.packages())
2Biological question Differentially expressed
genes Sample class prediction etc.
Experimental design
Microarray experiment
Image analysis
Normalization
Estimation
Testing
Clustering
Discrimination
Biological verification and interpretation
3Microarray Data Flow
Microarray experiment
Unsupervised Analysis clustering
Image Analysis
Database
Supervised Analysis
Data Selection Missing value estimation
Normalization Centering
Networks Data Integration
Data Matrix
Decomposition techniques
4A note about Affymetrix (1-color) pre-processing
cross-chip
sequence specific background correction
within-chip
within-probe set aggregation of intensity values
- Two standard methods
- MAS 5.0 (now GCOS/GDAS) by Affymetrix (compares
PM and MM probes) - RMA by Speed group (UC Berkeley) (ignores MM
probes)
5Why normalize?
- Microarray data have significant systematic
variation both within arrays and between arrays
that is not true biological variation - Accurate comparison of genes relative
expression within and across conditions requires
normalization of effects - Sources of variation
- Spatial location on the array
- Dye biases which vary with spot intensity
- Plate origin
- Printing/spotting quality
- Experimenter
6Normalization Thoughts
- There are many different ways to normalize data
- Global median, LOWESS, LOESS, RMA etc
- By print tip, spatial, etc
- BUT dont expect it to fix bad data!
- Wont make up for lack of replicates
- Wont make up for horrible slides
7Create a boxplot of the normalized
data boxplot(mydata-1, main "Normalized
Intensities", xlab"Array", ylab"Intensities",
col"blue") To save the boxplot as a jpeg
file jpeg("normal_boxplot.jpg") boxplot(mydata-1
, main "Normalized Intensities", xlab"Array",
ylab"Intensities", col"blue") dev.off()
8Microarray Data Analysis(slides used with
permission of Dr. John Quackenbush, Dana Farber
creator of MeV software )http//www.tm4.org/mev/
9- Classification
- Hierarchical clustering
- K-means clustering
- Coherence
- PCA
- Relevance Network
- Differential gene expression
- T-test
- Analysis of Variance
- Significance of Microarray (SAM)
10Hierarchical Clustering
- A type of cluster analysis
- There is both divisive and agglomerative
HCagglomerative is most commonly used - Group objects that are close to one another
based on some distance/similarity metric - Clusters are created and linked based on a metric
that evaluates the cluster-to-cluster distance - Results are displayed as a dendrogram
11Hierarchical Clustering
g1 is most like g8
g4 is most like g1, g8
(HCL-2)
12Hierarchical Clustering
g5 is most like g7
g5,g7 is most like g1, g4, g8
(HCL-3)
13Hierarchical Tree
(HCL-4)
14Hierarchical Clustering
During construction of the hierarchy, decisions
must be made to determine which clusters should
be joined. The distance or similarity between
clusters must be calculated. The rules that
govern this calculation are linkage methods.
(HCL-5)
15Agglomerative Linkage Methods
- Linkage methods are rules or metrics that return
a value that can be used to determine which
elements (clusters) should be linked. - Three linkage methods that are commonly used are
- Single Linkage
- Average Linkage
- Complete Linkage
(HCL-6)
16Comparison of Linkage Methods
(HCL-10)
17Bootstrapping (ST)
Bootstrapping resampling with replacement
Original expression matrix
Various bootstrapped matrices (by experiments)
Gene 1
Gene 2
Gene 3
Gene 4
Gene 5
Gene 6
18Jackknifing (ST)
Jackknifing resampling without replacement
Original expression matrix
Various jackknifed matrices (by experiments)
19Analysis of Bootstrapped and Jackknifed Support
Trees
- Bootstrapped or jackknifed expression matrices
are created many times by randomly resampling the
original expression matrix, using either the
bootstrap or jackknife procedure. - Each time, hierarchical trees are created from
the resampled matrices. - The trees are compared to the tree obtained from
the original data set. - The more frequently a given cluster from the
original tree is found in the resampled trees,
the stronger the support for the cluster. - As each resampled matrix lacks some of the
original data, high support for a cluster means
that the clustering is not biased by a small
subset of the data.
20Hierarchical Clustering in R
21Step 1 Data matrix
- First you need a numeric matrix
- Typical array data set will have samples as
columns and genes as rows - We want to be sure our data are in the form of an
expression matrix - Use Biobase library/package
- See http//www.bioconductor.org/packages/2.2/bioc/
vignettes/Biobase/inst/doc/ExpressionSetIntroducti
on.pdf - gt exprslt-as.matrix(data, headerTRUE, sep"\t",
row.names1, as.isTRUE)
22Step 2 Calculate Distance Matrix
- Default dist() method in R uses rows as the
vectors..but we want the distance between
samples.i.e., the columns of our matrix. - There is a handy package to help us at MD
Anderson called oompaBase - source("http//bioinformatics.mdanderson.org/OOMPA
/oompaLite.R") - oompaLite()
- oompainstall(groupName"all")
- Once installed, be sure to locally activate the
libraries - library(oompaBase)
- library(ClassDiscovery)
- library(ClassComparison)
- oompaBase also requires the mclust and cobs
packagesdownload these from CRAN
23- Use the function distanceMatrix() to create a
distance matrix of your samples. - Uses the expression set created in Step 1 as
input - Remember that there are many different types of
distance metrics to choose from! - See help(distanceMatrix)
- xlt- distanceMatrix(exprs,'pearson')
24Step 3 Cluster
- Use the hclust() function to create a
hierarchical cluster based on your distance
matrix, x, created in Step 2. - gt ylt-hclust(x,method"complete")
- gt plot(y)
25Testing for Differential Gene Expression with the
T-test
26- Get the multtest package from CRAN
- Package contains data from the Golub leukemia
microarray data set (ALL v AML) - 38 arrays
- 27 from lymphoblastic
- 11 from myeloid
http//people.cryst.bbk.ac.uk/wernisch/macourse/
27- library(multtest)
- data(golub)
- golub.cl
- Generate the T statistic
- teststat lt-mt.teststat(golub, golub.cl)
- Convert into P-values
- rawp0 lt-2pt(abs(teststat),lower.tailF, df38-2)
- Correct for multiple testing and show the ten
most significant genes - procs lt-c(Bonferroni, BH)
- reslt-mt.rawp2adjp((rawp0), procs)
- resadjp110,
http//people.cryst.bbk.ac.uk/wernisch/macourse/
28K-Means / K-Medians Clustering (KMC) 1
1. Specify number of clusters, e.g., 5.
2. Randomly assign genes to clusters.
29K-Means Clustering 2
3. Calculate mean / median expression profile of
each cluster.
4. Shuffle genes among clusters such that each
gene is now in the cluster whose mean / median
expression profile (calculated in step 3) is the
closest to that genes expression profile.
5. Repeat steps 3 and 4 until genes cannot be
shuffled around any more, OR a user-specified
number of iterations has been reached.
K-Means / K-Medians is most useful when the user
has an a-priori hypothesis about the number of
clusters the genes should group into.
30Principal Components (PCAG and PCAE) 1
- PCA simplifies the views of the data.
- Suppose we have measurements for each gene on
multiple - experiments.
- Suppose some of the experiments are correlated.
- PCA will ignore the redundant experiments, and
will take a - weighted average of some of the experiments, thus
possibly making - the trends in the data more interpretable.
- 5. The components can be thought of as axes in
n-dimensional - space, where n is the number of components. Each
axis represents a - different trend in the data.
31PCAG and PCAE - 2
In this example, x-axis could mean a continuum
from over-to under-expression (blue and
green genes over-expressed, yellow genes
under-expressed) y-axis could mean that gray
genes are over-expressed in first five expts and
under expressed in The remaining expts, while
brown genes are under-expressed in the first
five expts, and over-expressed in the remaining
expts. z-axis might represent different cyclic
patterns, e.g., red genes might be
over-expressed in odd-numbered expts and
under-expressed in even-numbered ones, whereas
the opposite is true for purple
genes. Interpretation of components is somewhat
subjective.
32Relevance Networks
Set of genes whose expression profiles are
predictive of one another.
Can be used to identify negative correlations
between genes
Genes with low entropy (least variable across
experiments) are excluded from analysis.
33Relevance Networks
Tmin 0.50
The expression pattern of each gene compared to
that of every other gene.
The remaining relationships between genes define
the subnets
Tmax 0.90
Correlation coefficients outside the boundaries
defined by the minimum and maximum thresholds are
eliminated.
The ability of each gene to predict the
expression of each other gene is assigned a
correlation coefficient
34T-Tests (TTEST) Between subjects (or unpaired)
- 1
- Assign experiments to two groups, e.g., in the
expression matrix - below, assign Experiments 1, 2 and 5 to group A,
and - experiments 3, 4 and 6 to group B.
2. Question Is mean expression level of a gene
in group A significantly different from mean
expression level in group B?
35TTEST Between subjects - 2
3. Calculate t-statistic for each gene
4. Calculate probability value of the t-statistic
for each gene either from A. Theoretical
t-distribution OR B. Permutation tests.
36TTEST - Between subjects - 3
Permutation tests
i) For each gene, compute t-statistic
ii) Randomly shuffle the values of the gene
between groups A and B, such that the reshuffled
groups A and B respectively have the same number
of elements as the original groups A and B.
Original grouping
Randomized grouping
37TTEST - Between subjects - 4
Permutation tests - continued
iii) Compute t-statistic for the randomized
gene iv) Repeat steps i-iii n times (where n is
specified by the user). v) Let x the number of
times the absolute value of the original
t-statistic exceeds the absolute values of the
randomized t-statistic over n randomizations. vi
) Then, the p-value associated with the gene 1
(x/n)
38TTEST - Between subjects - 5
- 5. Determine whether a genes expression levels
are significantly - different between the two groups by one of three
methods - Just alpha If the calculated p-value for a gene
is less than - or equal to the user-input alpha (critical
p-value), the gene is - considered significant.
-
- OR
- Use Bonferroni corrections to reduce the
probability of - erroneously classifying non-significant genes as
significant. - B) Standard Bonferroni correction The user-input
alpha is divided - by the total number of genes to give a critical
p-value that is used - as above.
39TTEST - Between subjects 6
5C) Adjusted Bonferroni i) The t-values for
all the genes are ranked in descending order.
ii) For the gene with the highest t-value, the
critical p-value becomes (alpha / N), where N is
the total number of genes for the gene with the
second-highest t-value, the critical p-value will
be (alpha/ N-1), and so on.
40- The problem of multiple testing
- (adapted from presentation by Anja von
Heydebreck, MaxPlanckInstitute for Molecular
Genetics, - Dept. Computational Molecular Biology, Berlin,
Germany - http//www.bioconductor.org/workshops/Heidelberg02
/mult.pdf) - Lets imagine there are 10,000 genes on a chip,
AND - None of them is differentially expressed.
- Suppose we use a statistical test for
differential - expression, where we consider a gene to be
differentially expressed if it meets the
criterion at a - p-value of p lt 0.05.
41- The problem of multiple testing 2
- Lets say that applying this test to gene G1
yields a p-value of p 0.01 - Remember that a p-value of 0.01 means that there
is a 1 chance that the gene is not
differentially expressed, i.e., - Even though we conclude that the gene is
differentially expressed (because p lt 0.05),
there is a 1 chance that our conclusion is
wrong. - We might be willing to live with such a low
probability - of being wrong
- BUT .....
42- The problem of multiple testing 3
- We are testing 10,000 genes, not just one!!!
- Even though none of the genes is differentially
expressed, about 5 of the genes (i.e., 500
genes) will be erroneously concluded to be
differentially expressed, because we have decided
to live with a p-value of 0.05 - If only one gene were being studied, a 5 margin
of error might not be a big deal, but 500 false
conclusions in one study? That doesnt sound too
good.
43- The problem of multiple testing - 4
- There are tricks we can use to reduce the
severity of - this problem.
- They all involve slashing the p-value for each
test - (i.e., gene), so that while the critical p-value
for the entire - data set might still equal 0.05, each gene will
be - evaluated at a lower p-value.
44- Dont get too hung up on p-values.
- Ultimately, what matters is biological
relevance. - P-values should help you evaluate the strength of
the - evidence, rather than being used as an absolute
yardstick - of significance. Statistical significance is not
necessarily - the same as biological significance.
45Significance analysis of microarrays (SAM)
- SAM can be used to pick out significant genes
based on differential expression between sets of
samples.
46SAM -2
- SAM gives estimates of the False Discovery Rate
(FDR), which is the proportion of genes likely to
have been wrongly identified by chance as being
significant. - It is a very interactive algorithm allows users
to dynamically change thresholds for significance
(through the tuning parameter delta) after
looking at the distribution of the test
statistic. - The ability to dynamically alter the input
parameters based on immediate visual feedback,
even before completing the analysis, should make
the data-mining process more sensitive.
47SAM designs
- Two-class unpaired to pick out genes whose mean
expression level is significantly different
between two groups of samples (analogous to
between subjects t-test). - Two-class paired samples are split into two
groups, and there is a 1-to-1 correspondence
between an sample in group A and one in group B
(analogous to paired t-test).
48SAM designs - 2
- Multi-class picks up genes whose mean expression
is different across gt 2 groups of samples
(analogous to one-way ANOVA) - Censored survival picks up genes whose
expression levels are correlated with duration of
survival. - One-class picks up genes whose mean expression
across experiments is different from a
user-specified mean.
49SAM Two-Class Unpaired 4