Quick R Tips - PowerPoint PPT Presentation

1 / 49

About This Presentation

Title:

Quick R Tips

Description:

Title: PowerPoint Presentation Author: nbhagaba Last modified by: Carol Bult Created Date: 7/8/2002 2:59:22 PM Document presentation format: On-screen Show (4:3) – PowerPoint PPT presentation

Number of Views:123

Avg rating:3.0/5.0

Slides: 50

Provided by: nbh9

Learn more at: https://www.informatics.jax.org

Category:

more less

Transcript and Presenter's Notes

Title: Quick R Tips

1
Quick R Tips

How to find out what packages are available
library()
How to find out what packages are actually
installed locally
(.packages())

2
Biological question Differentially expressed
genes Sample class prediction etc.
Experimental design
Microarray experiment
Image analysis
Normalization
Estimation
Testing
Clustering
Discrimination
Biological verification and interpretation
3
Microarray Data Flow
Microarray experiment
Unsupervised Analysis clustering
Image Analysis
Database
Supervised Analysis
Data Selection Missing value estimation
Normalization Centering
Networks Data Integration
Data Matrix
Decomposition techniques
4
A note about Affymetrix (1-color) pre-processing
cross-chip
sequence specific background correction
within-chip
within-probe set aggregation of intensity values

Two standard methods
MAS 5.0 (now GCOS/GDAS) by Affymetrix (compares
PM and MM probes)
RMA by Speed group (UC Berkeley) (ignores MM
probes)

5
Why normalize?

Microarray data have significant systematic
variation both within arrays and between arrays
that is not true biological variation
Accurate comparison of genes relative
expression within and across conditions requires
normalization of effects
Sources of variation
Spatial location on the array
Dye biases which vary with spot intensity
Plate origin
Printing/spotting quality
Experimenter

6
Normalization Thoughts

There are many different ways to normalize data
Global median, LOWESS, LOESS, RMA etc
By print tip, spatial, etc
BUT dont expect it to fix bad data!
Wont make up for lack of replicates
Wont make up for horrible slides

7
Create a boxplot of the normalized
data boxplot(mydata-1, main "Normalized
Intensities", xlab"Array", ylab"Intensities",
col"blue") To save the boxplot as a jpeg
file jpeg("normal_boxplot.jpg") boxplot(mydata-1
, main "Normalized Intensities", xlab"Array",
ylab"Intensities", col"blue") dev.off()
8
Microarray Data Analysis(slides used with
permission of Dr. John Quackenbush, Dana Farber
creator of MeV software )http//www.tm4.org/mev/
9

Classification
Hierarchical clustering
K-means clustering
Coherence
PCA
Relevance Network
Differential gene expression
T-test
Analysis of Variance
Significance of Microarray (SAM)

10
Hierarchical Clustering

A type of cluster analysis
There is both divisive and agglomerative
HCagglomerative is most commonly used
Group objects that are close to one another
based on some distance/similarity metric
Clusters are created and linked based on a metric
that evaluates the cluster-to-cluster distance
Results are displayed as a dendrogram

11
Hierarchical Clustering
g1 is most like g8
g4 is most like g1, g8
(HCL-2)
12
Hierarchical Clustering
g5 is most like g7
g5,g7 is most like g1, g4, g8
(HCL-3)
13
Hierarchical Tree
(HCL-4)
14
Hierarchical Clustering
During construction of the hierarchy, decisions
must be made to determine which clusters should
be joined. The distance or similarity between
clusters must be calculated. The rules that
govern this calculation are linkage methods.
(HCL-5)
15
Agglomerative Linkage Methods

Linkage methods are rules or metrics that return
a value that can be used to determine which
elements (clusters) should be linked.
Three linkage methods that are commonly used are
Single Linkage
Average Linkage
Complete Linkage

(HCL-6)
16
Comparison of Linkage Methods
(HCL-10)
17
Bootstrapping (ST)
Bootstrapping resampling with replacement
Original expression matrix
Various bootstrapped matrices (by experiments)
Gene 1
Gene 2
Gene 3
Gene 4
Gene 5
Gene 6
18
Jackknifing (ST)
Jackknifing resampling without replacement
Original expression matrix
Various jackknifed matrices (by experiments)
19
Analysis of Bootstrapped and Jackknifed Support
Trees

Bootstrapped or jackknifed expression matrices
are created many times by randomly resampling the
original expression matrix, using either the
bootstrap or jackknife procedure.
Each time, hierarchical trees are created from
the resampled matrices.
The trees are compared to the tree obtained from
the original data set.
The more frequently a given cluster from the
original tree is found in the resampled trees,
the stronger the support for the cluster.
As each resampled matrix lacks some of the
original data, high support for a cluster means
that the clustering is not biased by a small
subset of the data.

20
Hierarchical Clustering in R
21
Step 1 Data matrix

First you need a numeric matrix
Typical array data set will have samples as
columns and genes as rows
We want to be sure our data are in the form of an
expression matrix
Use Biobase library/package
See http//www.bioconductor.org/packages/2.2/bioc/
vignettes/Biobase/inst/doc/ExpressionSetIntroducti
on.pdf
gt exprslt-as.matrix(data, headerTRUE, sep"\t",
row.names1, as.isTRUE)

22
Step 2 Calculate Distance Matrix

Default dist() method in R uses rows as the
vectors..but we want the distance between
samples.i.e., the columns of our matrix.
There is a handy package to help us at MD
Anderson called oompaBase
source("http//bioinformatics.mdanderson.org/OOMPA
/oompaLite.R")
oompaLite()
oompainstall(groupName"all")
Once installed, be sure to locally activate the
libraries
library(oompaBase)
library(ClassDiscovery)
library(ClassComparison)
oompaBase also requires the mclust and cobs
packagesdownload these from CRAN

Use the function distanceMatrix() to create a
distance matrix of your samples.
Uses the expression set created in Step 1 as
input
Remember that there are many different types of
distance metrics to choose from!
See help(distanceMatrix)
xlt- distanceMatrix(exprs,'pearson')

24
Step 3 Cluster

Use the hclust() function to create a
hierarchical cluster based on your distance
matrix, x, created in Step 2.
gt ylt-hclust(x,method"complete")
gt plot(y)

25
Testing for Differential Gene Expression with the
T-test
26

Get the multtest package from CRAN
Package contains data from the Golub leukemia
microarray data set (ALL v AML)
38 arrays
27 from lymphoblastic
11 from myeloid

http//people.cryst.bbk.ac.uk/wernisch/macourse/
27

library(multtest)
data(golub)
golub.cl
Generate the T statistic
teststat lt-mt.teststat(golub, golub.cl)
Convert into P-values
rawp0 lt-2pt(abs(teststat),lower.tailF, df38-2)
Correct for multiple testing and show the ten
most significant genes
procs lt-c(Bonferroni, BH)
reslt-mt.rawp2adjp((rawp0), procs)
resadjp110,

http//people.cryst.bbk.ac.uk/wernisch/macourse/
28
K-Means / K-Medians Clustering (KMC) 1
1. Specify number of clusters, e.g., 5.
2. Randomly assign genes to clusters.
29
K-Means Clustering 2
3. Calculate mean / median expression profile of
each cluster.
4. Shuffle genes among clusters such that each
gene is now in the cluster whose mean / median
expression profile (calculated in step 3) is the
closest to that genes expression profile.
5. Repeat steps 3 and 4 until genes cannot be
shuffled around any more, OR a user-specified
number of iterations has been reached.
K-Means / K-Medians is most useful when the user
has an a-priori hypothesis about the number of
clusters the genes should group into.
30
Principal Components (PCAG and PCAE) 1

PCA simplifies the views of the data.
Suppose we have measurements for each gene on
multiple
experiments.
Suppose some of the experiments are correlated.
PCA will ignore the redundant experiments, and
will take a
weighted average of some of the experiments, thus
possibly making
the trends in the data more interpretable.
5. The components can be thought of as axes in
n-dimensional
space, where n is the number of components. Each
axis represents a
different trend in the data.

31
PCAG and PCAE - 2
In this example, x-axis could mean a continuum
from over-to under-expression (blue and
green genes over-expressed, yellow genes
under-expressed) y-axis could mean that gray
genes are over-expressed in first five expts and
under expressed in The remaining expts, while
brown genes are under-expressed in the first
five expts, and over-expressed in the remaining
expts. z-axis might represent different cyclic
patterns, e.g., red genes might be
over-expressed in odd-numbered expts and
under-expressed in even-numbered ones, whereas
the opposite is true for purple
genes. Interpretation of components is somewhat
subjective.
32
Relevance Networks
Set of genes whose expression profiles are
predictive of one another.
Can be used to identify negative correlations
between genes
Genes with low entropy (least variable across
experiments) are excluded from analysis.
33
Relevance Networks
Tmin 0.50
The expression pattern of each gene compared to
that of every other gene.
The remaining relationships between genes define
the subnets
Tmax 0.90
Correlation coefficients outside the boundaries
defined by the minimum and maximum thresholds are
eliminated.
The ability of each gene to predict the
expression of each other gene is assigned a
correlation coefficient
34
T-Tests (TTEST) Between subjects (or unpaired)
- 1

Assign experiments to two groups, e.g., in the
expression matrix
below, assign Experiments 1, 2 and 5 to group A,
and
experiments 3, 4 and 6 to group B.

2. Question Is mean expression level of a gene
in group A significantly different from mean
expression level in group B?
35
TTEST Between subjects - 2
3. Calculate t-statistic for each gene
4. Calculate probability value of the t-statistic
for each gene either from A. Theoretical
t-distribution OR B. Permutation tests.
36
TTEST - Between subjects - 3
Permutation tests
i) For each gene, compute t-statistic
ii) Randomly shuffle the values of the gene
between groups A and B, such that the reshuffled
groups A and B respectively have the same number
of elements as the original groups A and B.
Original grouping
Randomized grouping
37
TTEST - Between subjects - 4
Permutation tests - continued
iii) Compute t-statistic for the randomized
gene iv) Repeat steps i-iii n times (where n is
specified by the user). v) Let x the number of
times the absolute value of the original
t-statistic exceeds the absolute values of the
randomized t-statistic over n randomizations. vi
) Then, the p-value associated with the gene 1
(x/n)
38
TTEST - Between subjects - 5

5. Determine whether a genes expression levels
are significantly
different between the two groups by one of three
methods
Just alpha If the calculated p-value for a gene
is less than
or equal to the user-input alpha (critical
p-value), the gene is
considered significant.
OR
Use Bonferroni corrections to reduce the
probability of
erroneously classifying non-significant genes as
significant.
B) Standard Bonferroni correction The user-input
alpha is divided
by the total number of genes to give a critical
p-value that is used
as above.

39
TTEST - Between subjects 6
5C) Adjusted Bonferroni i) The t-values for
all the genes are ranked in descending order.
ii) For the gene with the highest t-value, the
critical p-value becomes (alpha / N), where N is
the total number of genes for the gene with the
second-highest t-value, the critical p-value will
be (alpha/ N-1), and so on.
40

The problem of multiple testing
(adapted from presentation by Anja von
Heydebreck, MaxPlanckInstitute for Molecular
Genetics,
Dept. Computational Molecular Biology, Berlin,
Germany
http//www.bioconductor.org/workshops/Heidelberg02
/mult.pdf)
Lets imagine there are 10,000 genes on a chip,
AND
None of them is differentially expressed.
Suppose we use a statistical test for
differential
expression, where we consider a gene to be
differentially expressed if it meets the
criterion at a
p-value of p lt 0.05.

The problem of multiple testing 2
Lets say that applying this test to gene G1
yields a p-value of p 0.01
Remember that a p-value of 0.01 means that there
is a 1 chance that the gene is not
differentially expressed, i.e.,
Even though we conclude that the gene is
differentially expressed (because p lt 0.05),
there is a 1 chance that our conclusion is
wrong.
We might be willing to live with such a low
probability
of being wrong
BUT .....

The problem of multiple testing 3
We are testing 10,000 genes, not just one!!!
Even though none of the genes is differentially
expressed, about 5 of the genes (i.e., 500
genes) will be erroneously concluded to be
differentially expressed, because we have decided
to live with a p-value of 0.05
If only one gene were being studied, a 5 margin
of error might not be a big deal, but 500 false
conclusions in one study? That doesnt sound too
good.

The problem of multiple testing - 4
There are tricks we can use to reduce the
severity of
this problem.
They all involve slashing the p-value for each
test
(i.e., gene), so that while the critical p-value
for the entire
data set might still equal 0.05, each gene will
be
evaluated at a lower p-value.

Dont get too hung up on p-values.
Ultimately, what matters is biological
relevance.
P-values should help you evaluate the strength of
the
evidence, rather than being used as an absolute
yardstick
of significance. Statistical significance is not
necessarily
the same as biological significance.

45
Significance analysis of microarrays (SAM)

SAM can be used to pick out significant genes
based on differential expression between sets of
samples.

46
SAM -2

SAM gives estimates of the False Discovery Rate
(FDR), which is the proportion of genes likely to
have been wrongly identified by chance as being
significant.
It is a very interactive algorithm allows users
to dynamically change thresholds for significance
(through the tuning parameter delta) after
looking at the distribution of the test
statistic.
The ability to dynamically alter the input
parameters based on immediate visual feedback,
even before completing the analysis, should make
the data-mining process more sensitive.

47
SAM designs

Two-class unpaired to pick out genes whose mean
expression level is significantly different
between two groups of samples (analogous to
between subjects t-test).
Two-class paired samples are split into two
groups, and there is a 1-to-1 correspondence
between an sample in group A and one in group B
(analogous to paired t-test).

48
SAM designs - 2

Multi-class picks up genes whose mean expression
is different across gt 2 groups of samples
(analogous to one-way ANOVA)
Censored survival picks up genes whose
expression levels are correlated with duration of
survival.
One-class picks up genes whose mean expression
across experiments is different from a
user-specified mean.

49
SAM Two-Class Unpaired 4

Write a Comment

User Comments (0)