Title: Some Statistical Issues in Analyzing Gene Expression
1Some Statistical Issues in Analyzing Gene
Expression
Geoff McLachlan Department of Mathematics
Institute for Molecular Bioscience University of
Queensland http://www.maths.uq.edu.au/gjm
2Institute for Molecular Bioscience, University
of Queensland
3Today's Programme: Topic is Analysis of Gene
Expressions
10Analyzing Microarray Gene Expression Data
Analysis of Microarray Gene Expression Data
The Analysis of Gene Expression Data
The Statistical Analysis of Gene Expression Data
Statistics for Microarrays
Design and Analysis of DNA Microarrays
Exploration and Analysis of Microarrays
11In the sequel, references to most of the material
presented can be found in my joint
book, McLachlan, Do, and Ambroise (2004),
Analyzing Microarray Gene Expression Data,
Hoboken, NJ: Wiley.
12Distribution of References by Year
13Contents
- Microarrays in Gene Expression Studies
- Cleaning and Normalization
- Some Cluster Analysis Methods
- Clustering of Tissue Samples
- Screening and Clustering of Genes
- Discriminant Analysis
- Supervised Classification of Tissue Samples
- Linking Microarray Data with Survival Analysis
14OUTLINE OF TALK
- INTRODUCTION TO ARRAY TECHNOLOGY
- STATISTICAL ISSUES
- DIFFERENTIAL EXPRESSION
15Large-scale gene expression studies are not a
passing fashion, but are instead one aspect of
new work of biological experimentation, one
involving large-scale, high throughput assays.
Speed et al., 2002, Statistical Analysis of Gene
Expression Microarray Data, Chapman and Hall/CRC
16Growth of microarray and microarray methodology
literature listed in PubMed from 1995 to 2003.
The category "all microarray papers" includes
those found by searching PubMed for "microarray"
OR "gene expression profiling". The category
"statistical microarray papers" includes those
found by searching PubMed for "statistical
method" OR "statistical techniq" OR
"statistical approach" AND "microarray" OR "gene
expression profiling".
17Mehta et al. (Nature Genetics, Sept. 2004):
The field of expression data analysis is
particularly active with novel analysis
strategies and tools being published weekly, and
the value of many of these methods is
questionable. Some results produced by using
these methods are so anomalous that a breed of
forensic statisticians (Ambroise and McLachlan,
2002; Baggerly et al., 2003), who doggedly detect
and correct other HDB (high-dimensional biology)
investigators' prominent mistakes, has been
created.
18A microarray is a new technology which allows the
measurement of the expression levels of
thousands of genes simultaneously.
- The entire genome of an organism can be probed at
a single point in time.
- Possible due to
(1) sequencing of the genome (human, mouse, and
others)
(2) improvement in technology to generate
high-density arrays on chips (glass slides or nylon
membrane).
19The Microarray Experiment Indirectly Measures
Levels of mRNA
Genes code for proteins through the intermediary
of mRNA, which is relatively unstable and
short-lived.
mRNA directs the production of cellular proteins
(although protein synthesis and activation are
not regulated solely at the mRNA level in the
cell).
mRNA measurement can be used to estimate cellular
changes in response to external signals or
environmental changes.
Measuring mRNA is therefore critical to the
study of gene expression.
20The Central Dogma of Molecular Biology
DNA → RNA → Protein
21Principles of the Microarray Experiment
Relies on hybridization: binding of
single-stranded nucleic acid sequences to their
complement.
The probe DNA, of known sequences, is immobilized
on a glass slide, in wells or spots arranged in
a grid.
The sample (target) RNA is extracted, then
reverse transcribed to DNA and labelled with
fluorescent dye. This target DNA, made up of a
mixture of unknown sequences, is hybridized onto
the slide.
If present, complementary sequences in the target
DNA bind to the probe DNA. Signal from the bound
target is measured as fluorescence intensity,
corresponding to each spot (probe) on the slide.
22 Implicit in the use of microarrays is the
belief that the intensity measured in a microarray
experiment is a measure of the abundance of the
mRNA in the sample. Conrad Burden, who is
speaking next, will show that there is not a
linear relationship between the intensity
(scanned fluorescence) and the concentration of
the mRNA.
23Objectives of a Microarray Experiment
- observe changes in a gene in response to
external stimuli (cell samples exposed to hormones, drugs,
toxins)
- compare gene expressions between different
tissue types (tumour vs normal cell samples)
- To gain understanding of
  - function of unknown genes
  - disease process at the molecular level
- Ultimately to use as tools in Clinical Medicine
for diagnosis, prognosis and therapeutic management.
27Heat Map of Genes in Group G1
28Heat Map of Genes in Group G2
31The technology is rapidly changing and advancing,
and the measurement of gene expression can be
expected to improve.
32The Microarray Technologies
Spotted Microarray
- cDNAs, clones, or short and long
oligonucleotides deposited onto glass slides
- Each gene (or EST) represented by its
purified PCR product
- Simultaneous analysis of two samples (treated
vs untreated cells) provides internal control
- Measures relative gene expressions
Affymetrix GeneChip
- Short oligonucleotides synthesized in situ onto
glass wafers
- Each gene represented multiply, using 16-20
(preferably non-overlapping) 25-mers
- Each oligonucleotide has a single-base mismatch
partner for internal control of hybridization
specificity
- Measures absolute gene expressions
Each with its own advantages and disadvantages
33A Spotted Microarray Study
- Choosing cell populations, e.g. tumour vs normal
cells.
- mRNA extraction and reverse transcription to
cDNA.
- Fluorescent labelling of cDNAs.
- Hybridization to a DNA microarray.
- Scanning the hybridized array.
- Interpreting the scanned image.
36Microarray image showing differentially expressed
genes
- Red spots: High expression in target
labelled with cyanine 5 dye
- Green spots: High expression in target
labelled with cyanine 3 dye
- Yellow spots: Similar expression in both
target samples
37Pros and Cons of the Technologies
Spotted Microarray
- Flexible and cheaper
- Allows study of genes not yet sequenced
(spotted ESTs can be used to discover new genes
and their functions)
- Variability in spot quality from slide to slide
- Provides information only on relative gene
expressions between cells or tissue samples
Affymetrix GeneChip
- More expensive yet less flexible
- Good for whole genome expression analysis where
the genome of that organism has been sequenced
- High quality with little variability between
slides
- Gives a measure of absolute expression of genes
38Sources of Experimental Error
(1) In Target DNA samples
- Poor quality or insufficient sample RNA
- Varying efficiency of reverse transcription of
sample mRNAs (reverse transcription bias)
- Fluorescent dyes bind with greater affinity to
different nucleotides (sequence bias)
- Varying efficiency of hybridization for
different genes under different experimental
conditions (temperature, ionic strength)
- Sample contamination
(2) In Probe DNA fixed on the array
- Cross-linking of fixed DNA strands into
double-stranded forms
- Defective chips, scratches, degradation of
probes
- Poor probe design leading to cross hybridization
(3) Detection of bound target using Fluorescence
- Background light from non-specific hybridization
(noise)
- Fluorescence is a non-linear phenomenon: linear
only over a limited range
39Experimental Design - Replicates
Technical replicates: arrays that have been
hybridized to the same biological source (using
the same treatment, protocols, etc.)
Biological replicates: arrays that have been
hybridized to different biological sources, but
with the same preparation, treatments, etc.
40Data Cleaning and Normalization
- Cleaning
  - Image processing to extract information
  - Missing value estimation
- Normalization
  - Methods for spotted arrays vs Affymetrix
41Rocke and Durbin (2001), Munson (2001), Durbin et
al. (2002), and Huber et al. (2002)
42Mixture of 2 normal components
43Mixture of 2 t components
44Mixture of 2 t components
45Mixture of 3 t components
46Microarrays present new problems for statistics
because the data is very high dimensional with
very little replication.
49 DETECTION OF DIFFERENTIALLY
EXPRESSED GENES
51Class 1
Class 2
52SIMPLEST METHOD: FOLD CHANGE
Calculate the log ratio between the two classes
and consider all genes that differ by more than
an arbitrary cutoff value to be differentially
expressed. A two-fold difference is often chosen.
Fold change is not a statistical test.
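The fold-change rule can be sketched in a few lines. This is only an illustration under stated assumptions: the gene names and intensities are made-up toy data, the per-class summary is taken to be the arithmetic mean, and the helper names `log2_fold_changes` and `call_by_fold_change` are hypothetical.

```python
import math

def log2_fold_changes(class1, class2):
    """Per-gene log2 ratio of mean expression, class 2 over class 1.

    class1, class2: dict mapping gene name -> list of positive intensities.
    """
    result = {}
    for gene in class1:
        mean1 = sum(class1[gene]) / len(class1[gene])
        mean2 = sum(class2[gene]) / len(class2[gene])
        result[gene] = math.log2(mean2 / mean1)
    return result

def call_by_fold_change(log_ratios, fold=2.0):
    """Genes whose absolute log2 ratio exceeds log2(fold), in sorted order."""
    cutoff = math.log2(fold)
    return sorted(g for g, lr in log_ratios.items() if abs(lr) > cutoff)

# Hypothetical toy data: three genes, three arrays per class.
class1 = {"geneA": [100.0, 120.0, 110.0],
          "geneB": [200.0, 210.0, 190.0],
          "geneC": [400.0, 380.0, 420.0]}
class2 = {"geneA": [450.0, 430.0, 440.0],
          "geneB": [220.0, 200.0, 210.0],
          "geneC": [90.0, 110.0, 100.0]}

log_ratios = log2_fold_changes(class1, class2)
selected = call_by_fold_change(log_ratios, fold=2.0)
print(selected)
```

The rule uses no estimate of variability, which is why it is not a statistical test: a gene with a large ratio but very noisy replicates is treated the same as one with a stable ratio.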
53Two-Sample t-Statistic
54Two-Sample t-Statistic
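The formula on these slides did not survive in the transcript. As a stand-in, here is a minimal sketch of the usual two-sample t-statistic for one gene, assuming either a pooled variance (Student) or separate variances (Welch); which form the talk used is not known, and the function name `two_sample_t` is hypothetical.

```python
import math
from statistics import mean, variance

def two_sample_t(x1, x2, pooled=True):
    """Two-sample t-statistic for one gene measured in two classes."""
    n1, n2 = len(x1), len(x2)
    m1, m2 = mean(x1), mean(x2)
    v1, v2 = variance(x1), variance(x2)  # sample variances, divisor n - 1
    if pooled:
        # Student form: common variance estimated from both classes
        sp2 = ((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2)
        se = math.sqrt(sp2 * (1.0 / n1 + 1.0 / n2))
    else:
        # Welch form: separate variance per class
        se = math.sqrt(v1 / n1 + v2 / n2)
    return (m1 - m2) / se

# Hypothetical log expressions for one gene in Class 1 and Class 2.
t_pooled = two_sample_t([1.0, 2.0, 3.0], [4.0, 5.0, 6.0], pooled=True)
t_welch = two_sample_t([1.0, 2.0, 3.0], [4.0, 5.0, 6.0], pooled=False)
print(t_pooled, t_welch)
```

With equal class sizes and equal sample variances, as in this toy input, the two forms coincide.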
55 10,000 genes
If all 10,000 genes were not differentially
expressed, then we would expect:
- for P = 0.05 for each test, 500 false positives
- for P = 0.05/10,000 for each test, 0.05 false
positives
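The arithmetic behind these counts is just (number of null genes) × (per-test threshold). The sketch below reproduces it and adds a small simulation check, assuming that P-values of null genes are uniform on (0, 1) and independent; the function name and the random seed are illustrative choices.

```python
import random

def expected_false_positives(n_null_genes, threshold):
    """Expected number of null genes with a P-value below the threshold."""
    return n_null_genes * threshold

n_genes = 10_000
alpha = 0.05

unadjusted = expected_false_positives(n_genes, alpha)            # threshold 0.05
bonferroni = expected_false_positives(n_genes, alpha / n_genes)  # threshold 0.05/10,000

# Simulation check: draw one uniform P-value per null gene and count
# how many fall below the unadjusted threshold.
rng = random.Random(1)
simulated_count = sum(rng.random() < alpha for _ in range(n_genes))

print(unadjusted, bonferroni, simulated_count)
```

The second threshold is the Bonferroni correction: it keeps the expected number of false positives at 0.05 over the whole experiment, at the price of very little power when thousands of genes are tested.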
57False Discovery Rate (FDR)
The FDR emphasizes the proportion of false
positives among the identified differentially
expressed genes.
58Possible Outcomes for N Hypothesis Tests
60Controlling FDR
Benjamini and Hochberg (1995)
Key papers on controlling the FDR
- Genovese and Wasserman (2002)
- Storey (2002, 2003)
- Storey and Tibshirani (2003a, 2003b)
- Storey, Taylor and Siegmund (2004)
- Black (2004)
- Cox and Wong (2004)
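For orientation, the step-up procedure of Benjamini and Hochberg (1995) can be sketched as follows: sort the P-values, find the largest rank k with p_(k) ≤ k·α/N, and reject the k smallest. The function name and the example P-values are hypothetical, and this plain version assumes the standard conditions (independent or positively dependent tests) under which the procedure controls the FDR at level α.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return a list of booleans, True where the hypothesis is rejected."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Largest 1-based rank k such that p_(k) <= k * alpha / m
    k_max = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank * alpha / m:
            k_max = rank
    # Reject every hypothesis ranked at or below k_max
    rejected = [False] * m
    for idx in order[:k_max]:
        rejected[idx] = True
    return rejected

# Hypothetical P-values for ten genes, in no particular order.
p = [0.041, 0.001, 0.216, 0.008, 0.039, 0.060, 0.074, 0.205, 0.212, 0.042]
rejected = benjamini_hochberg(p, alpha=0.05)
print([p[i] for i, r in enumerate(rejected) if r])
```

Note that the search is for the largest qualifying rank, so a P-value can be rejected even if it exceeds its own threshold, provided a larger P-value further down the list meets its threshold.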
61q-VALUE
The q-value of gene j is the expected proportion
of false positives when calling that gene
significant.
The P-value is the probability under the null
hypothesis of obtaining a value of the test
statistic as or more extreme than its observed
value.
The q-value for an observed test statistic can be
viewed as the expected proportion of false
positives among all genes with test statistics as
or more extreme than the observed value.
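A minimal sketch of how q-values might be computed from P-values, in the spirit of Storey (2002): estimate the proportion π0 of null genes from the P-values above a tuning value λ, then convert each ordered P-value into an estimated FDR and enforce monotonicity. The fixed choice λ = 0.5, the function name `q_values` and the example P-values are assumptions made for illustration; published implementations choose λ more carefully.

```python
def q_values(p_values, lam=0.5):
    """Return (pi0 estimate, list of q-values in the input order)."""
    m = len(p_values)
    # P-values above lam are assumed to come mostly from null genes
    pi0 = min(1.0, sum(p > lam for p in p_values) / (m * (1.0 - lam)))
    order = sorted(range(m), key=lambda i: p_values[i])
    q = [0.0] * m
    running_min = 1.0
    # Walk from the largest P-value down so each q-value is the minimum
    # estimated FDR over all thresholds at or above that P-value
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pi0 * m * p_values[idx] / rank)
        q[idx] = running_min
    return pi0, q

# Hypothetical P-values for ten genes.
p = [0.001, 0.002, 0.003, 0.004, 0.005, 0.006, 0.3, 0.6, 0.7, 0.9]
pi0_hat, q = q_values(p)
significant = [i for i, qj in enumerate(q) if qj < 0.05]
print(pi0_hat, significant)
```

Setting π0 to 1 in this sketch gives the adjusted P-values of the Benjamini and Hochberg procedure, which is the conservative special case.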
62LIST OF SIGNIFICANT GENES
Call all genes significant if $p_j < 0.05$
or
Call all genes significant if $q_j < 0.05$, to
produce a set of significant genes so that a
proportion of them (< 0.05) is expected to be
false (at least for a large number of genes, not
necessarily independent)
63BRCA1 versus BRCA2-mutation positive tumours
(Hedenfalk et al., 2001)
BRCA1 (7) versus BRCA2-mutation (8) positive
tumours, p = 3226 genes
P = .001 gave 51 genes differentially expressed
P = 0.0001 gave 9-11 genes
64Using q < 0.05, 160 genes are taken to be
significant. This means that approximately 8 of
these 160 genes are expected to be false
positives. Also, it is estimated that 33% of the
genes are differentially expressed.
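68Two-component model
$f(t_j) = \pi_0 f_0(t_j) + \pi_1 f_1(t_j)$
$\pi_0$ is the proportion of genes that are not
differentially expressed, and
$\pi_1 = 1 - \pi_0$ is the proportion that are.
Then
$\tau_0(t_j) = \pi_0 f_0(t_j) / f(t_j)$
is the posterior probability that gene j is not
differentially expressed.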
691) Form a statistic $t_j$ for each gene, where a
large positive value of $t_j$ corresponds to a gene
that is differentially expressed across the
tissues.
2) Compute the $P_j$-values according to
the $t_j$ and fit a mixture of beta distributions
(including a uniform component) to them, where
the latter corresponds to the class of genes
that are not differentially expressed.
or
703) Fit to $t_1, \ldots, t_p$ a mixture of two normal
densities with a common variance, where the
first component has the smaller mean (it
corresponds to the class of genes that are not
differentially expressed). It is assumed that
the $t_j$ have been transformed so that they are
normally distributed (approximately).
4) Let $\hat{\tau}_0(t_j)$ denote the (estimated)
posterior probability that gene j belongs to the
first component of the mixture.
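Steps 3 and 4 can be sketched with a plain EM algorithm. This is an illustration under stated assumptions, not the talk's own implementation: the statistics are simulated (800 null genes centred at 0, 200 differentially expressed genes centred at 3), the starting values are crude quartile guesses, a fixed number of iterations replaces a convergence test, and all function names are hypothetical.

```python
import math
import random

def normal_pdf(x, mu, var):
    """Density of a normal distribution with mean mu and variance var."""
    return math.exp(-0.5 * (x - mu) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def posterior_null(t, pi0, mu0, mu1, var):
    """Posterior probability that each t_j belongs to the first component."""
    tau0 = []
    for x in t:
        a = pi0 * normal_pdf(x, mu0, var)
        b = (1.0 - pi0) * normal_pdf(x, mu1, var)
        tau0.append(a / (a + b))
    return tau0

def fit_two_normal_mixture(t, n_iter=200):
    """EM for a two-component normal mixture with a common variance."""
    p = len(t)
    s = sorted(t)
    mu0, mu1 = s[p // 4], s[(3 * p) // 4]  # crude start: lower and upper quartiles
    centre = sum(t) / p
    var = sum((x - centre) ** 2 for x in t) / p
    pi0 = 0.5
    for _ in range(n_iter):
        tau0 = posterior_null(t, pi0, mu0, mu1, var)   # E-step
        n0 = sum(tau0)
        pi0 = n0 / p                                   # M-step
        mu0 = sum(w * x for w, x in zip(tau0, t)) / n0
        mu1 = sum((1.0 - w) * x for w, x in zip(tau0, t)) / (p - n0)
        var = sum(w * (x - mu0) ** 2 + (1.0 - w) * (x - mu1) ** 2
                  for w, x in zip(tau0, t)) / p
    if mu0 > mu1:  # keep the first component as the one with the smaller mean
        mu0, mu1, pi0 = mu1, mu0, 1.0 - pi0
    return pi0, mu0, mu1, var

# Simulated statistics, assumed already transformed to approximate normality.
rng = random.Random(0)
t = [rng.gauss(0.0, 1.0) for _ in range(800)] + [rng.gauss(3.0, 1.0) for _ in range(200)]

pi0_hat, mu0_hat, mu1_hat, var_hat = fit_two_normal_mixture(t)
tau0_hat = posterior_null(t, pi0_hat, mu0_hat, mu1_hat, var_hat)
print(pi0_hat, mu0_hat, mu1_hat, var_hat)
```

The list `tau0_hat` holds the estimated posterior probabilities of step 4, one per gene, in the same order as the input statistics.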
71If we conclude that gene j is differentially
expressed if
$\hat{\tau}_0(t_j) \le c_0$,
then this decision minimizes the (estimated)
Bayes risk
where
72Estimated FDR
where
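The formula on this slide is not recoverable from the transcript. One natural estimate, assumed here rather than taken from the slide, averages the estimated posterior probabilities of non-differential expression over the genes that are selected, that is, over the genes whose posterior probability is at most the threshold c0. The function name and the example values are hypothetical.

```python
def estimated_fdr(tau0, c0):
    """Return (estimated FDR, number of selected genes).

    tau0: estimated posterior probabilities that each gene is NOT
          differentially expressed.
    c0:   threshold; a gene is selected when its tau0 is at most c0.
    """
    selected = [x for x in tau0 if x <= c0]
    if not selected:
        return 0.0, 0  # nothing selected, so no false discoveries
    return sum(selected) / len(selected), len(selected)

# Hypothetical posterior probabilities for five genes.
tau0 = [0.01, 0.05, 0.20, 0.60, 0.90]
fdr_hat, n_selected = estimated_fdr(tau0, c0=0.1)
print(fdr_hat, n_selected)
```

Because every selected gene has a posterior probability of at most c0, this estimate can never exceed c0, which gives a direct way to choose the threshold for a target FDR.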