Some Statistical Issues in Analyzing Gene Expression

1
Some Statistical Issues in Analyzing Gene
Expression
Geoff McLachlan
Department of Mathematics
Institute for Molecular Bioscience
University of Queensland
http://www.maths.uq.edu.au/gjm
2
Institute for Molecular Bioscience, University
of Queensland
3
Today's Programme: Topic is Analysis of Gene
Expressions
4
Analyzing Microarray Gene Expression Data
5
Analyzing Microarray Gene Expression Data
Analysis of Microarray Gene Expression Data
6
Analyzing Microarray Gene Expression Data
Analysis of Microarray Gene Expression Data
The Analysis of Gene Expression Data
7
Analyzing Microarray Gene Expression Data
Analysis of Microarray Gene Expression Data
The Analysis of Gene Expression Data
The Statistical Analysis of Gene Expression Data
8
Analyzing Microarray Gene Expression Data
Analysis of Microarray Gene Expression Data
The Analysis of Gene Expression Data
The Statistical Analysis of Gene Expression Data
Statistics for Microarrays
9
Analyzing Microarray Gene Expression Data
Analysis of Microarray Gene Expression Data
The Analysis of Gene Expression Data
The Statistical Analysis of Gene Expression Data
Statistics for Microarrays
Design and Analysis of DNA Microarrays
10
Analyzing Microarray Gene Expression Data
Analysis of Microarray Gene Expression Data
The Analysis of Gene Expression Data
The Statistical Analysis of Gene Expression Data
Statistics for Microarrays
Design and Analysis of DNA Microarrays
Exploration and Analysis of Microarrays
11
In the sequel, references to most of the material
presented can be found in my joint
book, McLachlan, Do, and Ambroise (2004),
Analyzing Microarray Gene Expression Data,
Hoboken, NJ: Wiley.
12
Distribution of References by Year
13
Contents
  • Microarrays in Gene Expression Studies
  • Cleaning and Normalization
  • Some Cluster Analysis Methods
  • Clustering of Tissue Samples
  • Screening and Clustering of Genes
  • Discriminant Analysis
  • Supervised Classification of Tissue Samples
  • Linking Microarray Data with Survival Analysis

14
OUTLINE OF TALK
  • INTRODUCTION TO ARRAY TECHNOLOGY
  • STATISTICAL ISSUES
  • DIFFERENTIAL EXPRESSION

15
"Large-scale gene expression studies are not a
passing fashion, but are instead one aspect of
new work of biological experimentation, one
involving large-scale, high throughput assays."
Speed et al., 2002, Statistical Analysis of Gene
Expression Microarray Data, Chapman and Hall/CRC
16
Growth of microarray and microarray methodology
literature listed in PubMed from 1995 to 2003.
The category "all microarray papers" includes
those found by searching PubMed for microarray
OR "gene expression profiling". The category
"statistical microarray papers" includes those
found by searching PubMed for "statistical
method" OR "statistical techniq" OR
"statistical approach" AND microarray OR "gene
expression profiling".
17
Mehta et al. (Nature Genetics, Sept. 2004)
"The field of expression data analysis is
particularly active with novel analysis
strategies and tools being published weekly, and
the value of many of these methods is
questionable. Some results produced by using
these methods are so anomalous that a breed of
forensic statisticians (Ambroise and McLachlan,
2002; Baggerly et al., 2003) who doggedly detect
and correct other HDB (high-dimensional biology)
investigators' prominent mistakes, has been
created."
18
A microarray is a new technology which allows the
measurement of the expression levels of
thousands of genes simultaneously.
  • The entire genome of an organism can be probed at
    a single point in time.
  • Possible due to
    (1) sequencing of the genome (human, mouse, and
    others)
    (2) improvement in technology to generate
    high-density arrays on chips (glass slides or
    nylon membrane).

19
The Microarray Experiment Indirectly Measures
Levels of mRNA
Genes code for proteins through the intermediary
of mRNA, which is relatively unstable and short
lived.
mRNA directs the production of cellular proteins
(although protein synthesis and activation are
not regulated solely at the mRNA level in the
cell).
mRNA measurement can be used to estimate cellular
changes in response to external signals or
environmental changes.
Measuring mRNA is therefore critical to the
study of gene expression.
20
The Central Dogma of Molecular Biology
DNA → RNA → Protein
21
Principles of the Microarray Experiment
Relies on hybridization: binding of
single-stranded nucleic acid sequences to their
complement.
The probe DNA, of known sequences, is immobilized
on a glass slide, in wells or spots arranged in
a grid.
The sample (target) RNA is extracted, then
reverse transcribed to DNA and labelled with
fluorescent dye. This target DNA, made up of a
mixture of unknown sequences, is hybridized onto
the slide.
If present, complementary sequences in the target
DNA bind to the probe DNA. Signal from the bound
target is measured as fluorescence intensity,
corresponding to each spot (probe) on the slide.
22
Implicit in the use of microarrays is the
belief that intensity measured in a microarray
experiment is a measure of the abundance of the
mRNA in the sample. Conrad Burden, who is
speaking next, will show that the relationship
between the intensity (scanned fluorescence) and
the concentration of the mRNA is not linear.
23
Objectives of a Microarray Experiment
  • observe changes in a gene in response to
    external stimuli (cell samples exposed to
    hormones, drugs, toxins)
  • compare gene expressions between different
    tissue types (tumour vs normal cell samples)
  • To gain understanding of the function of unknown
    genes and of the disease process at the molecular
    level
  • Ultimately to use as tools in Clinical Medicine
    for diagnosis, prognosis and therapeutic
    management.

24
(No Transcript)
25
(No Transcript)
26
(No Transcript)
27
Heat Map of Genes in Group G1
28
Heat Map of Genes in Group G2
29
(No Transcript)
30
(No Transcript)
31
The technology is rapidly changing and advancing,
and the measurement of gene expression can be
expected to improve.
32
The Microarray Technologies
Spotted Microarray
  • cDNAs, clones, or short and long
    oligonucleotides deposited onto glass slides
  • Each gene (or EST) represented by its
    purified PCR product
  • Simultaneous analysis of two samples (treated
    vs untreated cells) provides internal control
  • Gives relative gene expressions

Affymetrix GeneChip
  • Short oligonucleotides synthesized in situ onto
    glass wafers
  • Each gene represented multiply, using 16-20
    (preferably non-overlapping) 25-mers
  • Each oligonucleotide has a single-base mismatch
    partner for internal control of hybridization
    specificity
  • Gives absolute gene expressions

Each with its own advantages and disadvantages
33
A Spotted Microarray Study
  • Choosing cell populations e.g. tumour vs normal
    cells.
  • mRNA extraction and reverse transcription to
    cDNA.
  • Fluorescent labelling of cDNAs.
  • Hybridization to a DNA microarray.
  • Scanning the hybridized array.
  • Interpreting the scanned image.

34
(No Transcript)
35
(No Transcript)
36
Microarray image showing differentially expressed
genes
Red spots: High expression in target labelled
with cyanine 5 dye
Green spots: High expression in target labelled
with cyanine 3 dye
Yellow spots: Similar expression in both target
samples
37
Pros and Cons of the Technologies
Spotted Microarray
  • Flexible and cheaper
  • Allows study of genes not yet sequenced (spotted
    ESTs can be used to discover new genes and their
    functions)
  • Variability in spot quality from slide to slide
  • Provide information only on relative gene
    expressions between cells or tissue samples

Affymetrix GeneChip
  • More expensive yet less flexible
  • Good for whole genome expression analysis where
    genome of that organism has been sequenced
  • High quality with little variability between
    slides
  • Gives a measure of absolute expression of genes

38
Sources of Experimental Error
(1) In Target DNA samples
  • Poor quality or insufficient sample RNA
  • Varying efficiency of reverse transcription of
    sample mRNAs (reverse transcription bias)
  • Fluorescent dyes bind with greater affinity to
    different nucleotides (sequence bias)
  • Varying efficiency of hybridization for
    different genes under different experimental
    conditions (temperature, ionic strength)
  • Sample contamination

(2) In Probe DNA fixed on the array
  • Cross-linking of fixed DNA strands into
    double-stranded forms
  • Defective chips, Scratches, Degradation of
    probes
  • Poor probe design leading to cross hybridization

(3) Detection of bound target using Fluorescence
  • Background light from non-specific hybridization
    (noise)
  • Fluorescence is a non-linear phenomenon: linear
    only over a limited range

39
Experimental Design - Replicates
Technical replicates: arrays that have been
hybridized to the same biological source (using
the same treatment, protocols, etc.)
Biological replicates: arrays that have been
hybridized to different biological sources, but
with the same preparation, treatments, etc.
40
Data Cleaning and Normalization
Cleaning
  • Image processing to extract information
  • Missing value estimation

Normalization
  • Methods for spotted arrays vs Affymetrix
41
Rocke and Durbin (2001), Munson (2001), Durbin et
al. (2002), and Huber et al. (2002)
42
Mixture of 2 normal components
43
Mixture of 2 t components
44
Mixture of 2 t components
45
Mixture of 3 t components
46
Microarrays present new problems for statistics
because the data is very high dimensional with
very little replication.
47
(No Transcript)
48
(No Transcript)
49
DETECTION OF DIFFERENTIALLY
EXPRESSED GENES
50
(No Transcript)
51
Class 1
Class 2
52
SIMPLEST METHOD: FOLD CHANGE
Calculate the log ratio between the two classes
and consider all genes that differ by more than
an arbitrary cutoff value to be differentially
expressed. A two-fold difference is often chosen.
Fold change is not a statistical test.
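The fold-change rule can be sketched in a few lines of Python. This is a minimal illustration, assuming raw-scale positive expression values, class means as the summary, and a base-2 log ratio; the function name `fold_change_calls` and the toy data are hypothetical and not from the talk.

```python
import math

def fold_change_calls(class1, class2, cutoff=2.0):
    """Flag genes whose class means differ by more than `cutoff`-fold.

    class1, class2: dicts mapping gene name -> list of positive expression
    values on the raw scale (hypothetical input layout).
    Returns a dict gene -> (log2 ratio of class 2 to class 1, flagged).
    """
    log_cutoff = math.log2(cutoff)
    calls = {}
    for gene in class1:
        mean1 = sum(class1[gene]) / len(class1[gene])
        mean2 = sum(class2[gene]) / len(class2[gene])
        log_ratio = math.log2(mean2 / mean1)
        # Up- or down-regulation both count, so compare the absolute log ratio.
        calls[gene] = (log_ratio, abs(log_ratio) > log_cutoff)
    return calls

# Toy data (made up for illustration only).
class1 = {"geneA": [100.0, 120.0, 110.0], "geneB": [50.0, 55.0, 45.0]}
class2 = {"geneA": [400.0, 480.0, 440.0], "geneB": [60.0, 52.0, 53.0]}
calls = fold_change_calls(class1, class2)
```

Note that the rule ignores the within-class variability of each gene, which is why it is not a statistical test.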
53
Two-Sample t-Statistic
54
Two-Sample t-Statistic
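A minimal sketch of a two-sample t-statistic for a single gene, assuming the usual pooled-variance form; an unequal-variance (Welch) standard error is offered as an option, since which form the talk used is not stated in this transcript. The function name `two_sample_t` and the example values are hypothetical.

```python
import math
from statistics import mean, variance

def two_sample_t(x, y, pooled=True):
    """Two-sample t-statistic for one gene.

    x, y: expression values of the gene in class 1 and class 2.
    pooled=True uses the pooled variance; pooled=False uses the
    unequal-variance (Welch) standard error.
    """
    n1, n2 = len(x), len(y)
    v1, v2 = variance(x), variance(y)  # sample variances (n - 1 denominator)
    if pooled:
        sp2 = ((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2)
        se = math.sqrt(sp2 * (1.0 / n1 + 1.0 / n2))
    else:
        se = math.sqrt(v1 / n1 + v2 / n2)
    return (mean(x) - mean(y)) / se

# Toy values for one gene (made up for illustration only).
t_pooled = two_sample_t([1.0, 2.0, 3.0], [4.0, 5.0, 6.0])
t_welch = two_sample_t([1.0, 2.0, 3.0], [4.0, 5.0, 6.0], pooled=False)
```

In a microarray study this statistic would be computed once per gene, giving thousands of tests.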
55
10,000 genes
If all 10,000 genes were not differentially
expressed, then we would expect:
  • for a P-value threshold of 0.05 for each test,
    500 false positives
  • for a P-value threshold of 0.05/10,000 for each
    test, 0.05 false positives
56
10,000 genes
If all 10,000 genes were not differentially
expressed, then we would expect:
  • for a P-value threshold of 0.05 for each test,
    500 false positives
  • for a P-value threshold of 0.05/10,000 for each
    test, 0.05 false positives
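The arithmetic behind these two numbers is the number of null genes times the per-test threshold. A short sketch, in which the function name is hypothetical and the threshold 0.05/10,000 is recognized as the Bonferroni-adjusted level:

```python
def expected_false_positives(n_null_genes, per_test_threshold):
    """Expected number of false positives when every gene is null.

    Under the null hypothesis a P-value falls below the threshold with
    probability equal to the threshold, so the expectation is a product.
    """
    return n_null_genes * per_test_threshold

n_genes = 10_000
alpha = 0.05

# Unadjusted testing at 0.05 per gene.
unadjusted = expected_false_positives(n_genes, alpha)

# Bonferroni-style threshold of 0.05 / 10,000 per gene.
bonferroni = expected_false_positives(n_genes, alpha / n_genes)
```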
57
False Discovery Rate (FDR)
The FDR emphasizes the proportion of false
positives among the identified differentially
expressed genes.
58
  
Possible Outcomes for N Hypothesis Tests
59
  
Possible Outcomes for N Hypothesis Tests
60
Controlling FDR
Benjamini and Hochberg (1995)
Key papers on controlling the FDR
  • Genovese and Wasserman (2002)
  • Storey (2002, 2003)
  • Storey and Tibshirani (2003a, 2003b)
  • Storey, Taylor and Siegmund (2004)
  • Black (2004)
  • Cox and Wong (2004)
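As an illustration of FDR control, here is a minimal sketch of the Benjamini and Hochberg (1995) step-up procedure: sort the P-values, find the largest rank k with p_(k) ≤ k·α/N, and reject the k smallest. The function name and the example P-values are hypothetical.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Benjamini-Hochberg step-up procedure.

    Returns a list of booleans, True where the hypothesis is rejected,
    in the same order as `p_values`.
    """
    n = len(p_values)
    order = sorted(range(n), key=lambda i: p_values[i])
    # Largest 1-based rank k such that p_(k) <= k * alpha / n.
    max_k = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank * alpha / n:
            max_k = rank
    rejected = [False] * n
    for idx in order[:max_k]:
        rejected[idx] = True
    return rejected

# Made-up P-values for ten genes.
p_values = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
rejected = benjamini_hochberg(p_values, alpha=0.05)
```

With these values only the two smallest P-values are rejected, whereas an unadjusted 0.05 threshold would call five genes significant.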

61
q-VALUE
The q-value of gene j is the expected proportion
of false positives when calling that gene
significant.
The P-value is the probability, under the null
hypothesis, of obtaining a value of the test
statistic as or more extreme than its observed
value.
The q-value for an observed test statistic can be
viewed as the expected proportion of false
positives among all genes with test statistics as
or more extreme than the observed value.
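A rough sketch of how q-values can be computed from P-values, in the spirit of Storey's approach. It assumes a single fixed tuning value `lam = 0.5` for estimating the proportion of null genes (published implementations choose or smooth over this value differently), and the function name and data are hypothetical.

```python
def q_values(p_values, lam=0.5):
    """Estimate q-values from P-values (simplified, fixed-lambda sketch).

    pi0 (proportion of null genes) is estimated from the P-values above
    `lam`, where mostly null genes are expected. Returns (pi0, q) with q
    in the same order as `p_values`.
    """
    n = len(p_values)
    pi0 = min(1.0, sum(p > lam for p in p_values) / (n * (1.0 - lam)))
    order = sorted(range(n), key=lambda i: p_values[i])
    q = [0.0] * n
    running_min = 1.0
    # Walk from the largest P-value down so that q is monotone in p.
    for rank in range(n, 0, -1):
        idx = order[rank - 1]
        estimate = pi0 * n * p_values[idx] / rank
        running_min = min(running_min, estimate)
        q[idx] = running_min
    return pi0, q

# Made-up P-values: six small ones and four that look null.
p_values = [0.001, 0.002, 0.003, 0.004, 0.005, 0.006, 0.6, 0.7, 0.8, 0.9]
pi0_hat, q = q_values(p_values)
```

Setting pi0 to 1 instead of estimating it gives the Benjamini-Hochberg adjusted P-values, which is the more conservative choice.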
62
LIST OF SIGNIFICANT GENES
Call all genes significant if p_j < 0.05,
or
call all genes significant if q_j < 0.05, to
produce a set of significant genes so that a
proportion of them (< 0.05) is expected to be
false (at least for a large number of genes, not
necessarily independent).
63
BRCA1 versus BRCA2-mutation positive tumours
(Hedenfalk et al., 2001)
BRCA1 (7) versus BRCA2-mutation (8) positive
tumours, p = 3226 genes
A P-value threshold of .001 gave 51 genes
differentially expressed
A P-value threshold of 0.0001 gave 9-11 genes
64
Using q < 0.05, 160 genes are taken to be
significant. This means that approx. 8 of these
160 genes (0.05 × 160) are expected to be false
positives. Also, it is estimated that 33% of the
genes are differentially expressed.
65
(No Transcript)
66
(No Transcript)
67
Two-component model
$f(t_j) = \pi_0 f_0(t_j) + \pi_1 f_1(t_j)$
$\pi_0$ is the proportion of genes that are not
differentially expressed, and
$\pi_1 = 1 - \pi_0$ is the proportion that are.
68
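Two-component model
$f(t_j) = \pi_0 f_0(t_j) + \pi_1 f_1(t_j)$
$\pi_0$ is the proportion of genes that are not
differentially expressed, and
$\pi_1 = 1 - \pi_0$ is the proportion that are.
Then
$\tau_0(t_j) = \pi_0 f_0(t_j) / f(t_j)$
is the posterior probability that gene j is not
differentially expressed.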
69
1) Form a statistic $t_j$ for each gene, where a
large positive value of $t_j$ corresponds to a
gene that is differentially expressed across the
tissues.
2) Compute the $P_j$-values according to the $t_j$
and fit a mixture of beta distributions
(including a uniform component) to them, where
the latter corresponds to the class of genes
that are not differentially expressed.
or
70
3) Fit to $t_1, \ldots, t_p$ a mixture of two
normal densities with a common variance, where
the first component has the smaller mean (it
corresponds to the class of genes that are not
differentially expressed). It is assumed that
the $t_j$ have been transformed so that they are
normally distributed (approximately).
4) Let $\hat{\tau}_0(t_j)$ denote the (estimated)
posterior probability that gene j belongs to the
first component of the mixture.
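Steps 3) and 4) can be sketched with a plain EM algorithm for a two-component normal mixture with a common variance. This is a minimal illustration on simulated statistics, not the fitting procedure used in the talk; the starting values (quartiles), the fixed iteration count, the function names and the simulated data are all assumptions made for the example.

```python
import math
import random

def normal_pdf(x, mu, sigma2):
    return math.exp(-(x - mu) ** 2 / (2.0 * sigma2)) / math.sqrt(2.0 * math.pi * sigma2)

def posterior_null(t, pi0, mu0, mu1, sigma2):
    """Posterior probability that each t_j belongs to the first component."""
    tau0 = []
    for x in t:
        a = pi0 * normal_pdf(x, mu0, sigma2)
        b = (1.0 - pi0) * normal_pdf(x, mu1, sigma2)
        tau0.append(a / (a + b))
    return tau0

def fit_two_normal_mixture(t, n_iter=300):
    """EM for a two-component normal mixture with a common variance."""
    n = len(t)
    s = sorted(t)
    # Crude starting values (an assumption): quartiles as the two means.
    mu0, mu1 = s[n // 4], s[(3 * n) // 4]
    mean_t = sum(t) / n
    sigma2 = sum((x - mean_t) ** 2 for x in t) / n
    pi0 = 0.5
    for _ in range(n_iter):
        # E-step: current posterior probabilities of the first component.
        tau0 = posterior_null(t, pi0, mu0, mu1, sigma2)
        # M-step: weighted proportion, means and common variance.
        n0 = sum(tau0)
        pi0 = n0 / n
        mu0 = sum(w * x for w, x in zip(tau0, t)) / n0
        mu1 = sum((1.0 - w) * x for w, x in zip(tau0, t)) / (n - n0)
        sigma2 = sum(w * (x - mu0) ** 2 + (1.0 - w) * (x - mu1) ** 2
                     for w, x in zip(tau0, t)) / n
    # Keep the convention that the first component has the smaller mean.
    if mu0 > mu1:
        mu0, mu1, pi0 = mu1, mu0, 1.0 - pi0
    tau0 = posterior_null(t, pi0, mu0, mu1, sigma2)
    return pi0, mu0, mu1, sigma2, tau0

# Simulated statistics: 800 null genes near 0 and 200 shifted genes near 4.
random.seed(0)
t_stats = [random.gauss(0.0, 1.0) for _ in range(800)] + \
          [random.gauss(4.0, 1.0) for _ in range(200)]
pi0_hat, mu0_hat, mu1_hat, sigma2_hat, tau0_hat = fit_two_normal_mixture(t_stats)
```

The list `tau0_hat` plays the role of the estimated posterior probabilities of step 4), one value per gene.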
71
If we conclude that gene j is differentially
expressed if
$\hat{\tau}_0(t_j) \le c_0$,
then this decision minimizes the (estimated)
Bayes risk
where
72
Estimated FDR
where
73
(No Transcript)