Title: Gene expression data: Questions, answers and statistics
1Gene expression data: Questions, answers and statistics
- Terry Speed and Yee Hwa Yang
- Department of Statistics, UC Berkeley
- Genetics and Bioinformatics, Walter and Eliza Hall Institute of Medical Research
2Overview
- Questions involving microarray data.
- Different experimental designs
- Case studies, including
- Olfactory epithelium,
- Olfactory bulb,
- Identification of differentially expressed genes,
- Pattern searching.
3Questions and answers: a point of view
- Biological questions first, then statistical methods (design, analysis) and thinking, leading to tentative answers, together with an assessment of the uncertainty in those answers
- Rather than beginning with
- Purely exploratory analyses, or modelling either processes or data
- Something of each of the last two comes into most statistical analyses, but only after focussing on biological questions
4Biological question: differentially expressed genes, sample class prediction, etc.
Experimental design
Microarray experiment
16-bit TIFF files
Image analysis
(Rfg, Rbg), (Gfg, Gbg)
Normalization
R, G
Estimation
Testing
Clustering
Discrimination
Biological verification and interpretation
5Which genes are (relatively) up/down regulated?
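- Samples: liver tissue from each of two kinds of mice, e.g. KO vs. WT, or mutant vs. WT
[Design diagram: T compared with C, n slides]
- For each gene form the t statistic
$$t = \frac{\text{average of } n \text{ trt } M\text{s}}{\sqrt{\frac{1}{n}\,(\text{SD of } n \text{ trt } M\text{s})^2}}$$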
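The statistic above can be sketched for a single gene as follows. This is only an illustration: the function name and the example values are hypothetical, and the use of the sample SD (n - 1 denominator) is an assumption, since the slide does not say which SD is meant.

```python
import math
import statistics

def one_sample_t(ms):
    """t statistic for one gene: mean of the n log-ratios over its standard error."""
    n = len(ms)
    mean = statistics.mean(ms)
    sd = statistics.stdev(ms)  # assumption: sample SD with n - 1 denominator
    return mean / math.sqrt(sd ** 2 / n)

# Hypothetical log-ratios M for one gene on n = 4 slides (not real data).
ms = [1.0, 2.0, 3.0, 2.0]
t = one_sample_t(ms)
print(round(t, 3))
```

Genes would then be ranked by |t|, with large values suggesting up- or down-regulation.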
6Which genes are (relatively) up/down regulated?
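- Samples: as before, but also pooled control liver tissue
[Design diagram: T vs. pooled C and C vs. pooled C, n slides each]
- For each gene form the t statistic
$$t = \frac{\text{average of } n \text{ trt } M\text{s} - \text{average of } n \text{ ctl } M\text{s}}{\sqrt{\frac{1}{n}\left[(\text{SD of } n \text{ trt } M\text{s})^2 + (\text{SD of } n \text{ ctl } M\text{s})^2\right]}}$$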
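A minimal sketch of the two-sample version for one gene, assuming equal numbers of treatment and control slides and sample SDs (n - 1 denominator). The function name and the numbers are hypothetical.

```python
import math
import statistics

def two_sample_t(trt, ctl):
    """Difference of mean log-ratios over sqrt((SD_trt^2 + SD_ctl^2) / n)."""
    n = len(trt)
    if len(ctl) != n:
        raise ValueError("this sketch assumes n treatment and n control slides")
    diff = statistics.mean(trt) - statistics.mean(ctl)
    se = math.sqrt((statistics.variance(trt) + statistics.variance(ctl)) / n)
    return diff / se

# Hypothetical log-ratios for one gene (not real data).
trt = [1.0, 2.0, 3.0, 2.0]
ctl = [0.0, 1.0, 0.0, 1.0]
t2 = two_sample_t(trt, ctl)
print(round(t2, 3))
```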
7Multiple comparisons of interest
[Design diagram: treatments T1, T2, T3 and T4, each compared with control C, x 2 slides each]
- Samples: liver tissue from mice treated with cholesterol-modifying drugs.
- Question 1: find genes that respond differently between the treatment and the control.
- Question 2: find genes that respond similarly across two or more treatments relative to control.
8Interaction?
- Samples: treated cell lines at 4 time points (30 minutes, 1 hour, 4 hours, 24 hours)
- Question: which genes contribute to the enhanced inhibitory effect of OSM when it is combined with EGF? Role of time?
[Design diagram: ctl, OSM, EGF and OSM + EGF, x 4 times]
9Gene Expression Data
- Gene expression data on genes 1, 2, 3, 4, 5, ... for 5 slides

Genes   Slide (experiment)
        slide1  slide2  slide3  slide4  slide5
1        0.46    0.30    0.80    1.51    0.90
2       -0.10    0.49    0.24    0.06    0.46
3        0.15    0.74    0.04    0.10    0.20
4       -0.45   -1.03   -0.79   -0.56   -0.32
5       -0.06    1.06    1.35    1.09   -1.09

Gene expression level of gene i on slide j = Log2(Red intensity / Green intensity)
Sometimes a common reference, e.g. green, sometimes not.
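As a small illustration of the definition above, the log-ratio for one spot can be computed as follows. The function names and intensities are hypothetical. The second function gives the average log intensity, log √(RG), that the zone-effect plot later uses as its horizontal axis; taking that logarithm to base 2 is an assumption.

```python
import math

def log_ratio(red, green):
    """M = log2(R / G): positive when the red channel is brighter."""
    return math.log2(red / green)

def average_log_intensity(red, green):
    """A = log2(sqrt(R * G)); base 2 is an assumption here."""
    return 0.5 * math.log2(red * green)

# Hypothetical background-corrected intensities for two spots (not real data).
m_up = log_ratio(800.0, 200.0)    # red four times green
m_down = log_ratio(100.0, 200.0)  # red half of green
print(m_up, m_down)
```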
10Molecular development of sensory maps
11Olfactory epithelium
- GOAL: exploratory study to identify genes with altered expression between zone 1 and zone 4 of the olfactory epithelium for newborn (P0) and adult (A) mice.
- Tissue samples
- P01: Zone 1 of epithelium from P0 mouse.
- P04: Zone 4 of epithelium from P0 mouse.
- A1: Zone 1 of epithelium from adult mouse.
- A4: Zone 4 of epithelium from adult mouse.
- Probes: 19,000 mouse cDNAs.
12Red stained region is the olfactory epithelium
13Factorial Design as completed
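[Design diagram: samples P01, P04, A1 and A4 joined by five numbered hybridizations (1 to 5), with the age effect and zone effect directions marked]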
14Layout of the cDNA microarrays
- Made in Ngai lab, UC Berkeley
- Mouse ESTs, 19,200 spots.
- Two different print groups, each with
- 4 x 4 grids, each with
- 25 x 24 spots
- Controls on the first 2 rows of each grid.
[Figure: array layout showing print groups pg1 and pg2]
15Two slides
P04 vs. P01 (pg2)
A1 vs. P01 (pg2)
16Preprocessing - Image Analysis
1. Addressing: locate centers.
2. Segmentation: classification of pixels as either signal or background (using seeded region growing).
3. Information extraction: for each spot of the array, calculate signal intensity pairs, background and quality measures.
Results from SRG from P04 vs. P01
17Preprocessing after image analysis
- Where necessary, we carry out
- Colour normalization (location and scale) within slides, possibly within pin-groups,
- Scale normalization between slides,
- A variety of other adjustments, e.g. to remove spatial artifacts.
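One very simple form of location normalization within pin-groups is to subtract each group's median log-ratio. The sketch below shows only that idea; it is not claimed to be the procedure used for these slides, and the function name and data are hypothetical.

```python
import statistics
from collections import defaultdict

def median_center_by_group(ms, groups):
    """Subtract the median log-ratio of each pin-group from that group's spots."""
    by_group = defaultdict(list)
    for m, g in zip(ms, groups):
        by_group[g].append(m)
    medians = {g: statistics.median(values) for g, values in by_group.items()}
    return [m - medians[g] for m, g in zip(ms, groups)]

# Hypothetical log-ratios and pin-group labels for six spots (not real data).
ms = [1.0, 2.0, 3.0, 10.0, 12.0, 14.0]
groups = [1, 1, 1, 2, 2, 2]
normalized = median_center_by_group(ms, groups)
print(normalized)
```

After centering, each pin-group has median zero, so a systematic colour bias that differs between pins no longer looks like differential expression.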
18Factorial design
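Expected log intensities: P01 = $m$, A1 = $m + a$, P04 = $m + z$, A4 = $m + z + a + za$.
Different ways of estimating parameters, e.g. the Z effect:
- slide 1: $(m + z) - (m) = z$
- slide 2 - slide 5: $((m + a) - (m)) - ((m + a) - (m + z)) = (a) - (a - z) = z$
- a combination of slides 4, 3 and 5 also gives $z$
[Design diagram: P01, P04, A1 and A4 joined by hybridizations 1 to 5]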
How do we combine the information?
19Regression analysis
Define a matrix $X$ so that $E(M) = X\beta$ (see below). Use least squares estimates of z, a and za for each gene.
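A minimal sketch of such a fit, assuming NumPy is available. The rows for slides 1, 2 and 5 follow the expressions on the "Factorial design" slide; the rows for slides 3 and 4 are hypothetical assignments chosen only to complete the example, and the data are simulated without noise.

```python
import numpy as np

# Columns: z, a, za. One row per slide (hybridization).
X = np.array([
    [ 1, 0, 0],   # slide 1: P04 - P01 = z
    [ 0, 1, 0],   # slide 2: A1 - P01 = a
    [ 1, 0, 1],   # slide 3 (assumed): A4 - A1 = z + za
    [ 1, 1, 1],   # slide 4 (assumed): A4 - P01 = z + a + za
    [-1, 1, 0],   # slide 5: A1 - P04 = a - z
], dtype=float)

def fit_effects(M):
    """M: genes x slides array of log-ratios; returns genes x 3 array (z, a, za)."""
    beta, *_ = np.linalg.lstsq(X, M.T, rcond=None)
    return beta.T

# Two hypothetical genes with known effects, observed without noise.
true = np.array([[1.0, -0.5, 0.25],
                 [0.0,  2.0, -1.0]])
M = true @ X.T
est = fit_effects(M)
print(est)
```

Least squares combines the direct and indirect estimates of each effect into one estimate per gene, which is one answer to the question of how to combine the information.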
20Estimates of zone effects, log(zone 4 / zone 1), vs average A
[Plot: estimated zone effect against average log √(RG); gene A and gene B highlighted]
21Estimates of zone effects vs SE
[Plot axes: Z effect, Log2(SE)]
22Estimates of age effects vs estimates of zone effects
[Plot labels: Zone, Age, Zone x Age]
23Top 50 genes from each effect
[Venn diagram: overlap of the top 50 genes for the Zone, Age and Zone x Age interaction effects; region counts shown: 19, 0, 48, 29, 2, 0, 19]
24In situ hybridization image
Gene A (up-regulated in zone 4)
25Gene B (up-regulated in zone 1)
27Continuation: the mouse olfactory bulb
28A one-year-old statement by our collaborator
- Comparison of large regions of olfactory bulb fails to yield molecular differences.
- Molecules involved in target recognition may be expressed in a limited subset of cells.
- A new approach is required that possesses high sensitivity and throughput of analysis.
29The olfactory bulb experiments
[Design diagram: olfactory bulb regions M, A, V, D, P and L]
- Samples: tissues from different regions of the olfactory bulb.
- Question 1: differences between different regions.
- Question 2: identify genes with pre-specified patterns across regions.
- Note: novel design (controversial?)
30Regression analysis
Define a matrix $X$ so that $E(M) = X\beta$. Use least squares estimates for A-L, P-L, D-L, V-L, M-L.
31Contrasts
- We can estimate all 15 different comparisons directly and/or indirectly, e.g. D - M = (D - L) - (M - L)
- For every gene we have a pattern based on the 15 different comparisons, e.g. gene 5699.
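The 15 comparisons follow from the five estimated differences against L, as in this sketch. The function name and the estimates are hypothetical; only the identity D - M = (D - L) - (M - L) comes from the slide.

```python
from itertools import combinations

REGIONS = ["A", "P", "D", "V", "M", "L"]

def all_contrasts(vs_l):
    """vs_l maps region -> estimate of (region - L); L - L is 0 by definition."""
    level = dict(vs_l, L=0.0)
    return {f"{r1}-{r2}": level[r1] - level[r2]
            for r1, r2 in combinations(REGIONS, 2)}

# Hypothetical estimates of A-L, P-L, D-L, V-L, M-L for one gene (not real data).
est_vs_l = {"A": 0.5, "P": -0.25, "D": 1.0, "V": 0.0, "M": 0.25}
pattern = all_contrasts(est_vs_l)
print(len(pattern), pattern["D-M"])
```

Six regions give 6 x 5 / 2 = 15 pairwise comparisons, which together form the gene's pattern.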
32Genes that share the same pattern
- Find genes with smallest Euclidean distance to gene 5699 (what that gene is, is another story).
- The second gene is a replicate of the first.
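A minimal sketch of this kind of pattern search, assuming each gene's pattern is stored as a list of comparison estimates. The gene names and the short three-element patterns are hypothetical stand-ins for the real 15-element patterns; `math.dist` needs Python 3.8 or later.

```python
import math

def nearest_genes(patterns, target, k=3):
    """Return the k genes whose patterns are closest (Euclidean) to the target's."""
    ref = patterns[target]
    dists = [(math.dist(ref, p), gene)
             for gene, p in patterns.items() if gene != target]
    return [gene for _, gene in sorted(dists)[:k]]

# Hypothetical patterns, shortened to three comparisons each (not real data).
patterns = {
    "target": [1.0, 0.0, -1.0],
    "near":   [1.1, 0.0, -0.9],
    "mid":    [0.5, 0.5, -0.5],
    "far":    [-1.0, 2.0, 1.0],
}
closest = nearest_genes(patterns, "target", k=2)
print(closest)
```

A replicate spot of the target gene would be expected near the top of such a list, which serves as a sanity check on the search.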
36How the question got refined
- After the design and carrying out of the experiment, and the initial analysis and follow-up in situ hybridizations to confirm our findings, we realized we had failed to perceive the most interesting question, which was:
- Find genes whose expression patterns show (spatial) restriction across the bulb, i.e. not just gradients (differential expression), but localization.
38Acknowledgments
- Statistical collaborators
- Yee Hwa Yang (Berkeley)
- Sandrine Dudoit (Stanford)
- Ingrid Lönnstedt (Uppsala)
- Natalie Thorne (WEHI)
- CSIRO Image Analysis Group
- Michael Buckley
- Ryan Lagerstrom
- Ngai Lab (Berkeley)
- Cynthia Duggan
- Jonathan Scolnick
- Dave Lin
- Vivian Peng
- Percy Luu
- Elva Diaz
- John Ngai
- LBNL
- Matt Callow
39Some web sites
- Technical reports, talks, software etc.: http://www.stat.berkeley.edu/users/terry/zarray/Html/
- Statistical software R ("GNU's S"): http://lib.stat.cmu.edu/R/CRAN/
- Packages within R environment
- Spot: http://www.cmis.csiro.au/iap/spot.htm
- SMA (statistics for microarray analysis): http://www.stat.berkeley.edu/users/terry/zarray/Html