Title: Introduction to Microarray Gene Expression
1Introduction to Microarray Gene Expression
- Shyamal D. Peddada
- Biostatistics Branch, National Inst. Environmental Health Sciences (NIH)
- Research Triangle Park, NC
2Outline of the four talks
- A general overview of microarray data
- Some important terminology and background
- Various platforms
- Sources of variation
- Normalization of data
- Analysis of gene expression data: nominal explanatory variables
- Two types of explanatory variables
- Scientific questions of interest
- A brief discussion of false discovery rate (FDR) analysis
- Some existing methods of analysis
3Outline of the four talks
- Analysis of ordered gene expression data
- Common experimental designs
- Some existing statistical methods
- An example
- Demonstration of ORIOGEN
- Some open research problems
- Analysis of data from cell-cycle experiments
- Some background on cell-cycle experiments
- Modeling the data
- Data from multiple experiments
- Some open research problems
4Talk 1: An overview of microarray data
5To perform statistical analysis of any given data
- It is important to understand all sources of (i) bias and (ii) variability.
- Have some basic understanding of the underlying technology!
- Understand the sampling/experimental design.
6Some Important Terminology and Background
7Central Dogma of Molecular Biology
8Some background terminology: DNA and RNA
- DNA (deoxyribonucleic acid): contains the genetic code, or instructions, for the development and function of living organisms. It is double stranded.
- Four nucleotides (building blocks of DNA):
- Adenine (A), Guanine (G),
- Thymine (T), Cytosine (C)
- Base pairs: (A, T), (G, C)
- E.g. 5' ---AAATGCAT--- 3'
- 3' ---TTTACGTA--- 5'
9Some background terminology: DNA and RNA
- RNA (ribonucleic acid): transcribed (or copied) from DNA. It is single stranded (a complementary copy of one of the strands of DNA).
- RNA polymerase: an enzyme that helps in the transcription of DNA to form RNA.
- Four nucleotides (building blocks of RNA):
- Adenine (A), Guanine (G),
- Uracil (U), Cytosine (C)
- Base pairs: (A, U), (G, C)
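The base-pairing rules on these two slides can be sketched in a few lines of Python. This is only an illustration: the function names are made up for this example, and strand orientation (5' to 3') is tracked in comments rather than in code.

```python
# Pairing rules from the slides: (A, T), (G, C) in DNA; (A, U), (G, C) for RNA.
DNA_COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}
DNA_TO_RNA = {"A": "U", "T": "A", "G": "C", "C": "G"}

def complement_strand(strand):
    """Return the paired DNA strand, base by base (no reversal)."""
    return "".join(DNA_COMPLEMENT[base] for base in strand)

def transcribe(template):
    """Return the RNA that is complementary to a DNA template strand."""
    return "".join(DNA_TO_RNA[base] for base in template)

top = "AAATGCAT"                 # written 5' -> 3', as on the DNA slide
bottom = complement_strand(top)  # paired strand, written 3' -> 5'
rna = transcribe(bottom)         # RNA copied from the bottom strand, 5' -> 3'
print(bottom, rna)
```

Note that the RNA copied from the bottom strand has the same sequence as the top strand, with U in place of T.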
10Some background terminology: Types of RNA
- Types of RNA: (transfer) tRNA, (ribosomal) rRNA, etc.
- mRNA: messenger RNA. Carries information from DNA to ribosomes, where protein synthesis takes place (less stable than DNA).
11Some background terminology: Oligos
- Oligonucleotide: a short segment of DNA consisting of a few base pairs. It is commonly called an oligo for short.
- mer: unit of measurement for an oligo. It is the number of base pairs, so a 30 base pair oligo is a 30-mer.
12Some background terminology: Probes
- cDNA: complementary DNA. A DNA sequence that is complementary to the given mRNA.
- Obtained using an enzyme called reverse transcriptase.
- Probes: a short segment of DNA (about 100-mer or longer) used to detect DNA or RNA that complements the sequence present in the probe.
13Some background terminology: Blots - Origins of Microarrays
- Southern blot (Edwin Southern, 1975, J. Molec. Biol.)
- A method used to identify the presence of a DNA sequence in a sample of DNA.
- Western blot (immunoblot)
- Used to identify a specific protein from a tissue extract.
14Some background terminology
- Southwestern blot
- Used to identify and characterize DNA-binding proteins.
- Northern blot
- A method used to study gene expression from a sample of mRNA.
15Microarrays
16Northern blot Vs Microarray
Feature | Microarray | Northern blot
Rate of expression analysis | Thousands of genes at a time (high throughput) | Few genes at a time
Automation | Automation possible | Manual
Scope | Allows exploring relationships among several hundred genes at the same time | Limited
17What is a Microarray?
- Sequences from thousands of different genes are immobilized, or attached, at fixed locations.
- The sequences are spotted, or actually synthesized directly, onto the support.
18Microarray Technology
- Two color dye array (Spotted array)
- Spotted cDNA microarrays
- Spotted oligo microarrays
- Single dye array
- In situ oligo microarrays
19Microarray Technology
20Spotted Microarrays
21Spotted DNA Microarray
- Slides carrying spots of target DNA are hybridized to fluorescently labeled cDNA from experimental and control cells, and the arrays are imaged at two or more wavelengths.
- Expression profiling involves the hybridization of fluorescently labeled cDNA, prepared from cellular mRNA, to microarrays carrying thousands of unique sequences.
22Spotted DNA Microarray
- A spotted DNA array is typically home made, so you need to think about:
- cDNA or oligo
- Location of the oligo in a given gene
- Oligo length: number of bp?
23Spotted DNA Microarray
- Gene expression
- Y < 0: gene is overexpressed in the green-labeled sample compared to the red-labeled sample
- Y = 0: gene is equally expressed in both samples
- Y > 0: gene is overexpressed in the red-labeled sample compared to the green-labeled sample
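The transcript does not carry the definition of Y. A common convention that matches the sign rules above is Y = log2(R / G), where R and G are the red and green intensities of a spot. The sketch below assumes that definition. The function names and the tolerance are illustrative only.

```python
import math

def log_ratio(red, green):
    """Assumed definition: Y = log2(red / green), for positive intensities."""
    return math.log2(red / green)

def interpret(y, tol=1e-9):
    """Map the sign of Y to the verbal rule on the slide."""
    if y > tol:
        return "overexpressed in red-labeled sample"
    if y < -tol:
        return "overexpressed in green-labeled sample"
    return "equally expressed"

for red, green in [(800, 200), (200, 800), (500, 500)]:
    y = log_ratio(red, green)
    print(red, green, y, interpret(y))
```

Working on the log scale makes a 4-fold increase (Y = 2) and a 4-fold decrease (Y = -2) symmetric around zero.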
24Single Dye Microarrays
25Major Commercial Platforms
- More than 50 companies currently offer various DNA microarray platforms, reagents and software.
- Affymetrix dominated the market for many years.
- Agilent has one-color and two-color microarray platforms.
26Affymetrix GeneChip
- Each gene is represented by 11 to 20 oligos of 25-mers
- Probe: a 25-mer oligo
- Probe pair: a PM and MM pair
- Perfect match (PM): a 25-mer complementary to a reference sequence of interest (part of the gene)
- Mismatch (MM): same as PM with a single base change at the middle (13th) base (G <-> C, A <-> T)
- Probe set: a collection of probe pairs (11 to 20) related to a fraction of a gene
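The PM/MM relationship can be illustrated with a short sketch. The probe sequence below is invented for the example and does not come from any real probe set. The function name is hypothetical.

```python
# Middle-base substitution rule from the slide: G <-> C, A <-> T.
MIDDLE_SWAP = {"A": "T", "T": "A", "G": "C", "C": "G"}

def mismatch_probe(pm):
    """Build the MM probe by changing only the 13th base of a 25-mer PM probe."""
    if len(pm) != 25:
        raise ValueError("expected a 25-mer")
    middle = 12  # 13th base, using 0-based indexing
    return pm[:middle] + MIDDLE_SWAP[pm[middle]] + pm[middle + 1:]

pm = "ACGTACGTACGT" + "G" + "ACGTACGTACGT"  # made-up 25-mer
mm = mismatch_probe(pm)
differences = [i for i in range(25) if pm[i] != mm[i]]
print(pm)
print(mm)
print(differences)
```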
27Affymetrix call for the presence of a signal
- The Affymetrix detection algorithm uses probe pair intensities to obtain a detection p-value.
- Using this p-value, they decide whether the signal is present, marginal or absent.
28Affy call
- Detection p-value
- Calculate Kendall's tau T for each probe pair
- T = (PM - MM) / (PM + MM)
- Determine the statistical significance of the gene by computing the p-value.
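A minimal sketch of how such a call could be computed follows. It assumes the per-pair scores are compared with a small threshold using a one-sided signed-rank test, which is how Affymetrix documentation describes the detection algorithm. The threshold tau = 0.015 and the cutoffs 0.04 and 0.06 are treated here as illustrative assumptions, not as values taken from the slides. The exact p-value is obtained by brute-force enumeration, so this is a teaching aid rather than the vendor implementation.

```python
from itertools import product

def pair_scores(pm, mm):
    """T = (PM - MM) / (PM + MM) for each probe pair."""
    return [(p - m) / (p + m) for p, m in zip(pm, mm)]

def midranks(values):
    """Ranks of values (1 = smallest), averaging the ranks of tied values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks

def detection_p_value(scores, tau=0.015):
    """Exact one-sided signed-rank p-value for 'scores tend to exceed tau'.

    Enumerates all 2**n sign patterns, so it suits small probe sets only.
    """
    diffs = [s - tau for s in scores if s != tau]  # zero differences dropped
    if not diffs:
        return 1.0
    ranks = midranks([abs(d) for d in diffs])
    observed = sum(r for r, d in zip(ranks, diffs) if d > 0)
    as_extreme = 0
    for signs in product((0, 1), repeat=len(ranks)):
        w_plus = sum(r for r, s in zip(ranks, signs) if s)
        if w_plus >= observed - 1e-9:
            as_extreme += 1
    return as_extreme / 2 ** len(ranks)

def detection_call(p_value, alpha1=0.04, alpha2=0.06):
    """Assumed cutoffs: present below alpha1, marginal below alpha2."""
    if p_value < alpha1:
        return "present"
    if p_value < alpha2:
        return "marginal"
    return "absent"

p_strong = detection_p_value(pair_scores([1000] * 11, [200] * 11))
p_none = detection_p_value(pair_scores([300] * 11, [300] * 11))
print(p_strong, detection_call(p_strong))
print(p_none, detection_call(p_none))
```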
29Affy call
Ref: Affymetrix Technical Manual
30Affymetrix Vs Illumina
Ref: Pan Du and Simon Lin
32Which Platform to Choose?
- Every platform has its own unique features
- Choose platform based on
- Nature of the study
- Amount of available RNA
- Cost
- Platform comparison in MAQC study
33MAQC Project
- Objective: to generate a set of quality control tools for the microarray research community
- 137 participants representing 51 organizations
- Gene expression from two distinct RNA samples (total 4 samples)
- Sample A: Universal Human Reference RNA (UHRR), 100%
- Sample B: Human Brain Reference RNA (HBRR), 100%
- Sample C: 75% UHRR, 25% HBRR
- Sample D: 25% UHRR, 75% HBRR
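One reason a titration design like this is useful: if signal were exactly proportional to the amount of RNA (an idealization assumed here, not a claim from the slides), the expected signal in samples C and D would be a weighted average of the signals in A and B. The function name and the numbers are hypothetical.

```python
def expected_mixture_signal(signal_a, signal_b, fraction_a):
    """Expected raw-scale signal for a mix with the given fraction of sample A."""
    return fraction_a * signal_a + (1 - fraction_a) * signal_b

# Made-up raw intensities for one gene in the two pure samples
signal_a, signal_b = 100.0, 200.0
expected_c = expected_mixture_signal(signal_a, signal_b, 0.75)  # 75% A, 25% B
expected_d = expected_mixture_signal(signal_a, signal_b, 0.25)  # 25% A, 75% B
print(expected_c, expected_d)
```

Comparing observed C and D values with such expectations is one simple way to check a platform's accuracy.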
34Microarray Data Analysis
35Why Normalize Data?
- To calibrate/adjust data so as to reduce or
eliminate the effects arising from variation in
technology and other sources rather than due to
true biological differences between test groups.
36Sources of bias/variation
- Tissue or cell lines
- mRNA
- It can degrade over time, so there is a potential batch effect if portions of the experiment are performed at different times
- Purity and quantity
- Dye color effect (spotted arrays)
- Variation due to technology: substantially reduced with improved technology
- Etc.
37A useful graphical representation of data
38A useful graphical representation of data
- Let its spectral decomposition be given by [equation not recovered from the slide], where [definitions not recovered from the slide].
39A useful graphical representation of data
40Common Normalization Methods
- Internal Control Normalization
- Global Normalization
- Linear Normalization (Spotted arrays)
- Non-linear Normalization Method (Spotted arrays)
- LOWESS curve
- ANOVA
- ComBat (for batch effect)
41Internal control normalization (Housekeeping gene(s))
- Expression of each gene is measured relative to the average of the housekeeping genes.
- Basic assumption: expression of housekeeping genes does not change.
- Disadvantage
- Housekeeping genes may sometimes be highly expressed. Unexpected regulation of housekeeping gene(s) leads to misinterpretation.
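As a rough sketch of internal control normalization, the code below subtracts the average log2 expression of the housekeeping genes from every gene on the same array. Working on the log2 scale is an assumption made for this example. The intensities are invented, and GENE_X is a placeholder name.

```python
import math

def normalize_to_housekeeping(expression, housekeeping):
    """Return log2 expression relative to the mean log2 of housekeeping genes."""
    reference = sum(math.log2(expression[g]) for g in housekeeping) / len(housekeeping)
    return {gene: math.log2(value) - reference for gene, value in expression.items()}

# Made-up raw intensities from one array
array = {"GAPDH": 1024, "ACTB": 256, "GENE_X": 2048}
relative = normalize_to_housekeeping(array, ["GAPDH", "ACTB"])
print(relative)
```

If a housekeeping gene is itself regulated by the treatment, the reference shifts and every normalized value shifts with it, which is the disadvantage noted above.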
42Global Normalization
- Basic assumption
- Mean/median expression ratio of all monitored mRNAs is constant across a chip.
- Regression of the log ratio on a constant
- In simple terms, the log ratios are corrected by a common mean or median.
- This method can also be applied to single-dye data.
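A minimal sketch of global normalization for a two-color array, assuming the log ratio is log2(R / G) and that the median is used as the common correction. The function name and intensities are illustrative.

```python
import math
from statistics import median

def global_normalize(red, green):
    """Subtract the chip-wide median log2 ratio from every spot's log2 ratio."""
    log_ratios = [math.log2(r / g) for r, g in zip(red, green)]
    shift = median(log_ratios)
    return [y - shift for y in log_ratios]

# Made-up intensities for four spots on one chip
red = [400, 800, 1600, 200]
green = [200, 200, 200, 200]
normalized = global_normalize(red, green)
print(normalized)
```

After the correction, the median log ratio on the chip is zero, which encodes the assumption that most genes are not differentially expressed.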
43Linear Normalization (for spotted arrays)
- Basic assumption
- Mean/median expression ratio of all monitored mRNAs depends upon the average intensity.
- Regression of the log ratio on the average intensity (straight-line fit)
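Linear normalization can be sketched as an ordinary least-squares fit. The symbols are assumptions introduced for this example: M = log2(R / G) is the log ratio and A = (log2 R + log2 G) / 2 is the average log intensity of a spot. The normalized value is the residual from the fitted line.

```python
def linear_normalize(m_values, a_values):
    """Fit M = intercept + slope * A by least squares and return the residuals."""
    n = len(m_values)
    a_bar = sum(a_values) / n
    m_bar = sum(m_values) / n
    sxx = sum((a - a_bar) ** 2 for a in a_values)
    sxy = sum((a - a_bar) * (m - m_bar) for a, m in zip(a_values, m_values))
    slope = sxy / sxx
    intercept = m_bar - slope * a_bar
    return [m - (intercept + slope * a) for a, m in zip(a_values, m_values)]

# Made-up data with a purely linear intensity-dependent dye bias
a_values = [8.0, 10.0, 12.0, 14.0]
m_values = [0.5 + 0.1 * a for a in a_values]
residuals = linear_normalize(m_values, a_values)
print(residuals)
```

When the bias really is a straight line in A, the residuals are zero up to rounding error.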
44Non-Linear Normalization (for spotted arrays)
- Basic assumption
- Mean/median expression ratio of all monitored mRNAs depends upon the average intensity.
- Regression of the log ratio on a smooth function of the average intensity
- The smooth function is estimated by the robust scatter plot smoother LOWESS (LOcally WEighted Scatterplot Smoothing).
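The idea behind LOWESS normalization can be illustrated with a simplified smoother: at each spot, fit a straight line to nearby spots using tricube weights, and subtract the fitted value. This sketch omits the robustness iterations of full LOWESS, so it is a stand-in rather than the real algorithm. M (log ratio), A (average log intensity) and the parameter `frac` (fraction of spots used in each local fit) are names assumed for this example.

```python
def local_fit(x0, xs, ys, frac):
    """Tricube-weighted straight-line fit around x0; x0 must be one of xs."""
    n = len(xs)
    k = min(n, max(2, round(frac * n)))  # number of neighbours in the window
    bandwidth = sorted(abs(x - x0) for x in xs)[k - 1] or 1e-12
    weights = [(1 - min(abs(x - x0) / bandwidth, 1.0) ** 3) ** 3 for x in xs]
    total = sum(weights)
    x_bar = sum(w * x for w, x in zip(weights, xs)) / total
    y_bar = sum(w * y for w, y in zip(weights, ys)) / total
    sxx = sum(w * (x - x_bar) ** 2 for w, x in zip(weights, xs))
    if sxx == 0:
        return y_bar  # only one point carries weight
    sxy = sum(w * (x - x_bar) * (y - y_bar) for w, x, y in zip(weights, xs, ys))
    return y_bar + (sxy / sxx) * (x0 - x_bar)

def smooth_normalize(m_values, a_values, frac=0.5):
    """Return M minus the locally fitted curve evaluated at each spot's A."""
    return [m - local_fit(a, a_values, m_values, frac)
            for a, m in zip(a_values, m_values)]

# Made-up data: with a purely linear bias the local fits recover it exactly
a_values = [6.0, 7.0, 8.0, 9.0, 10.0, 11.0, 12.0, 13.0]
m_values = [0.2 + 0.1 * a for a in a_values]
normalized = smooth_normalize(m_values, a_values)
print(normalized)
```

In practice an established implementation with robustness iterations would be preferable to this sketch.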
45Analysis of Variance (ANOVA)
- Standard Analysis of Variance model
- Response variable - Gene expression
- Explanatory variables
- Dye color
- Batch
- Other potential effects?
- Advantage: statistically significant genes can be identified while controlling for the various experimental conditions/factors.
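The slides do not spell out the model. One way such a per-gene model might be written, with all symbols introduced here as assumptions, is shown below: y is the (log) expression of the gene, D_i the dye effect, B_j the batch effect, T_k the treatment group effect, and epsilon the random error.

```latex
y_{ijk} = \mu + D_i + B_j + T_k + \varepsilon_{ijk},
\qquad \varepsilon_{ijk} \sim N(0, \sigma^2)
```

Under this formulation, a gene is declared significant when the treatment effects T_k differ, after adjusting for dye and batch.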
46Some important experimental designs
- Pooled Samples versus Separate samples
- Sometimes there may not be sufficient biological
sample/specimen from a given animal. In such
cases biological samples are pooled from several
identical animals to form a sample.
47An example of a pooling design (for each treatment group)
Subjects | Pool | Observations (microarray chips)
48The pooling design
Design | Subjects | Pools | Observations (microarray chips)
Example | 9 | 3 (3 per pool) | 6
More generally | n | p (r = n/p per pool) | m
49The standard design
Design | Subjects | Pools | Observations (microarray chips)
Example | 9 | 9 (r = 1) | 9
More generally | n | p = n (r = 1) | m = n
50Some issues
- What are the underlying parameters?
- Effect of pooling on power.
- The basic assumption. Validity of the assumption.
51Parameters
- Total variation in the expression of a gene can be decomposed into:
- Biological variation
- Technical variation
- Biological samples (n)
- Number of pools (p)
- Biological samples per pool (r = n/p)
- Observed number of samples (e.g. microarrays) (m)
52Some comments about pooling
- Variance of the estimated mean expression of a gene depends on:
- number of pools (p)
- number of biological samples per pool (r)
- number of arrays (m)
- biological variation
- technical variation
- Pooling works well when the biological variation in the gene expression is substantially larger than the technical variation.
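To make the dependence concrete, here is a sketch under simplifying assumptions that are not stated on the slide: biological averaging holds exactly, arrays are spread evenly over the pools, and all errors are independent. Under those assumptions the variance of the estimated mean is sigma_b^2 / (p * r) + sigma_t^2 / m, where sigma_b^2 and sigma_t^2 (names introduced here) are the biological and technical variances.

```python
def variance_of_mean(sigma_b2, sigma_t2, n, p, m):
    """Variance of the mean expression under the assumptions stated above.

    n: biological samples, p: pools, m: arrays, so r = n / p samples per pool.
    """
    r = n / p
    return sigma_b2 / (p * r) + sigma_t2 / m

# Illustrative case: biological variance 4 times the technical variance
pooled = variance_of_mean(4.0, 1.0, n=9, p=3, m=6)          # 9 subjects, 3 pools, 6 arrays
unpooled_same_arrays = variance_of_mean(4.0, 1.0, n=6, p=6, m=6)   # 6 subjects, 6 arrays
unpooled_same_subjects = variance_of_mean(4.0, 1.0, n=9, p=9, m=9) # 9 subjects, 9 arrays
print(pooled, unpooled_same_arrays, unpooled_same_subjects)
```

With the same number of arrays, the pooled design has the smaller variance because more subjects contribute to the biological term. With the same number of subjects, the standard design is slightly better but uses more arrays.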
53Power comparisons
Biological samples | Microarrays | Pool size | Power
5/group | 5/group | 1 (standard design) | 0.81
6/group | 6/group | 1 (standard design) | 0.95
6/group | 3/group | 2 (i.e. 3 pools/group) | 0.30
8/group | 4/group | 2 (i.e. 4 pools/group) | 0.80
10/group | 5/group | 2 (i.e. 5 pools/group) | 0.98
- Zhang and Gant (2005)
54Power comparisons
- Conditions of the simulation study
- Biological variation is 4 times the technical variation.
- False positive rate is 0.001.
- Detect 2-fold expression.
- Data are normally distributed.
55A fundamental assumption
- Biological averaging
- Suppose an experiment consists of pooling r samples. Then the expression of a gene in the pooled sample is assumed to be the average of the gene's expression in the r samples.
- This assumption need not be true, especially if the expression values are transformed non-linearly.
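A small numerical illustration of why a non-linear transformation breaks the averaging assumption, using made-up raw expression values and the log2 transform. Even if the pooled sample is an exact average on the raw scale, its log is not the average of the individual logs.

```python
import math

# Made-up raw expression of one gene in r = 3 individual samples
individual = [256.0, 1024.0, 4096.0]

# Raw-scale averaging: what an ideal pooled sample would measure
pooled_raw = sum(individual) / len(individual)

log_of_pool = math.log2(pooled_raw)
mean_of_logs = sum(math.log2(v) for v in individual) / len(individual)
print(pooled_raw, log_of_pool, mean_of_logs)
```

The log of the pooled value is larger than the mean of the individual logs (Jensen's inequality), so averaging does not carry over to the log scale.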
56Some important experimental designs
- Reference designs (Spotted array)
- Each treatment sample is hybridized against a common reference control.
- Loop designs (Spotted array)
- Suppose we have a control and three experimental groups A, B and C. Then hybridize Control with A, A with B, B with C and C with A.
57Data Analysis - Preliminaries
- Normalization
- Transformation of data (usual methods)
- Perhaps first fit ANOVA and plot the residuals
- Log transformation
- Square root
- More generally, Box-Cox family of transformations
- Identify potential outliers in the data (again,
perhaps use the residuals)
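The Box-Cox family mentioned above can be sketched as follows. The parameter name `lam` is chosen for this example. The log transformation is the limiting case lam = 0, and lam = 0.5 is the square root up to a shift and scale.

```python
import math

def box_cox(y, lam):
    """Box-Cox transform of a positive value: (y**lam - 1) / lam, or log(y) at lam = 0."""
    if y <= 0:
        raise ValueError("Box-Cox requires positive data")
    if lam == 0:
        return math.log(y)
    return (y ** lam - 1) / lam

print(box_cox(9.0, 0.5))     # square-root member of the family
print(box_cox(math.e, 0))    # log member of the family
print(box_cox(5.0, 1))       # lam = 1 only shifts the data
```

In practice the value of lam is usually chosen from the data, for example by examining residuals from a fitted ANOVA model as suggested on the slide.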
58Data Analysis
- The method of analysis depends upon the scientific question of interest.
- In the next three lectures we describe several general methods and illustrate some using real data!