Title: MIcroarray Data Analysis System version 2.19
1MIcroarray Data Analysis System (version 2.19)
Wei Liang, October 2004
2Microarray Data Flow
Image Analysis
.tiff Image File
Raw Gene Expression Data
Gene Annotation
Normalization / Filtering
Normalized Data with Gene Annotation
Expression Analysis
Data Entry / Management
Interpretation of Analysis Results
3MIDAS is a Normalization and Filtering tool
for microarray data analysis!
Serves as a data pre-processor for clustering
analysis (MeV).
5Why Normalization and Filtering?
.tiff Image Files
Raw Data File
Sample1 mRNA
Cy3 intensity
RT
RT
cDNA array
Sample2 mRNA
Cy5 intensity
6Why Normalization and Filtering?
- The hypothesis underlying microarray analysis is
that the measured intensities for each arrayed
gene represent its relative expression level.
- We use these intensities to identify biologically
relevant patterns of expression by comparing
measured levels between states on a gene-by-gene
basis.
- However, before the levels can be appropriately
compared, one generally performs a number of
transformations on the data to eliminate
questionable or low quality data, to adjust the
measured intensities to facilitate comparisons,
and to select those genes that are significantly
differentially expressed.
7MIDAS data analysis methods
- 8 normalization/transformation methods
  - Total Intensity normalization
  - Ratio Statistics normalization
  - LOWESS (Locfit) normalization
  - Standard deviation regularization
  - Iterative linear regression normalization
  - In-slide replicates analysis
  - Iterative log mean centering normalization
  - MA-ANOVA
- 10 quality control filtering methods
  - Flip-dye consistency checking
  - Low intensity filter
  - Spot QC flag checking
  - Ratio Statistics confidence interval checking
  - Signal/Noise checking
  - Invalid-intensity checking
  - Cross-file-trim
- 3 significant genes identification methods
  - Slice analysis (non-statistical)
  - Cross-slide replicates t-test (statistical)
  - Cross-slide one-class SAM (statistical)
8Graphical scripting language
9Graphical scripting language
- Read input files
- Define analysis pipeline and set parameters for each analysis module
- Write output files
11Sample data
12LOWESS (Locfit) normalization
R-I plot: logRatio vs. logIntensityProduct
1. Tilted tails at the low-intensity and high-intensity ends
2. Mean not centered at 0 and intensity dependent
13LOWESS (Locfit) normalization
Gene X
Exp factor
Bio factor
- If a gene is equally expressed in the Cy3 and Cy5 samples, log2(Cy5/Cy3) = 0
- Two factors contribute to the apparent up-regulation of gene X:
  1. Biological factors (which we are interested in)
  2. Experimental factors, e.g. different sensitivity to the red and green lasers (which we are NOT interested in and want to remove)
14LOWESS (Locfit) normalization
Gene X
Exp factor
Bio factor
15LOWESS (Locfit) normalization
- Local linear regression model
- Tri-cube weight function
- Least Squares
Estimated values of log2(Cy5/Cy3) as a function of log10(Cy3 × Cy5)
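The three ingredients on this slide (local linear regression, tri-cube weights, least squares) can be sketched in plain Python as follows. This is only an illustration: the function names, the `frac` smoothing parameter, and the nearest-neighbour bandwidth rule are assumptions, and it is not the Locfit implementation named in the slide title.

```python
import math

def tricube(u):
    """Tri-cube weight: (1 - |u|^3)^3 inside the window, 0 outside."""
    u = abs(u)
    return (1.0 - u ** 3) ** 3 if u < 1.0 else 0.0

def lowess_fit(x, y, frac=0.33):
    """Fitted value y(x_i) at every x_i from a weighted local linear fit.

    x: log10(Cy3 * Cy5) per spot, y: log2(Cy5 / Cy3) per spot.
    frac (assumed parameter): fraction of spots used in each local fit.
    """
    n = len(x)
    k = max(2, int(math.ceil(frac * n)))      # neighbours per local fit
    fitted = []
    for x0 in x:
        # Bandwidth = distance to the k-th nearest neighbour of x0.
        dists = sorted(abs(xj - x0) for xj in x)
        h = dists[k - 1] or 1e-12
        w = [tricube((xj - x0) / h) for xj in x]
        # Weighted least squares for y = a + b * (x - x0).
        sw = sum(w)
        swx = sum(wj * (xj - x0) for wj, xj in zip(w, x))
        swxx = sum(wj * (xj - x0) ** 2 for wj, xj in zip(w, x))
        swy = sum(wj * yj for wj, yj in zip(w, y))
        swxy = sum(wj * (xj - x0) * yj for wj, xj, yj in zip(w, x, y))
        det = sw * swxx - swx ** 2
        if abs(det) < 1e-12:
            fitted.append(swy / sw)           # degenerate window: weighted mean
        else:
            fitted.append((swxx * swy - swx * swxy) / det)  # intercept = fit at x0
    return fitted
```

A local linear fit reproduces a straight line exactly, which makes a convenient sanity check before applying it to real R-I data.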
16LOWESS (Locfit) normalization
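Use the estimated curve y(x_i) to correct the raw data:
$$\log_2(R_i/G_i)' = \log_2(R_i/G_i) - y(x_i) = \log_2(R_i/G_i) - \log_2 2^{y(x_i)} = \log_2\left(\frac{R_i}{G_i} \cdot \frac{1}{2^{y(x_i)}}\right)$$
which is equivalent to
$$R_i' = R_i, \qquad G_i' = G_i \cdot 2^{y(x_i)}$$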
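As a small worked example of this correction, the sketch below leaves the red channel unchanged and multiplies the green channel by 2 raised to the fitted value. The function name and the sample numbers are hypothetical; the fitted values are assumed to come from a LOWESS fit computed elsewhere.

```python
import math

def apply_lowess_correction(red, green, fitted):
    """Return (red', green') where green' = green * 2**y(x_i) and red' = red."""
    new_green = [g * 2.0 ** y for g, y in zip(green, fitted)]
    return list(red), new_green

# Illustrative values only: a spot whose fitted bias is 1 log2 unit,
# and a spot with no fitted bias.
red = [400.0, 1000.0]
green = [100.0, 1000.0]
fitted = [1.0, 0.0]

new_red, new_green = apply_lowess_correction(red, green, fitted)
corrected = [math.log2(r / g) for r, g in zip(new_red, new_green)]
raw = [math.log2(r / g) for r, g in zip(red, green)]
# Each corrected log ratio equals the raw log ratio minus the fitted value.
```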
17LOWESS (Locfit) normalization
LOWESS-corrected RI plot
18Standard deviation regularization
Assumption: Within each block and each slide, spots should have the same spread of log2(Cy5/Cy3) values.
SD-Reg scales the (Cy3, Cy5) intensity pair for each spot so that the spots within each block or slide have the same standard deviation as those in other blocks or slides.
19Standard deviation regularization
- Let a_ij be the raw log ratio for the jth spot in the ith block (or slide)
- Let a'_ij be the scaled log ratio for the jth spot in the ith block (or slide)
where N_i denotes the number of genes in the ith block (or slide), M denotes the number of blocks (or slides), and ā_i denotes the mean log ratio of the ith block (or slide)
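The scaling formula itself did not survive on this slide, so the following is only a sketch of one common way to equalize spread, not necessarily the exact MIDAS formula. It assumes each block's log ratios are divided by that block's standard deviation relative to the geometric mean of all block standard deviations. The function name and the use of the N-1 denominator are also assumptions.

```python
import math

def sd_regularize(blocks):
    """Scale log ratios so that every block (or slide) has the same SD.

    blocks: list of lists of log ratios a_ij, one inner list per block i.
    Each block needs at least two values and a non-zero spread.
    """
    sds = []
    for vals in blocks:
        mean = sum(vals) / len(vals)                       # block mean
        var = sum((a - mean) ** 2 for a in vals) / (len(vals) - 1)
        sds.append(math.sqrt(var))
    # Geometric mean of the M per-block standard deviations.
    geo = math.exp(sum(math.log(s) for s in sds) / len(sds))
    # Assumed scaling: a'_ij = a_ij * geo / sd_i
    return [[a * geo / s for a in vals] for vals, s in zip(blocks, sds)]
```

For example, two blocks with log ratios `[-1, 0, 1]` and `[-2, 0, 2]` end up with identical values after scaling, because both are brought to the geometric-mean spread.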
20Standard deviation regularization
21Flip dye replicates consistency filter
- Flip-dye experiments help reduce random error
- The intensities in the file pair are flipped, i.e. R1/G1 corresponds to G2/R2, or equivalently R1 corresponds to G2 and G1 corresponds to R2
22Flip dye replicates consistency filter
- Calculate expression levels for all genes in the flip-dye pair
- Filter out genes with inconsistent expression levels between flip-dye replicates
- For genes that pass the consistency check, take the geometric mean of the corresponding intensities from the replicated pair
How is consistency measured between replicates?
23Flip dye replicates consistency filter
100% consistency
24Flip dye replicates consistency Filter
SD cut
Regardless of the dataset, always cuts the same percentage of genes for the same SD cutoff
Threshold cut
The percentage cut depends on the specified log-ratio consistency range, e.g.
-1 < log ratio < 1
1/2 < ratio < 2
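The flip-dye steps (score each gene, drop inconsistent genes, merge the rest by geometric mean) can be sketched with both cut styles as below. The consistency score used here, log2((R1·R2)/(G1·G2)), is an assumption: it is zero when R1/G1 equals G2/R2, but the slides do not state the exact measure MIDAS uses. The function name and parameters are hypothetical.

```python
import math
import statistics

def flip_dye_filter(pairs, mode="threshold", log_range=1.0, n_sd=2.0):
    """Filter and merge flip-dye replicates.

    pairs: list of (r1, g1, r2, g2) positive intensities, one tuple per gene.
    mode="threshold": keep genes with |score| < log_range.
    mode="sd": keep genes within n_sd standard deviations of the mean score.
    Returns one entry per gene: merged (sample_a, sample_b) or None if filtered.
    """
    # Assumed consistency score: 0 means perfectly consistent replicates.
    scores = [math.log2((r1 * r2) / (g1 * g2)) for r1, g1, r2, g2 in pairs]
    if mode == "sd":
        mean = statistics.mean(scores)
        sd = statistics.stdev(scores)
        keep = [abs(s - mean) <= n_sd * sd for s in scores]
    else:
        keep = [abs(s) < log_range for s in scores]
    merged = []
    for (r1, g1, r2, g2), ok in zip(pairs, keep):
        if ok:
            # Geometric mean of the channels that measured the same sample.
            merged.append((math.sqrt(r1 * g2), math.sqrt(g1 * r2)))
        else:
            merged.append(None)
    return merged
```

With the threshold cut, the fraction of genes removed depends on the data; with the SD cut, it is governed by `n_sd`, which matches the contrast drawn on the slide.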
26Slice Analysis filter
- Remove genes with z-scores outside the range of interest
28Slice Analysis filter
- Slide the window along the log(IntensityProduct) axis
- Calculate logRatioMean and logRatioSD of the data points within each slice window
- Calculate the Z-score of each data point:
  Z-score = (logRatio - logRatioMean) / logRatioSD
- Trim data with Z-scores outside the range of interest
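The four steps above can be sketched as follows. The window here is defined as a fixed number of neighbouring spots in log-intensity-product order, which is an assumption (the slides do not say how the slice width is set), and the function names and default cutoffs are hypothetical.

```python
import math
import statistics

def slice_zscores(red, green, window=50):
    """Z-score of each spot's log ratio relative to its local slice."""
    log_ratio = [math.log2(r / g) for r, g in zip(red, green)]
    log_prod = [math.log10(r * g) for r, g in zip(red, green)]
    # Order spots along the log(IntensityProduct) axis.
    order = sorted(range(len(red)), key=lambda i: log_prod[i])
    z = [0.0] * len(red)
    half = window // 2
    for rank, i in enumerate(order):
        lo = max(0, rank - half)
        hi = min(len(order), rank + half + 1)
        vals = [log_ratio[j] for j in order[lo:hi]]   # spots in this slice
        mean = statistics.mean(vals)
        sd = statistics.stdev(vals) if len(vals) > 1 else 0.0
        z[i] = (log_ratio[i] - mean) / sd if sd > 0 else 0.0
    return z

def slice_filter(red, green, z_cut=2.0, window=50):
    """True for spots whose |Z-score| is within z_cut, False for trimmed spots."""
    return [abs(v) <= z_cut for v in slice_zscores(red, green, window)]
```

Depending on the goal, the same Z-scores can be used the other way round, keeping only the spots beyond the cutoff as candidate differentially expressed genes.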
29Slice Analysis filter
30Analysis packaging
myAnalysis.prj
31MIDAS graphing
32MIDAS graphing
R-I plot (.prc)
FlipDye Diagnostic plot (.rrc)
Intensity plot (.ity, .lty)
Z-score Distribution plot (.his)
SAM plot (.sam)
Box plot (.box)
33MIDAS data viewer
34Statistically significant gene identification
methods
Two methods are implemented in this release of MIDAS:
- Cross-slide replicates one-class t-test
- Cross-slide replicates one-class SAM
35SAM (Significance Analysis of Microarrays)
A statistical technique for finding significant
genes in a set of microarray experiments.
Reference:
Tusher, V.G., R. Tibshirani and G. Chu. 2001.
Significance analysis of microarrays applied to
the ionizing radiation response. Proceedings of
the National Academy of Sciences USA 98:
5116-5121.
Designs:
- one-class (available in this release)
36SAM (Significance Analysis of Microarrays)
One-class SAM
Identify genes whose mean expression across experiments is different from a user-specified mean.
- Assign a score (d) to each gene based on its change in expression relative to the standard deviation of repeated measurements for the gene
- Genes with scores greater than a threshold (Δ) are deemed potentially significant
- For these potentially significant genes, the proportion likely to have been wrongly identified by chance, the False Discovery Rate (FDR), is estimated
- The goal is to pick a set of differentially expressed genes with an FDR acceptable to the user
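A simplified sketch of the one-class idea is shown below: a d score per gene, a cutoff on |d|, and a rough FDR estimate from sign-flip permutations. It is a deliberate simplification, not the full procedure of Tusher et al. or the MIDAS implementation. The function names, the fixed `s0` constant, and the direct cutoff on |d| are all assumptions made for illustration.

```python
import itertools
import math
import statistics

def d_score(vals, s0):
    """Mean divided by (standard error + s0); s0 guards against tiny variances."""
    se = statistics.stdev(vals) / math.sqrt(len(vals))
    return statistics.mean(vals) / (se + s0)

def one_class_sam(expr, mu0=0.0, s0=0.1, threshold=3.0):
    """expr: {gene: [log ratios across replicate slides]} (same length, >= 2).

    Returns (observed d scores, genes called significant, estimated FDR).
    """
    centered = {g: [v - mu0 for v in vals] for g, vals in expr.items()}
    observed = {g: d_score(vals, s0) for g, vals in centered.items()}
    called = [g for g, d in observed.items() if abs(d) > threshold]
    n = len(next(iter(centered.values())))
    # Null distribution: flip the sign of whole slides in every possible way.
    false_counts = []
    for signs in itertools.product((1, -1), repeat=n):
        count = sum(
            1 for vals in centered.values()
            if abs(d_score([s * v for s, v in zip(signs, vals)], s0)) > threshold
        )
        false_counts.append(count)
    expected_false = statistics.mean(false_counts)
    fdr = expected_false / len(called) if called else 0.0
    return observed, called, fdr
```

Raising or lowering `threshold` trades the number of genes called against the estimated FDR, which is the adjustment the user makes to reach an acceptable FDR. Enumerating every sign pattern is only practical for a small number of replicate slides.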
37SAM (Significance Analysis of Microarrays)
positively significant genes
FDR
Δ adjustment
38Automated report generation
39Automated report generation
40TM4 MIDAS web page
http://www.tigr.org/software/tm4/midas.html
http://www.tm4.org/midas.html