Title: Introduction to Microarray and Data Analysis
1Introduction to Microarray and Data Analysis
- By
- Han-Yu Chuang
- 03/11/04
2Biological background Molecular Biology
3The Central Dogma of Molecular Biology
4Basic principles in physics, chemistry and biology
                        Principles            Known?
Physics    Matter       Elementary particles  Yes
Chemistry  Compound     Elements              Yes
Biology    Organism     Genes                 No
Every biological rule has exceptions!
5Measuring Gene Expression
Idea: measure the amount of mRNA to see which genes are being expressed in (used by) the cell.
Measuring protein would be more direct, but is currently harder.
6Microarrays provide a means to measure gene
expression
7How to measure gene expression?
8A simple idea: Northern Blot
9 Technology Advanced
10What is Microarray?
- Put a large number (100K) of cDNA sequences or synthetic DNA oligomers onto a glass slide (or other substrate) in known locations on a grid.
- Label an RNA sample and hybridize
- Measure amounts of RNA bound to each square in the grid
11Imagination on Microarray
13Basic principles
- Main novelty is one of scale
- hundreds or thousands of probes rather than tens
- Probes are attached to solid supports
- Robotics are used extensively
- Informatics is a central component at all stages
14Major technologies
- cDNA probes (> 200 nt), usually produced by PCR, attached to either nylon or glass supports
- Oligonucleotides (25-80 nt) attached to glass support
- Oligonucleotides (25-30 nt) synthesized in situ on silica wafers (Affymetrix)
- Probes attached to tagged beads
15Areas Being Studied with Microarrays
- Differential gene expression between two (or more) sample types
- Similar gene expression across treatments
- Tumor sub-class identification using gene expression profiles
- Classification of malignancies into known classes
- Identification of marker genes that characterize different tumor classes
- Identification of genes associated with clinical outcomes (e.g. survival)
16Applications
- Pathway Inference
- (Gene regulatory network prediction)
- Disease detection
- (probe detection classification)
17Principal uses of chips
- Genome-scale gene expression analysis
- Differentiation
- Responses to environmental factors
- Disease processes
- Effects of drugs
- Detection of sequence variation
- Genetic typing
- Detection of somatic mutations (e.g. in oncogenes)
- Direct sequencing
18cDNA chips
- Probes are cDNA fragments, usually amplified by PCR
- Probes are deposited on a solid support, either positively charged nylon or glass slide
- Samples (normally poly(A) RNA) are labelled using fluorescent dyes
- At least two samples are hybridized to chip
- Fluorescence at different wavelengths measured by a scanner
19Standard protocol for comparative hybridization
20cDNA microarray experiments
- mRNA levels compared in many different contexts
- Different tissues, same organism (brain v. liver)
- Same tissue, same organism (treatment (ttt) v. control (ctl), tumor v. non-tumor)
- Same tissue, different organisms (wild type (wt) v. knockout (ko), transgenic (tg), or mutant)
- Time course experiments (effect of treatment, development)
- Other special designs (e.g. to detect spatial patterns).
21Web animation of a cDNA microarray experiment
http://www.bio.davidson.edu/courses/genomics/chip/chip.html (DNA Microarray Technique)
22Yeast genome on a chip
23Brief outline of steps for producing a microarray
- cDNA probes attached or synthesized to solid support
- Hybridize targets
- Scan array
24Using Microarray with cDNA or oligonucleotide
[Workflow diagram; step labels as transcribed, grouped by stage]
- Building the Chip: massive PCR; PCR purification and preparation; preparing slides; printing; post processing
- Preparing RNA: cell culture and harvest; RNA isolation; cDNA production
- Hybing the Chip: probe labeling; array hybridization; data analysis
25cDNA microarrays
cDNA clones
26cDNA microarrays
- Compare the gene expression in two samples of cells
PRINT: cDNA from one gene on each spot
SAMPLES: cDNA labelled red/green, e.g. treatment / control, normal / tumor tissue
27HYBRIDIZE: Add equal amounts of labelled cDNA samples to microarray.
SCAN: Laser, Detector
28Quantification of expression
- For each spot on the slide we calculate
- Red intensity = Rfg - Rbg
- Green intensity = Gfg - Gbg
- (fg = foreground, bg = background)
- and combine them in the log (base 2) ratio
- Log2(Red intensity / Green intensity)
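The calculation above can be sketched in a few lines of Python. This is a minimal illustration, not the software used in the talk: the function name `log2_ratio` and the spot values are hypothetical, and returning `None` for non-positive corrected intensities is an assumption about how to handle spots where the ratio is undefined.

```python
import math

def log2_ratio(r_fg, r_bg, g_fg, g_bg):
    """Background-corrected log2(R/G) for one spot, or None if undefined."""
    red = r_fg - r_bg      # Red intensity = Rfg - Rbg
    green = g_fg - g_bg    # Green intensity = Gfg - Gbg
    # Assumption: the log ratio is left undefined when a corrected
    # intensity is zero or negative (background brighter than the spot).
    if red <= 0 or green <= 0:
        return None
    return math.log2(red / green)

# Hypothetical spots as (Rfg, Rbg, Gfg, Gbg)
spots = [(1100, 100, 600, 100), (300, 100, 900, 100), (90, 100, 500, 100)]
ratios = [log2_ratio(*spot) for spot in spots]
print(ratios)  # [1.0, -2.0, None]
```

A value of 1.0 means the red channel is twice as bright as the green one after background correction; -2.0 means it is four times dimmer.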
29Gene Expression Data
- On p genes for n slides: p is O(10,000), n is O(10-100), but growing.
Genes (rows) by slides (columns):
        slide 1  slide 2  slide 3  slide 4  slide 5  ...
1        0.46     0.30     0.80     1.51     0.90    ...
2       -0.10     0.49     0.24     0.06     0.46    ...
3        0.15     0.74     0.04     0.10     0.20    ...
4       -0.45    -1.03    -0.79    -0.56    -0.32    ...
5       -0.06     1.06     1.35     1.09    -1.09    ...
Gene expression level of gene 5 in slide 4 = Log2(Red intensity / Green intensity)
These values are conventionally displayed on a red (> 0), yellow (0), green (< 0) scale.
30cDNA chip design
- Probe selection
- Non-redundant set of probes
- Includes genes of interest to project
- Corresponds to physically available clones
- Chip layout
- Grouping of probes by function
- Correspondence between wells in microtitre plates
and spots on the chip
31Glass chip manufacturing
- Choice of coupling method
- Physical (charge), non-specific chemical, specific chemical (modified PCR primer)
- Choice of printing method
- Mechanical pins: flat tip, split tip, pin ring
- Piezoelectric deposition (ink-jet)
- Robot design
- Precision of movement in 3 axes
- Speed and throughput
- Number of pins, numbers of spots per pin load
32Labeling and hybridization
- Targets are normally prepared by oligo(dT) primed cDNA synthesis
- Probes should contain 3' end of mRNA
- Need Cot-1 DNA as competitor
- Specific activity will limit sensitivity of assay
- Alternative protocol is to make ds cDNA containing bacterial promoter, then cRNA
- Can work with smaller amount of RNA
- Less quantitative
- Hybridization usually under coverslips
33Scanning the arrays
- Laser scanners
- Excellent spatial resolution
- Good sensitivity, but can bleach fluorochromes
- Still rather slow
- CCD scanners
- Spatial resolution can be a problem
- Sensitivity easily adjustable (exposure time)
- Faster and cheaper than lasers
- In all cases, raw data are images showing
fluorescence on surface of chip
34Microarray data on the Web
- Many groups have made their raw data available, but in many formats
- Some groups have created searchable databases
- There are several initiatives to create unified databases
- EBI ArrayExpress
- NCBI Gene Expression Omnibus
- Companies are beginning to sell microarray expression data (e.g. Incyte)
35Bioinformatics of microarrays
- Array design: choice of sequences to be used as probes
- Experimental design
- Analysis of scanned images
- Spot detection, normalization, quantitation
- Primary analysis of hybridization data
- Basic statistics, reproducibility, data scattering, etc.
- Comparison of multiple samples
- Clustering, SOMs, classification
- Sample tracking and databasing of results
36Biological question: differentially expressed genes, sample class prediction, etc.
Experimental design
Microarray experiment
16-bit TIFF files
Image analysis
(Rfg, Rbg), (Gfg, Gbg)
Normalization
R, G
Estimation
Testing
Clustering
Discrimination
Biological verification and interpretation
37 Microarray Image Analysis
- Quantitation of fluorescence signals
38Scanner
PMT
Pinhole
Detector lens
Laser
Beam-splitter
Objective Lens
Dye
Glass Slide
39Images from scanner
- Resolution
- standard 10 µm currently, max 5 µm
- 100 µm spot on chip = 10 pixels in diameter
- Image format
- TIFF (tagged image file format), 16 bit (65536 levels of grey)
- 1 cm x 1 cm image at 16 bit = 2 MB (uncompressed)
- other formats exist, e.g. SCN (used at Stanford University)
- Separate image for each fluorescent sample
- channel 1, channel 2, etc.
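The image-size figures on this slide follow from simple arithmetic, sketched below. The helper name `image_size_bytes` is hypothetical, and the calculation assumes one uncompressed 16-bit value per pixel with no file header.

```python
def image_size_bytes(width_cm, height_cm, pixel_um, bits_per_pixel=16):
    """Uncompressed size of one scanned channel, ignoring any file header."""
    pixels_x = round(width_cm * 10_000 / pixel_um)   # 1 cm = 10,000 um
    pixels_y = round(height_cm * 10_000 / pixel_um)
    return pixels_x * pixels_y * bits_per_pixel // 8

size_10um = image_size_bytes(1, 1, 10)  # standard resolution
size_5um = image_size_bytes(1, 1, 5)    # maximum resolution
print(size_10um, size_5um)  # 2000000 8000000
print(100 // 10)            # 10 pixels across a 100 um spot at 10 um per pixel
```

Halving the pixel size quadruples the file size, and each fluorescent channel is stored as its own image of that size.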
40Images examples
41Practical Problems 1
- Comet Tails
- Likely caused by insufficiently rapid immersion
of the slides in the succinic anhydride blocking
solution.
42Practical Problems 2
43Practical Problems 3
- High Background
- 2 likely causes
- Insufficient blocking.
- Precipitation of the labeled probe.
- Weak Signals
44Practical Problems 4
Spot overlap. Likely cause: too much rehydration during post-processing.
45Practical Problems 5
Dust
46Processing of images
- Addressing or gridding
- Assigning coordinates to each of the spots
- Segmentation
- Classification of pixels either as foreground or as background
- Intensity determination for each spot
- Foreground fluorescence intensity pairs (R, G)
- Background intensities
- Quality measures
47Addressing
- The measurement process depends on the addressing procedure
- Addressing efficiency can be enhanced by allowing user intervention (slow!)
- Most software systems now provide for both manual and automatic gridding procedures
Registration
48Problems in automatic addressing
- Misregistration of the red and green channels
- Rotation of the array in the image
- Skew in the array
Rotation
49Segmentation
- Segmentation methods
- Fixed circle segmentation
- Adaptive circle segmentation
- Adaptive shape segmentation
- Histogram segmentation
50Information Extraction
- Spot Intensities
- mean (pixel intensities).
- median (pixel intensities).
- Background values
- Local
- Morphological opening
- Constant (global)
- None
- Quality Information
- Area
- Circularity
- Signal to Noise ratio
Take the average
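As a rough illustration of the extraction step, the sketch below applies fixed circle segmentation to one small image patch and reports mean and median foreground intensity after subtracting a local background. The function name `spot_summary`, the synthetic patch, and the choice of the median of all out-of-circle pixels as the local background are assumptions made for this example, not the method of any particular image analysis package.

```python
import statistics

def spot_summary(patch, center, radius):
    """Fixed-circle segmentation of one spot in a small image patch."""
    cy, cx = center
    foreground, background = [], []
    for y, row in enumerate(patch):
        for x, value in enumerate(row):
            inside = (y - cy) ** 2 + (x - cx) ** 2 <= radius ** 2
            (foreground if inside else background).append(value)
    # Assumption: local background = median of the pixels outside the circle.
    bg = statistics.median(background)
    return {
        "mean": statistics.mean(foreground) - bg,
        "median": statistics.median(foreground) - bg,
        "area": len(foreground),  # number of foreground pixels
    }

# Synthetic 7x7 patch: background level 100, bright disc of radius 2 in the centre.
patch = [[1000 if (y - 3) ** 2 + (x - 3) ** 2 <= 4 else 100 for x in range(7)]
         for y in range(7)]
summary = spot_summary(patch, center=(3, 3), radius=2)
print(summary)
```

On real images the mean and median differ when a spot contains dust or saturated pixels, which is one reason both are usually reported.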
51Quantification of expression
- For each spot on the slide we calculate
- Red intensity = Rfg - Rbg
- Green intensity = Gfg - Gbg
- (fg = foreground, bg = background)
- and combine them in the log (base 2) ratio
- Log2(Red intensity / Green intensity)
52Microarray Data Normalization
- Why?
- To correct for systematic differences between samples on the same slide, or between slides, which do not represent true biological variation between samples.
- How do we know it is necessary?
- By examining self-self hybridizations, where no true differential expression is occurring.
- We find dye biases which vary with overall spot intensity, location on the array, plate origin, pins, scanning parameters, ...
- Goals
- Reduces systematic (not random) effects
- Makes it possible to compare several arrays
54Intensity-dependent normalization
- Here, run a line through the middle of the MA plot, shifting the M value of the pair (A, M) by c = c(A), i.e.
- log2 R/G -> log2 R/G - c(A)
- One estimate of c(A) is made using the LOWESS function of Cleveland (1979): LOcally WEighted Scatterplot Smoothing.
A = (log2 R + log2 G)/2, M = log2 R - log2 G
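The sketch below illustrates intensity-dependent normalization with M = log2 R - log2 G and A = (log2 R + log2 G)/2. It estimates c(A) with a simplified local linear smoother (tricube weights over the nearest neighbours, no robustness iterations); this is a stand-in for Cleveland's LOWESS rather than a full implementation, and the function names, the `frac` parameter and the intensities are hypothetical.

```python
import math

def ma_values(red, green):
    """Return (M, A) lists for background-corrected intensities R and G."""
    m = [math.log2(r) - math.log2(g) for r, g in zip(red, green)]
    a = [(math.log2(r) + math.log2(g)) / 2 for r, g in zip(red, green)]
    return m, a

def local_fit(x, y, frac=0.5):
    """Simplified LOWESS-style smoother: one weighted local line per point."""
    k = max(2, math.ceil(frac * len(x)))
    fitted = []
    for x0 in x:
        nearest = sorted((abs(xi - x0), i) for i, xi in enumerate(x))[:k]
        h = nearest[-1][0] or 1.0            # bandwidth = distance to k-th neighbour
        sw = swx = swy = swxx = swxy = 0.0
        for d, i in nearest:
            w = (1 - (d / h) ** 3) ** 3      # tricube weight
            sw += w
            swx += w * x[i]
            swy += w * y[i]
            swxx += w * x[i] * x[i]
            swxy += w * x[i] * y[i]
        denom = sw * swxx - swx ** 2
        if abs(denom) < 1e-12:               # degenerate case: fall back to weighted mean
            fitted.append(swy / sw)
        else:
            slope = (sw * swxy - swx * swy) / denom
            fitted.append((swy - slope * swx) / sw + slope * x0)
    return fitted

def normalize(red, green, frac=0.5):
    """Shift each M by the fitted c(A): log2 R/G -> log2 R/G - c(A)."""
    m, a = ma_values(red, green)
    c = local_fit(a, m, frac)
    return [mi - ci for mi, ci in zip(m, c)]

# Hypothetical intensities with a dye bias that grows with spot intensity.
green = [100, 200, 400, 800, 1600, 3200]
red = [110, 240, 520, 1120, 2400, 5120]
print([round(v, 3) for v in normalize(red, green)])
```

In practice the smoother would be fitted to thousands of spots, and a robust LOWESS implementation would down-weight genuinely differentially expressed genes so that they do not distort the fitted curve.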
55Normalization by controls: Microarray Sample Pool titration series
- Pool the whole library
- Control set to aid intensity-dependent normalization
- Different concentrations in titration series
- Spotted evenly spread across the slide in each pin-group
56Differential Expression: Which genes have changed?
- Goal
- Identify genes associated with covariate or response of interest
- Examples
- Qualitative covariates or factors: treatment, cell type, tumor class
- Quantitative covariate: dose, time
- Responses: survival, cholesterol level
- Any combination of these!
57cDNA gene expression data
- Data on G genes for n samples
Genes (rows) by mRNA samples (columns):
        sample1  sample2  sample3  sample4  sample5  ...
1        0.46     0.30     0.80     1.51     0.90    ...
2       -0.10     0.49     0.24     0.06     0.46    ...
3        0.15     0.74     0.04     0.10     0.20    ...
4       -0.45    -1.03    -0.79    -0.56    -0.32    ...
5       -0.06     1.06     1.35     1.09    -1.09    ...
Gene expression level of gene i in mRNA sample j = (normalized) Log(Red intensity / Green intensity)
58An expression profile like this for each gene
[Figure: expression profile for one gene. y-axis: mRNA Cy5/Cy3 ratio r, with up-regulation (induction) and down-regulation (repression) marked; x-axis: time / h, from the start of the experiment.]
59Co-Regulation -- Inference of function
Genes belonging to the same pathway often show the same regulatory patterns (profiles) across a variety of biological situations (or in a time series). Hence, as a hypothesis, genes of unknown function that show regulatory behaviour similar to genes of known function may have a similar function.
- Which genes are differentially expressed?
- Which genes are expressed in a similar way when compared to expression profiles of genes with known function? (co-regulation)
- Patterns of expression (diagnostic fingerprinting)
- Reverse engineering of genetic networks
Differential expression: Comparing the transcriptomes of two different biological samples (e.g. control, heat-shock), you are interested in the subset of genes which are expressed at different levels (up-/down-regulated).
Expression fingerprinting: Often in medical applications it is of interest to characterize the biological status of cells, e.g. the severity of tumor cells, to be able to respond with the right therapy.
Reverse engineering: Using expression data to infer regulatory interactions between a number of genes responsible for a certain adaptation process or developmental process.
60Common methods
61Common methods (II)
62Some disadvantages of common strategies
- Parametric methods are not robust enough.
- They use the actual levels of observations.
- Estimates of the mean and standard deviation can be misled by outliers.
- Nonparametric methods are not sensitive enough.
- They use the ranks of observations instead.
- There are many patterns with the same score.
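Both disadvantages can be seen on a toy example. The sketch below compares a parametric score (a Welch-style t statistic) with a rank-based score (the Mann-Whitney U count) on hypothetical expression values; these two statistics are chosen here only as familiar representatives of each family and are not necessarily the methods shown on the preceding slides.

```python
import math
import statistics

def welch_t(a, b):
    """Parametric score: difference in means over its standard error."""
    se = math.sqrt(statistics.variance(a) / len(a) + statistics.variance(b) / len(b))
    return (statistics.mean(b) - statistics.mean(a)) / se

def mann_whitney_u(a, b):
    """Rank-based score: number of (a, b) pairs where b is larger (ties count 0.5)."""
    return sum(1.0 if y > x else 0.5 if y == x else 0.0 for x in a for y in b)

# Hypothetical log ratios for one gene in two groups of five samples.
control = [1.0, 1.2, 0.9, 1.1, 1.3]
treated = [2.0, 2.2, 1.9, 2.1, 2.3]
treated_outlier = [2.0, 2.2, 1.9, 2.1, 23.0]  # one extreme measurement

t_clean = welch_t(control, treated)
t_outlier = welch_t(control, treated_outlier)
u_clean = mann_whitney_u(control, treated)
u_outlier = mann_whitney_u(control, treated_outlier)
print(round(t_clean, 2), round(t_outlier, 2))  # the outlier inflates the std and shrinks t
print(u_clean, u_outlier)                      # the rank score cannot tell the two apart
```

The single outlier sharply reduces the t statistic even though the groups are still perfectly separated, while the rank score assigns the same value to both patterns and so cannot distinguish a modest shift from a large one.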
63Weighted Punishment on Overlap (WEPO)
- Combine heuristics from parametric and non-parametric methods.
- If a gene is differentially expressed, the expression values of different groups should come from quite different distributions.
64The better the genes, the less the overlap
- Score each gene by estimating the overlapping regions of these classes.
- To prevent information loss and maintain robustness.
65Formula of Weighted Punishment
Where
66An Example
67Another Example
68Other Advanced Microarray data analysis
- Clustering and pattern detection
- Data mining and visualization
- Controls and normalization of results
- Statistical validation
- Linkage between gene expression data and gene sequence/function/metabolic pathways databases
- Discovery of common sequences in co-regulated genes
- Meta-studies using data from multiple experiments
70Cluster analysis
- Used to find groups of objects when not already known
- Unsupervised learning
- Associated with each object is a set of measurements (the feature vector)
- Aim is to identify groups of similar objects on the basis of the observed measurements
72Clustering Gene Expression Data
- Can cluster genes (rows), e.g. to (attempt to) identify groups of co-regulated genes
- Can cluster samples (columns), e.g. to identify tumors based on profiles
- Can cluster both rows and columns at the same time
73Clustering Gene Expression Data
- Leads to readily interpretable figures
- Can be helpful for identifying patterns in time or space
- Useful (essential?) when seeking new subclasses of samples
- Can be used for exploratory purposes
74Types of Clustering
- Hierarchical
- Link similar genes, build up to a tree of all
- K-means
- Partition genes into a prespecified number of groups K
- Self Organizing Maps (SOM)
- Split all genes into similar sub-groups
- Finds its own groups (machine learning)
- Principal Component
- every gene is a dimension (vector), find a single dimension that best represents the differences in the data
75Hierarchical clustering
76 Hierarchical clustering (continued)
To transform the genes × experiments matrix into a genes × genes matrix, use a gene similarity metric (Eisen et al. 1998, PNAS 95:14863-14868).
Exactly the same as Pearson's correlation except for the underlined term.
Where Gi equals the (log-transformed) primary data for gene G in condition i, for any two genes X and Y observed over a series of N conditions.
Goffset is set to 0, corresponding to a fluorescence ratio of 1.0.
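The formula itself did not survive in this transcript. The sketch below reconstructs the similarity metric as it is usually stated for Eisen et al. (1998): the sum over conditions of (Xi - Xoffset)(Yi - Yoffset), divided by N and by the product of the two scale terms, where each scale term is the root mean square of (Gi - Goffset). Treat this as an assumption rather than the slide's exact notation; the function and parameter names are hypothetical.

```python
import math

def eisen_similarity(x, y, x_offset=0.0, y_offset=0.0):
    """Correlation-style similarity with a fixed offset in place of the mean."""
    n = len(x)
    phi_x = math.sqrt(sum((xi - x_offset) ** 2 for xi in x) / n)
    phi_y = math.sqrt(sum((yi - y_offset) ** 2 for yi in y) / n)
    cross = sum((xi - x_offset) * (yi - y_offset) for xi, yi in zip(x, y))
    return cross / (n * phi_x * phi_y)

# Hypothetical log ratios for two genes over three conditions.
gene_x = [1.0, 2.0, 3.0]
gene_y = [3.0, 4.0, 5.0]

uncentered = eisen_similarity(gene_x, gene_y)                        # offsets = 0
pearson = eisen_similarity(gene_x, gene_y, x_offset=2.0, y_offset=4.0)  # offsets = means
print(round(uncentered, 4), round(pearson, 4))
```

With the offsets set to each gene's mean the score reduces to Pearson's correlation. With the offset fixed at 0, two genes that rise in parallel but sit at different levels relative to a ratio of 1.0 score slightly below 1.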
77 Hierarchical clustering (continued)
Pearsons correlation example
What if genome expression is clustered based on
negative correlation?
78 Hierarchical clustering (continued)
79Hierarchical Clustering
3 clusters?
2 clusters?
80 K-means clustering
This method differs from hierarchical clustering in many ways. In particular:
- There is no hierarchy; the data are partitioned. You will be presented only with the final cluster membership for each case.
- There is no role for the dendrogram in k-means clustering.
- You must supply the number of clusters (k) into which the data are to be grouped.
81K-means clustering (continued)
Step 1: Transform the n (genes) × m (experiments) matrix into an n (genes) × n (genes) distance matrix.
Step 2: Cluster genes based on a k-means clustering algorithm.
82K-means clustering (continued)
To transform the n × m matrix into an n × n matrix, use a similarity (distance) metric (Tavazoie et al., Nature Genetics 1999 Jul;22(3):281-5).
Euclidean distance, where any two genes X and Y are observed over a series of M conditions.
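A minimal sketch of this transformation, using the standard Euclidean distance (the square root of the sum over the M conditions of (Xi - Yi) squared). The function names are hypothetical, and the three rows reuse the first three genes of the example expression table shown earlier in the deck.

```python
import math

def euclidean(x, y):
    """Euclidean distance between two expression profiles of equal length."""
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

def distance_matrix(profiles):
    """Turn an n x m expression matrix into an n x n distance matrix."""
    return [[euclidean(a, b) for b in profiles] for a in profiles]

# Three genes measured over five slides.
profiles = [
    [0.46, 0.30, 0.80, 1.51, 0.90],
    [-0.10, 0.49, 0.24, 0.06, 0.46],
    [0.15, 0.74, 0.04, 0.10, 0.20],
]
dist = distance_matrix(profiles)
for row in dist:
    print([round(d, 2) for d in row])
```

The resulting matrix is symmetric with zeros on the diagonal, so only the upper triangle needs to be stored for large numbers of genes.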
83K-means clustering(continued)
84K-means clustering algorithm
Step 1: Suppose the genes' expression patterns are positioned in a two-dimensional space based on a distance matrix.
Step 2: The first cluster center (red) is chosen randomly, and subsequent centers are chosen by finding the data point farthest from the centers already chosen. In this example, k = 3.
85K-means clustering algorithm (continued)
Step 3: Each point is assigned to the cluster associated with the closest representative center.
Step 4: Minimize the within-cluster sum of squared distances from the cluster mean by moving the centroid (star points), that is, computing a new cluster representative.
86K-means clustering algorithm (continued)
Step 5: Repeat steps 3 and 4 with the new representative.
Run steps 3, 4 and 5 until no further changes occur.
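Steps 2 to 5 can be sketched as follows. This is an illustrative implementation under stated assumptions, not the code behind the slides: it works directly on point coordinates rather than on a precomputed distance matrix, "farthest from the centers already chosen" is interpreted as the point whose nearest chosen center is farthest away, and the function names and the nine 2-D points are hypothetical.

```python
import math
import random

def dist(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def farthest_first_centers(points, k, rng):
    """Step 2: random first center, then repeatedly add the farthest point."""
    centers = [rng.choice(points)]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(dist(p, c) for c in centers)))
    return centers

def kmeans(points, k, seed=0, max_iter=100):
    rng = random.Random(seed)
    centers = farthest_first_centers(points, k, rng)
    labels = None
    for _ in range(max_iter):
        # Step 3: assign each point to the closest center.
        new_labels = [min(range(k), key=lambda j: dist(p, centers[j])) for p in points]
        if new_labels == labels:   # no further changes: stop
            break
        labels = new_labels
        # Step 4: move each center to the mean of its members.
        for j in range(k):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:
                centers[j] = tuple(sum(c) / len(members) for c in zip(*members))
        # Step 5: the loop repeats steps 3 and 4 with the new centers.
    return labels, centers

# Hypothetical 2-D positions forming three well-separated groups.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (20, 0), (20, 1), (21, 0)]
labels, centers = kmeans(points, k=3)
print(labels)
print(centers)
```

The farthest-point initialization tends to place one starting center in each well-separated group, which is why the intended clusters are recovered here. On less cleanly separated data the result can still depend on the random first center.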
87K-means Clustering
The intended clusters are found.
88Web links
- Leming Shi's Gene-Chips.com page: very rich source of basic information and commercial and academic links
- DNA chips for dummies animation
- A step by step description of a microarray experiment by Jeremy Buhler
- The Big Leagues: Pat Brown and NHGRI microarray projects
89Mini-Review How to make a cDNA microarray
90Glass Slide Array of bound cDNA probes 4x4
blocks 16 print-tip groups
91Microarray Experiment
92Hybridization: Binding cDNA samples (targets) to cDNA probes on slide
cover slip
Hybridise for 5-12 hours
94Quantification of expression
- For each spot on the slide we calculate
- Red intensity = Rfg - Rbg
- Green intensity = Gfg - Gbg
- (fg = foreground, bg = background)
- and combine them in the log (base 2) ratio
- Log2(Red intensity / Green intensity)
95Some Considerations for cDNA Microarray
Experiments (I)
- Scientific (Aims of the experiment)
- Specific questions and priorities
- How will the experiments answer the questions
- Practical (Logistic)
- Types of mRNA samples: reference, control, treatment, mutant, etc.
- Source and amount of material (tissues, cell lines)
- Number of slides available
96Some Considerations for cDNA Microarray
Experiments (II)
- Other Information
- Experimental process prior to hybridization: sample isolation, mRNA extraction, amplification, labelling, ...
- Controls planned: positive, negative, ratio, etc.
- Verification method: Northern, RT-PCR, in situ hybridization, etc.
97Experimental Design
- Ensure questions of interest can be answered accurately, under some constraints
- Cost, number of slides
- Biological material, availability of mRNA
98Combining data across slides
- Data on m genes for n hybridizations
99The design issue here
- Determine which mRNAs are to be labeled with which fluor, and which are to be hybridized together on the same slide.
- i.e., how the samples are paired onto arrays.
100Graphical Representation
- Multi-digraph
- Vertices mRNA samples
- Edges hybridization
- Direction dye assignment
Cy3 sample
Cy5 Sample
101Graphical representation of design
Cy3
Cy5
Box 2. Yang and Speed (2002)
102[Figure: design layout with rows Treatments (A, B), Replicates (1-4; RNA1-RNA4), Dyes (R, G alternating) and Arrays.]
103Reference Design (Kerr and Churchill, 2000)
- v varieties of interest
- Array v
104Loop Design (Kerr and Churchill, 2000)
- v varieties of interest
- Array v
105Comparing K treatments
- Common reference design
- Extensibility
- All-pairs design
- Better in precision
- Comparison within slides
106Natural design choice
- Case 1: Meaningful biological control (C)
- Samples: Liver tissue from four mice treated by cholesterol modifying drugs.
- Question 1: Genes that respond differently between the T and the C.
- Question 2: Genes that responded similarly across two or more treatments relative to control.
- Case 2: Use of universal reference
- Samples: Different tumor samples.
- Question: To discover tumor subtypes.
107Extensibility
- Universal common reference for arbitrary undetermined number of (future) experiments
- Provides extensibility of the series of experiments (within and between labs)
- Linking experiments necessary if common reference source diminished/depleted
108On Graphical Representation
- 2 mRNA samples can be compared if there is a path
- The precision depends on the number of paths
- Direct comparisons within slides more precise than indirect ones
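The path criterion can be checked mechanically once a design is written down as a list of hybridizations. The sketch below is a hypothetical helper, not part of any published design tool: each hybridization is a (Cy3 sample, Cy5 sample) pair, dye direction is ignored when looking for a path, and the function returns the smallest number of slides linking two samples, or `None` if they cannot be compared.

```python
from collections import deque

def shortest_path_length(hybridizations, source, target):
    """Fewest slides linking two samples, or None if no path exists."""
    neighbours = {}
    for cy3, cy5 in hybridizations:
        neighbours.setdefault(cy3, set()).add(cy5)
        neighbours.setdefault(cy5, set()).add(cy3)
    if source not in neighbours or target not in neighbours:
        return None
    queue = deque([(source, 0)])
    seen = {source}
    while queue:                      # breadth-first search
        node, steps = queue.popleft()
        if node == target:
            return steps
        for nxt in neighbours[node]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, steps + 1))
    return None

common_reference = [("Ref", "A"), ("Ref", "B"), ("Ref", "C")]
all_pairs = [("A", "B"), ("B", "C"), ("C", "A")]
print(shortest_path_length(common_reference, "A", "B"))  # 2: indirect, via Ref
print(shortest_path_length(all_pairs, "A", "B"))         # 1: direct, same slide
```

A path length of 1 corresponds to a direct within-slide comparison; longer paths correspond to indirect comparisons, which accumulate noise from every slide along the way.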
109Treatment vs Control
- Two samples
- e.g. KO vs. WT or mutant vs. WT
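Direct design (T and C hybridized on the same slides): estimate = average(log(T/C)), variance = σ²/2
Indirect design (T vs. Ref and C vs. Ref on separate slides): estimate = log(T/Ref) - log(C/Ref), variance = 2σ²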
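The two variances for the direct and indirect designs can be checked with a small simulation. The sketch assumes two slides per design and independent measurement noise with variance σ² on each slide's log ratio, with the true log ratio set to 0 for simplicity; the function name `simulate`, the seed and the number of trials are arbitrary choices for this illustration.

```python
import random
import statistics

def simulate(sigma=1.0, trials=20000, seed=1):
    """Empirical variance of the direct and indirect estimates of log(T/C)."""
    rng = random.Random(seed)
    direct, indirect = [], []
    for _ in range(trials):
        # Direct: two slides each measure log(T/C); the estimate is their average.
        d1, d2 = rng.gauss(0, sigma), rng.gauss(0, sigma)
        direct.append((d1 + d2) / 2)
        # Indirect: one slide measures log(T/Ref), the other log(C/Ref);
        # the estimate is their difference.
        t_ref, c_ref = rng.gauss(0, sigma), rng.gauss(0, sigma)
        indirect.append(t_ref - c_ref)
    return statistics.pvariance(direct), statistics.pvariance(indirect)

var_direct, var_indirect = simulate()
print(round(var_direct, 2), round(var_indirect, 2))  # close to 0.5 and 2.0 for sigma = 1
```

Averaging two independent measurements halves the variance, while differencing two doubles it, so with the same two slides the direct design is four times more precise under these assumptions.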
110Common reference
A
B
C
Ref
All pairs
113The problem
- We suppose comparisons between all pairs of varieties are of equal interest.
- For the number of arrays budgeted for an experiment, which design should we use to gain more precision?
114[Figure: three design graphs with numbered nodes, for V = 5, S = 8; V = 6, S = 8; and V = 7, S = 9. Edge labels were not recoverable.]
From puppy (2002)
115Traditional ways
- Generate the full set of non-isomorphic connected designs for the given number of arrays and samples.
- Then calculate the average variance of each.
116Difficulties
- 11,716,571 non-isomorphic connected graphs on 10 nodes (almost 58 hrs)
- 1,006,700,565 on 11
- 164,059,830,476 on 12
- This strategy is too time-consuming.
117Our strategy
- Using a genetic algorithm (GA) as a smart search method, we don't need to explore all designs, but can still get the optimal one.
- v = 5, a = 8 -> 1.6 secs
- v = 6, a = 8 -> 2.1 secs
- v = 7, a = 9 -> 2.7 secs
- v = 10, a = 20 -> 129.416 secs
- v = 12, a = 14 -> 136.481 secs
- v = 13, a = 15 -> 175.388 secs
118[Figure: design graph on 12 nodes (numbered 1-12), V = 12, S = 24. Edge labels were not recoverable.]
119Statistical model
For any particular gene
B
A
Sample
Expression level
Intensity
G
R
Intensity ? Expression level
122Example Time course
T1
T2
T3
T4
t1
t2
t3
t4
t1, t2, t3, t4 true expression levels
123T1 VS. T2
124T1 VS. T2
125T3 VS. T4