Introduction to Microarray and Data Analysis - PowerPoint PPT Presentation

Slides: 127
Provided by: aide4
1
Introduction to Microarray and Data Analysis
  • By
  • Han-Yu Chuang
  • 03/11/04

2
Biological background: Molecular Biology
3
The Central Dogma of Molecular Biology
4
Basic principles in physics, chemistry and biology
  • Physics: matter, elementary particles. Principles known? Yes
  • Chemistry: compounds, elements. Principles known? Yes
  • Biology: organisms, genes. Principles known? No
Every biological rule has exceptions!
5
Measuring Gene Expression
Idea: measure the amount of mRNA to see which
genes are being expressed in (used by) the cell.
Measuring protein would be more direct, but is
currently harder.
6
Microarrays provide a means to measure gene
expression
7
How to measure gene expression?
8
A simple idea Northern Blot
9
Technology Advanced
10
What is Microarray?
  • Put a large number (100K) of cDNA sequences or
    synthetic DNA oligomers onto a glass slide (or
    other substrate) in known locations on a grid.
  • Label an RNA sample and hybridize
  • Measure amounts of RNA bound to each square in
    the grid

11
Imagination on Microarray
12
(No Transcript)
13
Basic principles
  • Main novelty is one of scale
  • hundreds or thousands of probes rather than tens
  • Probes are attached to solid supports
  • Robotics are used extensively
  • Informatics is a central component at all stages

14
Major technologies
  • cDNA probes (> 200 nt), usually produced by PCR,
    attached to either nylon or glass supports
  • Oligonucleotides (25-80 nt) attached to glass
    support
  • Oligonucleotides (25-30 nt) synthesized in situ
    on silica wafers (Affymetrix)
  • Probes attached to tagged beads

15
Areas Being Studied with Microarrays
  • Differential gene expression between two (or
    more) sample types
  • Similar gene expression across treatments
  • Tumor sub-class identification using gene
    expression profiles
  • Classification of malignancies into known classes
  • Identification of marker genes that
    characterize different tumor classes
  • Identification of genes associated with clinical
    outcomes (e.g. survival)

16
Applications
  • Pathway Inference
  • (Gene regulatory network prediction)
  • Disease detection
  • (probe detection classification)

17
Principal uses of chips
  • Genome-scale gene expression analysis
  • Differentiation
  • Responses to environmental factors
  • Disease processes
  • Effects of drugs
  • Detection of sequence variation
  • Genetic typing
  • Detection of somatic mutations (e.g. in
    oncogenes)
  • Direct sequencing

18
cDNA chips
  • Probes are cDNA fragments, usually amplified by
    PCR
  • Probes are deposited on a solid support, either
    positively charged nylon or glass slide
  • Samples (normally poly(A) RNA) are labelled
    using fluorescent dyes
  • At least two samples are hybridized to chip
  • Fluorescence at different wavelengths measured by
    a scanner

19
Standard protocol for comparative hybridization
20
cDNA microarray experiments
  • mRNA levels compared in many different contexts
  • Different tissues, same organism (brain v.
    liver)
  • Same tissue, same organism (treatment v. control, tumor v.
    non-tumor)
  • Same tissue, different organisms (wild type v. knock-out,
    transgenic, or mutant)
  • Time course experiments (effect of treatment,
    development)
  • Other special designs (e.g. to detect spatial
    patterns).

21
Web animation of a cDNA microarray experiment
http://www.bio.davidson.edu/courses/genomics/chip/chip.html
(DNA Microarray Technique)
22
Yeast genome on a chip
23
Brief outline of steps for producing a microarray
  • cDNA probes attached or synthesized to solid
    support
  • Hybridize targets
  • Scan array

24
  • Using Microarray with cDNA or oligonucleotide

Building the Chip: MASSIVE PCR, PCR PURIFICATION and PREPARATION, PREPARING SLIDES, PRINTING, POST PROCESSING
Preparing RNA: CELL CULTURE AND HARVEST, RNA ISOLATION, cDNA PRODUCTION
Hybing the Chip: ARRAY HYBRIDIZATION, PROBE LABELING, DATA ANALYSIS
25
cDNA microarrays
cDNA clones
26
cDNA microarrays
  • Compare the genetic expression in two samples of
    cells

PRINT: cDNA from one gene on each spot
SAMPLES: cDNA labelled red/green,
e.g. treatment / control, normal / tumor
tissue
27
HYBRIDIZE: Add equal amounts of labelled cDNA
samples to microarray.
SCAN: Laser, Detector
28
Quantification of expression
  • For each spot on the slide we calculate
  • Red intensity = Rfg - Rbg
  • (fg = foreground, bg = background) and
  • Green intensity = Gfg - Gbg
  • and combine them in the log (base 2) ratio
  • Log2( Red intensity / Green intensity)
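
The calculation on this slide can be sketched in a few lines of Python. This is a minimal illustration, not part of the original slides: the function name `log_ratio`, the example readings, and the choice to return `None` when a background-corrected intensity is not positive are all assumptions made here.

```python
import math

def log_ratio(r_fg, r_bg, g_fg, g_bg):
    """Return log2(red / green) for one spot after background correction."""
    red = r_fg - r_bg      # Red intensity = Rfg - Rbg
    green = g_fg - g_bg    # Green intensity = Gfg - Gbg
    if red <= 0 or green <= 0:
        # Assumption: the log ratio is undefined for non-positive
        # corrected intensities, so such spots are reported as None.
        return None
    return math.log2(red / green)

# Hypothetical readings for a single spot (not taken from the slides).
print(log_ratio(r_fg=1200, r_bg=200, g_fg=700, g_bg=200))  # 1.0
```

A value of 1.0 means the red channel is twice as bright as the green channel after background subtraction; 0 means equal intensity.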

29
Gene Expression Data
  • Data on p genes for n slides: p is O(10,000), n is
    O(10-100), but growing.

Genes (rows) by slides (columns):
        slide 1  slide 2  slide 3  slide 4  slide 5  ...
    1     0.46     0.30     0.80     1.51     0.90   ...
    2    -0.10     0.49     0.24     0.06     0.46   ...
    3     0.15     0.74     0.04     0.10     0.20   ...
    4    -0.45    -1.03    -0.79    -0.56    -0.32   ...
    5    -0.06     1.06     1.35     1.09    -1.09   ...

Each entry, e.g. the gene expression level of gene 5 in slide 4,
is Log2( Red intensity / Green intensity).
These values are conventionally displayed on a
red (> 0), yellow (0), green (< 0) scale.
30
cDNA chip design
  • Probe selection
  • Non-redundant set of probes
  • Includes genes of interest to project
  • Corresponds to physically available clones
  • Chip layout
  • Grouping of probes by function
  • Correspondence between wells in microtitre plates
    and spots on the chip

31
Glass chip manufacturing
  • Choice of coupling method
  • Physical (charge), non-specific chemical,
    specific chemical (modified PCR primer)
  • Choice of printing method
  • Mechanical pins: flat tip, split tip, pin and ring
  • Piezoelectric deposition (ink-jet)
  • Robot design
  • Precision of movement in 3 axes
  • Speed and throughput
  • Number of pins, numbers of spots per pin load

32
Labeling and hybridization
  • Targets are normally prepared by oligo(dT) primed
    cDNA synthesis
  • Probes should contain 3' end of mRNA
  • Need Cot-1 DNA as competitor
  • Specific activity will limit sensitivity of assay
  • Alternative protocol is to make ds cDNA
    containing bacterial promoter, then cRNA
  • Can work with smaller amount of RNA
  • Less quantitative
  • Hybridization usually under coverslips

33
Scanning the arrays
  • Laser scanners
  • Excellent spatial resolution
  • Good sensitivity, but can bleach fluorochromes
  • Still rather slow
  • CCD scanners
  • Spatial resolution can be a problem
  • Sensitivity easily adjustable (exposure time)
  • Faster and cheaper than lasers
  • In all cases, raw data are images showing
    fluorescence on surface of chip

34
Microarray data on the Web
  • Many groups have made their raw data available,
    but in many formats
  • Some groups have created searchable databases
  • There are several initiatives to create unified
    databases
  • EBI ArrayExpress
  • NCBI Gene Expression Omnibus
  • Companies are beginning to sell microarray
    expression data (e.g. Incyte)

35
Bioinformatics of microarrays
  • Array design: choice of sequences to be used as
    probes
  • Experimental design
  • Analysis of scanned images
  • Spot detection, normalization, quantitation
  • Primary analysis of hybridization data
  • Basic statistics, reproducibility, data
    scattering, etc.
  • Comparison of multiple samples
  • Clustering, SOMs, classification
  • Sample tracking and databasing of results

36
Biological question: differentially expressed
genes, sample class prediction, etc.
Experimental design
Microarray experiment
16-bit TIFF files
Image analysis
(Rfg, Rbg), (Gfg, Gbg)
Normalization
R, G
Estimation
Testing
Clustering
Discrimination
Biological verification and interpretation
37
Microarray Image Analysis
  • Quantitation of fluorescence signals

38
Scanner
[Figure: scanner optics, with labels PMT, pinhole, detector lens, laser, beam-splitter, objective lens, dye, glass slide]
39
Images from scanner
  • Resolution
  • standard 10 µm currently, max 5 µm
  • 100 µm spot on chip = 10 pixels in diameter
  • Image format
  • TIFF (tagged image file format), 16 bit (65536
    levels of grey)
  • 1 cm x 1 cm image at 16 bit = 2 MB (uncompressed)
  • other formats exist, e.g. SCN (used at Stanford
    University)
  • Separate image for each fluorescent sample
  • channel 1, channel 2, etc.
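
The figures on this slide (10 pixels across a 100 µm spot, about 2 MB for a 1 cm x 1 cm scan at 10 µm and 16 bit) can be checked with a short calculation. The helper name `image_size_bytes` is hypothetical and introduced only for this sketch.

```python
def image_size_bytes(width_cm, height_cm, pixel_um, bits_per_pixel=16):
    """Uncompressed size of a single-channel scan, in bytes."""
    pixels_x = round(width_cm * 10_000 / pixel_um)   # 1 cm = 10,000 um
    pixels_y = round(height_cm * 10_000 / pixel_um)
    return pixels_x * pixels_y * bits_per_pixel // 8

print(image_size_bytes(1, 1, pixel_um=10))  # 2000000 bytes, i.e. about 2 MB
print(image_size_bytes(1, 1, pixel_um=5))   # 8000000 bytes at the 5 um maximum
print(round(100 / 10))                      # 10 pixels across a 100 um spot
```

Halving the pixel size quadruples the file size, and each fluorescent channel adds another image of the same size.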

40
Images examples
41
Practical Problems 1
  • Comet Tails
  • Likely caused by insufficiently rapid immersion
    of the slides in the succinic anhydride blocking
    solution.

42
Practical Problems 2
43
Practical Problems 3
  • High Background
  • 2 likely causes
  • Insufficient blocking.
  • Precipitation of the labeled probe.
  • Weak Signals

44
Practical Problems 4
Spot overlap. Likely cause: too much
rehydration during post-processing.
45
Practical Problems 5
Dust
46
Processing of images
  • Addressing or gridding
  • Assigning coordinates to each of the spots
  • Segmentation
  • Classification of pixels either as foreground or
    as background
  • Intensity determination for each spot
  • Foreground fluorescence intensity pairs (R, G)
  • Background intensities
  • Quality measures

47
Addressing
  • The measurement process depends on the addressing
    procedure
  • Addressing efficiency can be enhanced by allowing
    user intervention (slow!)
  • Most software systems now provide for both manual
    and automatic gridding procedures

Registration
48
Problems in automatic addressing
  • Misregistration of the red and green channels
  • Rotation of the array in the image
  • Skew in the array

Rotation
49
Segmentation
  • Segmentation methods
  • Fixed circle segmentation
  • Adaptive circle segmentation
  • Adaptive shape segmentation
  • Histogram segmentation

50
Information Extraction
  • Spot Intensities
  • mean (pixel intensities).
  • median (pixel intensities).
  • Background values
  • Local
  • Morphological opening
  • Constant (global)
  • None
  • Quality Information
  • Area
  • Circularity
  • Signal to Noise ratio
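
As a rough illustration of how these quantities relate, the sketch below applies fixed circle segmentation to a tiny synthetic image and reports the foreground mean and median, a background median, and the spot area. The function name `spot_summary`, the toy image, and the use of every pixel outside the circle as background are assumptions made for this example; real image-analysis packages use more careful background regions.

```python
from statistics import mean, median

def spot_summary(image, cx, cy, radius):
    """Fixed circle segmentation: pixels inside the circle are foreground."""
    fg, bg = [], []
    for y, row in enumerate(image):
        for x, value in enumerate(row):
            inside = (x - cx) ** 2 + (y - cy) ** 2 <= radius ** 2
            (fg if inside else bg).append(value)
    return {
        "fg_mean": mean(fg),
        "fg_median": median(fg),
        "bg_median": median(bg),   # assumption: everything outside is background
        "area": len(fg),           # number of foreground pixels
    }

# Toy 5 x 5 image: background 100, a small spot at 900, one bright dust pixel.
image = [[100] * 5 for _ in range(5)]
for x, y in [(2, 2), (1, 2), (3, 2), (2, 1)]:
    image[y][x] = 900
image[3][2] = 5000  # dust speck inside the spot

print(spot_summary(image, cx=2, cy=2, radius=1))
```

In this toy case the single dust pixel pulls the foreground mean well above the median, which is one reason both summaries are commonly reported.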

Take the average
51
Quantification of expression
  • For each spot on the slide we calculate
  • Red intensity = Rfg - Rbg
  • (fg = foreground, bg = background), and
  • Green intensity = Gfg - Gbg
  • and combine them in the log (base 2) ratio
  • Log2( Red intensity / Green intensity)

52
Microarray Data Normalization
  • Why?
  • To correct for systematic differences between
    samples on the same slide, or between slides,
    which do not represent true biological variation
    between samples.
  • How do we know it is necessary?
  • By examining self-self hybridizations, where no
    true differential expression is occurring.
  • We find dye biases which vary with overall spot
    intensity, location on the array, plate origin,
    pins, scanning parameters, ...
  • Goals
  • - Reduces systematic (not random) effects
  • - Makes it possible to compare several arrays

53
(No Transcript)
54
  • Intensity-dependent normalization
  • Here, run a line through the middle of the MA
    plot, shifting the M value of the pair (A, M) by
    c = c(A), i.e.
  • log2 R/G -> log2 R/G - c(A)
  • One estimate of c(A) is made using the LOWESS
    function of Cleveland (1979): LOcally WEighted
    Scatterplot Smoothing.

A = (log R + log G)/2, M = log R - log G
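
A minimal sketch of intensity-dependent normalization follows. It is not Cleveland's full LOWESS: it fits a tricube-weighted local line around each point and omits the robustness iterations, and the function names and the `frac` parameter are assumptions introduced here. It takes M as the log ratio log2 R - log2 G that the slide shifts by c(A).

```python
import math

def ma_values(red, green):
    """Convert background-corrected intensities to (A, M) values."""
    a = [(math.log2(r) + math.log2(g)) / 2 for r, g in zip(red, green)]
    m = [math.log2(r) - math.log2(g) for r, g in zip(red, green)]
    return a, m

def local_linear_fit(x, y, frac=0.5):
    """Simplified LOWESS: local weighted line, no robustness iterations."""
    n = len(x)
    k = min(n, max(2, math.ceil(frac * n)))   # neighbourhood size
    fitted = []
    for x0 in x:
        dists = sorted(abs(xi - x0) for xi in x)
        h = dists[k - 1] or 1e-12             # bandwidth: k-th nearest distance
        w = [max(0.0, 1 - (abs(xi - x0) / h) ** 3) ** 3 for xi in x]  # tricube
        sw = sum(w)
        mx = sum(wi * xi for wi, xi in zip(w, x)) / sw
        my = sum(wi * yi for wi, yi in zip(w, y)) / sw
        sxx = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x))
        sxy = sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, x, y))
        slope = sxy / sxx if sxx > 0 else 0.0
        fitted.append(my + slope * (x0 - mx))  # this is c(A) at A = x0
    return fitted

def normalize(red, green, frac=0.5):
    """Return A and the shifted M values, M - c(A)."""
    a, m = ma_values(red, green)
    c = local_linear_fit(a, m, frac)
    return a, [mi - ci for mi, ci in zip(m, c)]

# Hypothetical self-self hybridization with a constant dye bias of 0.5 in M.
green = [2.0 ** t for t in range(6, 14)]
red = [g * 2 ** 0.5 for g in green]
a, m_norm = normalize(red, green)
print([round(v, 6) for v in m_norm])
```

After normalization the M values of this artificial self-self comparison sit at zero, which is the behaviour the slide describes for a line run through the middle of the MA plot.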
55
Normalization by controls: Microarray Sample Pool
titration series
  • Pool the whole library
  • Control set to aid intensity-dependent normalization
  • Different concentrations in titration series
  • Spotted evenly spread across the slide in each pin-group
56
Differential Expression: Which genes have changed?
  • Goal
  • Identify genes associated with covariate or
    response of interest
  • Examples
  • Qualitative covariates or factors: treatment,
    cell type, tumor class
  • Quantitative covariate: dose, time
  • Responses: survival, cholesterol level
  • Any combination of these!

57
cDNA gene expression data
  • Data on G genes for n samples

Genes (rows) by mRNA samples (columns):
        sample1  sample2  sample3  sample4  sample5  ...
    1     0.46     0.30     0.80     1.51     0.90   ...
    2    -0.10     0.49     0.24     0.06     0.46   ...
    3     0.15     0.74     0.04     0.10     0.20   ...
    4    -0.45    -1.03    -0.79    -0.56    -0.32   ...
    5    -0.06     1.06     1.35     1.09    -1.09   ...

Gene expression level of gene i in mRNA sample j
= (normalized) Log( Red intensity / Green intensity)
58
An expression profile like this for each gene
[Figure: mRNA ratio r (Cy5/Cy3) against time / h from the start of the experiment, with regions marked up-regulation (induction) and down-regulation (repression)]
59
Co-Regulation -- Inference of function: Genes
belonging to the same pathway often show
the same regulatory patterns (profiles) for a
variety of biological situations (or in a time
series). Hence, as a hypothesis, genes of unknown
function showing similar regulatory behaviour to
some genes of known function may have a similar
function.
  • Which genes are differentially expressed ?
  • Which genes are expressed in a similar way
    when comparing to expression profiles of genes
    with known function?
  • (co-regulation)
  • patterns of expression (diagnostic
    Fingerprinting)
  • Reverse Engineering of genetic networks

Differential expression: When comparing the
transcriptomes of two different biological
samples (e.g. control, heat-shock), you are
interested in the subset of genes which are
expressed at different levels (up-/down-regulated).
Expression-Fingerprinting: Often in medical
applications it is of interest to characterize
the biological status of cells, e.g. the
severity of tumor cells, to be able to respond
with the right therapy.
Reverse Engineering: Using expression data to
infer regulatory interactions between a number of
genes responsible for a certain adaptation
process or developmental process.
60
Common methods
  • t-Test
  • Fisher
  • Golub

61
Common methods (II)
  • TNOM
  • Wilcoxon
  • WEPO

62
Some disadvantages of common strategies
  • Parametric methods are not robust enough.
  • They use the actual levels of the observations.
  • Estimates of the mean and standard deviation can be
    misled by outliers.
  • Nonparametric methods are not sensitive enough.
  • They use the ranks of the observations instead.
  • Many different patterns receive the same score.
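
Both disadvantages can be seen on a toy example. The sketch below compares a two-sample t statistic (Welch form, chosen here as an assumption since the slides do not say which variant is meant) with a Wilcoxon rank-sum statistic on made-up data for one gene. The data values and function names are hypothetical, and the rank-sum helper does not handle ties.

```python
import math
from statistics import mean, variance

def welch_t(x, y):
    """Two-sample t statistic with unequal variances (uses actual levels)."""
    return (mean(x) - mean(y)) / math.sqrt(
        variance(x) / len(x) + variance(y) / len(y)
    )

def rank_sum(x, y):
    """Wilcoxon rank-sum statistic: sum of ranks of x in the pooled data.

    Assumes all values are distinct (ties are not handled).
    """
    pooled = sorted(x + y)
    return sum(pooled.index(v) + 1 for v in x)

# Hypothetical log ratios for one gene in two classes.
control = [0.10, 0.22, 0.05, 0.31, 0.15]
tumor = [1.10, 1.32, 0.95, 1.21, 1.05]
tumor_outlier = [1.10, 1.32, 0.95, 1.21, 9.00]  # one extreme measurement

print(welch_t(tumor, control), rank_sum(tumor, control))
print(welch_t(tumor_outlier, control), rank_sum(tumor_outlier, control))
```

The single outlier inflates the variance estimate and shrinks the t statistic sharply even though the classes remain perfectly separated, while the rank-sum statistic is identical for both data sets and so cannot distinguish them.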

63
Weighted Punishment on Overlap (WEPO)
  • Combine heuristics from para- and non-parametric
    methods.
  • If a gene is differentially expressed, the
    expression value of different groups should come
    from quite different distributions.

64
The better the genes, the less the overlap
  • Score each gene via estimating the overlapped
    regions of these classes.
  • To prevent information loss and maintain
    robustness.

65
Formula of Weighted Punishment
Where
66
An Example
67
Another Example
68
Other Advanced Microarray data analysis
  • Clustering and pattern detection
  • Data mining and visualization
  • Controls and normalization of results
  • Statistical validation
  • Linkage between gene expression data and gene
    sequence/function/metabolic pathways databases
  • Discovery of common sequences in co-regulated
    genes
  • Meta-studies using data from multiple experiments

69
(No Transcript)
70
Cluster analysis
  • Used to find groups of objects when not already
    known
  • Unsupervised learning
  • Associated with each object is a set of
    measurements (the feature vector)
  • Aim is to identify groups of similar objects on
    the basis of the observed measurements

71
(No Transcript)
72
Clustering Gene Expression Data
  • Can cluster genes (rows), e.g. to (attempt to)
    identify groups of co-regulated genes
  • Can cluster samples (columns), e.g. to identify
    tumors based on profiles
  • Can cluster both rows and columns at the same time

73
Clustering Gene Expression Data
  • Leads to readily interpretable figures
  • Can be helpful for identifying patterns in time
    or space
  • Useful (essential?) when seeking new subclasses
    of samples
  • Can be used for exploratory purposes

74
Types of Clustering
  • Hierarchical
  • Link similar genes, build up to a tree of all
  • K-means
  • Partition genes into a prespecified number
    of groups, K
  • Self Organizing Maps (SOM)
  • Split all genes into similar sub-groups
  • Finds its own groups (machine learning)
  • Principal Component
  • every gene is a dimension (vector), find a single
    dimension that best represents the differences in
    the data

75
Hierarchical clustering
76
Hierarchical clustering (continued)
To transform the genes x experiments matrix into a
genes x genes matrix, use a gene similarity
metric (Eisen et al. 1998, PNAS 95:14863-14868).
It is exactly the same as Pearson's correlation except for the
underlined term (the offset).
Here Gi equals the (log-transformed) primary data
for gene G in condition i, for any two genes X
and Y observed over a series of N conditions.
Goffset is set to 0, corresponding to a
fluorescence ratio of 1.0
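
The formula itself did not survive in this transcript. The sketch below assumes the form given in the cited Eisen et al. (1998) paper, in which each value is taken relative to an offset and scaled by the root mean square deviation from that offset; with the offset equal to the mean it reduces to Pearson's correlation. Function names and example profiles are hypothetical.

```python
import math

def eisen_similarity(x, y, x_offset=0.0, y_offset=0.0):
    """Similarity of two profiles relative to fixed offsets (0 = ratio of 1.0).

    Assumes neither profile is constant at its offset (no division by zero).
    """
    n = len(x)
    phi_x = math.sqrt(sum((xi - x_offset) ** 2 for xi in x) / n)
    phi_y = math.sqrt(sum((yi - y_offset) ** 2 for yi in y) / n)
    return sum(
        ((xi - x_offset) / phi_x) * ((yi - y_offset) / phi_y)
        for xi, yi in zip(x, y)
    ) / n

def pearson(x, y):
    """Pearson's correlation: the same formula with offsets set to the means."""
    return eisen_similarity(x, y, sum(x) / len(x), sum(y) / len(y))

# Two hypothetical log-ratio profiles, both induced in every condition,
# but with opposite trends across conditions.
x = [1.0, 2.0, 3.0]
y = [3.0, 2.0, 1.0]
print(pearson(x, y))           # close to -1: opposite trends
print(eisen_similarity(x, y))  # about 0.71: both genes are consistently induced
```

With the offset fixed at 0, two genes that are induced in every condition count as similar even when their trends differ, which is the practical difference from the mean-centred Pearson correlation.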
77
Hierarchical clustering (continued)
Pearson's correlation example
What if genome expression is clustered based on
negative correlation?
78
Hierarchical clustering (continued)
79
Hierarchical Clustering
3 clusters?
2 clusters?
80
K-means clustering
This method differs from the hierarchical
clustering in many ways. In particular,
  • There is no hierarchy; the data are partitioned. You
    will be presented only with the final cluster
    membership for each case.
  • There is no role for the dendrogram in k-means clustering.
  • You must supply the number of clusters (k) into which the
    data are to be grouped.
81
K-means clustering (continued)
Step 1: Transform the n (genes) x m (experiments)
matrix into an n (genes) x n (genes) distance matrix
Step 2: Cluster genes based on a k-means
clustering algorithm
82
K-means clustering (continued)
To transform the n x m matrix into an n x n matrix, use
a similarity (distance) metric
(Tavazoie et al., Nature Genetics, 1999
Jul; 22(3):281-5).
Euclidean distance: d(X, Y) = sqrt( sum over i = 1..M of (Xi - Yi)^2 ),
for any two genes X and Y observed over a
series of M conditions.
83
K-means clustering(continued)
84
K-means clustering algorithm
Step 1: Suppose the gene expression
patterns are positioned in a two-dimensional
space based on a distance matrix.
Step 2: The first cluster center (red) is chosen
randomly, and then subsequent centers are chosen
by finding the data point farthest from the
centers already chosen. In this example, k = 3.
85
K-means clustering algorithm (continued)
Step 3: Each point is assigned to the
cluster associated with the closest
representative center.
Step 4: Minimize the within-cluster sum of
squared distances from the cluster mean by
moving the centroid (star points), that is,
computing a new cluster representative.
86
K-means clustering algorithm (continued)
Step 5: Repeat steps 3 and 4 with the new
representatives.
Run steps 3, 4 and 5 until no further changes
occur.
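
Steps 2 to 5 can be sketched as follows. This is an illustration under stated assumptions, not the implementation used for the slides: "farthest from the centers already chosen" is read here as the point whose distance to its nearest chosen center is largest, Euclidean distance is used, and the function names, the `seed` parameter and the toy points are all hypothetical.

```python
import math
import random

def dist(p, q):
    """Euclidean distance between two points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def farthest_first_centers(points, k, rng):
    """Step 2: random first center, then repeatedly take the farthest point."""
    centers = [rng.choice(points)]
    while len(centers) < k:
        centers.append(
            max(points, key=lambda p: min(dist(p, c) for c in centers))
        )
    return centers

def kmeans(points, k, seed=0, max_iter=100):
    rng = random.Random(seed)
    centers = farthest_first_centers(points, k, rng)
    labels = None
    for _ in range(max_iter):
        # Step 3: assign each point to the closest center.
        new_labels = [
            min(range(k), key=lambda j: dist(p, centers[j])) for p in points
        ]
        if new_labels == labels:   # no further changes: stop
            break
        labels = new_labels
        # Step 4: move each center to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centers[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return labels, centers

# Toy two-dimensional data with three well-separated groups, k = 3.
points = [(0, 0), (0, 1), (1, 0),
          (10, 10), (10, 11), (11, 10),
          (0, 10), (1, 10), (0, 11)]
labels, centers = kmeans(points, k=3)
print(labels)
print(centers)
```

On data this well separated the three groups are recovered regardless of which point is drawn first; on real expression data the result can depend on the starting center and on the choice of k.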
87
K-means Clustering
The intended clusters are found.
88
Web links
  • Leming Shi's Gene-Chips.com page: very rich
    source of basic information and commercial and
    academic links
  • DNA chips for dummies: animation
  • A step by step description of a microarray
    experiment by Jeremy Buhler
  • The Big Leagues: Pat Brown and NHGRI microarray
    projects

89
Mini-Review: How to make a cDNA microarray
90
Glass slide: array of bound cDNA probes; 4x4
blocks = 16 print-tip groups
91
Microarray Experiment
92
Hybridization: binding cDNA samples (targets) to
cDNA probes on slide
cover slip
Hybridise for 5-12 hours
93
(No Transcript)
94
Quantification of expression
  • For each spot on the slide we calculate
  • Red intensity = Rfg - Rbg
  • (fg = foreground, bg = background), and
  • Green intensity = Gfg - Gbg
  • and combine them in the log (base 2) ratio
  • Log2( Red intensity / Green intensity)

95
Some Considerations for cDNA Microarray
Experiments (I)
  • Scientific (Aims of the experiment)
  • Specific questions and priorities
  • How will the experiments answer the questions
  • Practical (Logistic)
  • Types of mRNA samples: reference, control,
    treatment, mutant, etc.
  • Source and Amount of material (tissues, cell
    lines)
  • Number of slides available

96
Some Considerations for cDNA Microarray
Experiments (II)
  • Other Information
  • Experimental process prior to hybridization:
    sample isolation, mRNA extraction, amplification,
    labelling, ...
  • Controls planned: positive, negative, ratio,
    etc.
  • Verification method: Northern, RT-PCR, in situ
    hybridization, etc.

97
Experimental Design
  • Ensure questions of interest can be answered
    accurately, under some constraints
  • Cost, number of slides
  • Biological material, availability of mRNA

98
Combining data across slides
  • Data on m genes for n hybridizations

99
The design issue here
  • Determine which mRNAs are to be labeled with
    which fluor, and which are to be hybridized
    together on the same slide.
  • i.e., how the samples are paired onto arrays.

100
Graphical Representation
  • Multi-digraph
  • Vertices: mRNA samples
  • Edges: hybridization
  • Direction: dye assignment

Cy3 sample
Cy5 Sample
101
Graphical representation of design
Cy3
Cy5
Box 2, Yang and Speed (2002)
102
Design
[Figure: layers of the design. Treatments: A, B. Replicates: 1, 2, 3, 4 (RNA1, RNA2, RNA3, RNA4). Dyes: R and G for each RNA sample. Arrays.]
103
Reference Design (Kerr and Churchill, 2000)
  • v varieties of interest
  • Array v

104
Loop Design (Kerr and Churchill, 2000)
  • v varieties of interest
  • Array v

105
Comparing K treatments
  • Common reference design
  • Extensibility
  • All-pairs design
  • Better in precision
  • Comparison within slides

106
Natural design choice
C
  • Case 1: Meaningful biological control (C)
  • Samples: Liver tissue from four mice treated by
    cholesterol modifying drugs.
  • Question 1: Genes that respond differently
    between the T and the C.
  • Question 2: Genes that responded similarly across
    two or more treatments relative to control.
  • Case 2: Use of universal reference
  • Samples: Different tumor samples.
  • Question: To discover tumor subtypes.

107
Extensibility
  • Universal common reference for arbitrary
    undetermined number of (future) experiments
  • Provides extensibility of the series of
    experiments (within and between labs)
  • Linking experiments necessary if common reference
    source diminished/depleted

108
On Graphical Representation
  • 2 mRNA samples can be compared if there is a path
  • The precision depends on the number of paths
  • Direct comparisons within slides more precise
    than indirect ones

109
Treatment vs Control
  • Two samples
  • e.g. KO vs. WT or mutant vs. WT

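Direct (T and C on the same slides)
  • Estimate: average (log (T/C))
  • Variance: σ²/2
Indirect (T vs. Ref and C vs. Ref on separate slides)
  • Estimate: log (T / Ref) - log (C / Ref)
  • Variance: 2σ²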
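
The two variances on this slide (σ²/2 for the direct comparison, 2σ² for the indirect one) follow from a short calculation. It assumes, as a simplification not stated on the slide, that each slide yields one log ratio with independent error of variance σ², and that two slides are used in both designs; the symbols y_1, y_2, y_T and y_C are introduced here for the per-slide log ratios.

```latex
\begin{align*}
\text{Direct:}\quad
  & \operatorname{Var}\!\left(\tfrac{1}{2}(y_1 + y_2)\right)
    = \tfrac{1}{4}\left(\sigma^2 + \sigma^2\right)
    = \frac{\sigma^2}{2},
  && y_1, y_2 = \log(T/C) \text{ on slides 1 and 2} \\
\text{Indirect:}\quad
  & \operatorname{Var}\!\left(y_T - y_C\right)
    = \sigma^2 + \sigma^2
    = 2\sigma^2,
  && y_T = \log(T/\mathrm{Ref}),\; y_C = \log(C/\mathrm{Ref})
\end{align*}
```

With the same two slides, the direct design is therefore four times more precise for the T versus C comparison under these assumptions.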
110
Common reference

A
B
C
Ref
All pairs
111
(No Transcript)
112
(No Transcript)
113
The problem
  • We suppose comparison between all pairs of
    varieties are of equal interest.
  • For the number of arrays budgeted for an
    experiment, which design should we use to gain
    more precision?

114
[Figures: design graphs for V = 5, S = 8; V = 6, S = 8; and V = 7, S = 9]
From puppy (2002)
115
Traditional ways
  • Generate the full sets of non-isomorphic
    connected designs of given the number of arrays
    and samples.
  • Then calculate each average variance.
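
One way the average variance of a single candidate design could be computed is sketched below. It assumes a simple model that the slides do not spell out: each array gives one log ratio between its two samples with independent error of variance σ², and dye effects are ignored. Under that model the variance of a contrast between two samples equals σ² times the effective resistance between them in the design graph. All function names are hypothetical.

```python
from fractions import Fraction
from itertools import combinations

def contrast_variances(n_samples, arrays):
    """Variance, in units of sigma^2, of every pairwise contrast.

    arrays: list of (i, j) sample pairs, one per slide, samples 0..n_samples-1.
    """
    # Laplacian of the design multigraph.
    lap = [[Fraction(0)] * n_samples for _ in range(n_samples)]
    for i, j in arrays:
        lap[i][i] += 1
        lap[j][j] += 1
        lap[i][j] -= 1
        lap[j][i] -= 1
    # Take sample 0 as baseline; invert the reduced Laplacian (Gauss-Jordan).
    m = n_samples - 1
    aug = [[lap[r + 1][c + 1] for c in range(m)]
           + [Fraction(int(r == c)) for c in range(m)]
           for r in range(m)]
    for col in range(m):
        pivot = next((r for r in range(col, m) if aug[r][col] != 0), None)
        if pivot is None:
            raise ValueError("design graph is not connected")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [value / scale for value in aug[col]]
        for r in range(m):
            if r != col:
                factor = aug[r][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    inv = [[Fraction(0)] * n_samples for _ in range(n_samples)]
    for r in range(m):
        for c in range(m):
            inv[r + 1][c + 1] = aug[r][m + c]
    return {(i, j): inv[i][i] + inv[j][j] - 2 * inv[i][j]
            for i, j in combinations(range(n_samples), 2)}

def average_variance(n_samples, arrays, samples=None):
    """Average contrast variance over the samples of interest (default: all)."""
    keep = set(range(n_samples) if samples is None else samples)
    v = contrast_variances(n_samples, arrays)
    chosen = [var for (i, j), var in v.items() if i in keep and j in keep]
    return sum(chosen) / len(chosen)

# Three varieties on three arrays: loop design vs. common reference (sample 0).
print(average_variance(3, [(0, 1), (1, 2), (2, 0)]))                    # 2/3
print(average_variance(4, [(0, 1), (0, 2), (0, 3)], samples=[1, 2, 3]))  # 2
```

Exact fractions are used so that designs can be compared without rounding error; scoring every non-isomorphic design this way is what becomes infeasible as the number of samples grows.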

116
Difficulties
  • 11,716,571 non-isomorphic connected graphs on 10
    nodes. (almost 58 hrs)
  • 1,006,700,565 on 11
  • 164,059,830,476 on 12
  • It's too time-consuming to use such a strategy.

117
Our strategy
  • Using a genetic algorithm (GA) as a smart search method, we don't
    need to explore all designs to get the optimal
    one.
  • For v = 5, a = 8 -> 1.6 secs
  • v = 6, a = 8 -> 2.1 secs
  • v = 7, a = 9 -> 2.7 secs
  • v = 10, a = 20 -> 129.416 secs
  • v = 12, a = 14 -> 136.481 secs
  • v = 13, a = 15 -> 175.388 secs

118
[Figure: design graph on nodes 1-12 for V = 12, S = 24]
119
Statistical model
For any particular gene
B
A
Sample
Expression level
Intensity
G
R
Intensity ? Expression level
120
(No Transcript)
121
(No Transcript)
122
Example Time course
T1
T2
T3
T4
t1
t2
t3
t4
t1, t2, t3, t4: true expression levels
123
T1 VS. T2
124
T1 VS. T2
125
T3 VS. T4
126
(No Transcript)