Title: HWW Gene Expression Experiments: How? Why? What's the problem?
1HWW Gene Expression Experiments
How? Why? What's the problem?
2High Throughput Experiments
Functional Genomics
Bioinformatics
3DNA Hybridization
- The principle: have two denatured DNA strands
bond together, then check the amount of double-stranded product
(fluorescent dye, radioactive label)
- Traditional: Southern/Northern blots (the Western blot is the
protein analogue, based on antibodies rather than hybridization)
- The great advance: microarray DNA chips, built on
automation, materials engineering and computer-aided methods
(including algorithmic solutions)
4History
- cDNA microarrays have evolved from Southern
blots, with clone libraries gridded out on nylon
membrane filters being an important and still
widely used intermediate. Things took off with
the introduction of non-porous solid supports,
such as glass - these permitted miniaturization -
and fluorescence based detection. Currently,
about 20,000 cDNAs can be spotted onto a
microscope slide. The other technology, from Affymetrix,
can produce arrays of 100,000
oligonucleotides on a silicon chip.
5THE PROCESS
Building the Chip: MASSIVE PCR, PCR PURIFICATION and PREPARATION, PREPARING SLIDES, PRINTING, POST PROCESSING
Preparing RNA: CELL CULTURE AND HARVEST, RNA ISOLATION, cDNA PRODUCTION
Hybing the Chip: PROBE LABELING, ARRAY HYBRIDIZATION, DATA ANALYSIS
6Building the Chip
MASSIVE PCR
Full yeast genome: 6,500 reactions
PCR PURIFICATION and PREPARATION
IPA precipitation and EtOH washes, 384-well format
PRINTING
The arrayer: a high-precision spotting device
capable of printing 10,000 products in 14 hrs,
with a plate change every 25 mins
PREPARING SLIDES
Polylysine coating for adhering PCR products to
glass slides
POST PROCESSING
Chemically converting the positively charged polylysine
surface to prevent non-specific hybridization
7Preparing RNA
CELL CULTURE AND HARVEST
Designing experiments to profile
conditions/perturbations/ mutations and carefully
controlled growth conditions
RNA ISOLATION
RNA yield and purity depend on the system.
PolyA isolation is preferable, but total RNA is
usable. Two RNA samples are hybridized per chip.
cDNA PRODUCTION
Single strand synthesis or amplification of RNA
can be performed. cDNA production includes
incorporation of Aminoallyl-dUTP.
8Hybing the Chip
ARRAY HYBRIDIZATION
Cy3- and Cy5-labelled samples are simultaneously
hybridized to the chip. Hybs are performed for 5-12
hours and then the chips are washed.
DATA ANALYSIS
Ratio measurements are determined by quantifying
the Cy3 and Cy5 signals from the 532 nm and 635 nm
scanner channels. Data are uploaded to the appropriate
database, where statistical and other analyses can
then be performed.
PROBE LABELING
Two RNA samples are labelled with Cy3 or Cy5
monofunctional dyes via a chemical coupling to
AA-dUTP. Samples are purified using a PCR
cleanup kit.
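The ratio measurement described under DATA ANALYSIS can be sketched as follows. This is a minimal illustration, not the scanner software's method: the function name `log_ratios` and the `floor` clamp are assumptions added here to avoid taking the log of zero.

```python
import math

def log_ratios(cy5, cy3, floor=1.0):
    """Return log2(Cy5 / Cy3) for each spot.

    `floor` is a hypothetical lower bound on intensity (an assumption,
    not part of the protocol) that avoids division by zero.
    """
    ratios = []
    for red, green in zip(cy5, cy3):
        red = max(red, floor)
        green = max(green, floor)
        ratios.append(math.log2(red / green))
    return ratios

cy5 = [1200.0, 400.0, 800.0]   # 635 nm channel, one value per spot
cy3 = [300.0, 400.0, 1600.0]   # 532 nm channel, same spots
print(log_ratios(cy5, cy3))    # [2.0, 0.0, -1.0]
```

Working on the log2 scale makes a 2-fold increase (+1) and a 2-fold decrease (-1) symmetric around zero.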
9Printing Microarrays
- Print Head
- Plate Handling
- XYZ positioning
- Repeatability Accuracy
- Resolution
- Environmental Control
- Humidity
- Dust
- Instrument Control
- Sample Tracking Software
10Ngai Lab arrayer, UC Berkeley
11Microarray Gridder
12Printing Approaches
- Non-contact
- Piezoelectric dispenser
- Syringe-solenoid ink-jet dispenser
- Contact (using rigid pin tools, similar to filter
array)
- Tweezer
- Split pin
- Micro spotting pin
13Micro Spotting pin
15Practical Problems
- Surface chemistry: an uneven surface may lead to
high background.
- Dipping the pin into a large volume -> pre-printing
to drain off excess sample.
- Spot variation can be due to mechanical
differences between pins. Pins could be clogged
during the printing process.
- Spot size and density depend on surface and
solution properties.
- Pins need good washing between samples to prevent
sample carryover.
16Post Processing Arrays
- Protocol for Post Processing Microarrays
- Hydration/Heat Fixing
- 1. Pick out about 20-30 slides to be processed.
- 2. Determine the correct orientation of the slide,
and if necessary, etch a label on the lower left corner
of the array side.
- 3. On the back of the slide, etch two lines above and
below the center of the array to designate the array area
after processing.
- 4. Pour 100 ml 1X SSC into the hydration tray and
warm on a slide warmer at medium setting.
- 5. Set the slide array side down and observe the spots
until proper hydration is achieved.
- 6. Upon reaching proper hydration, immediately
snap-dry the slide.
- 7. Place slides in rack.
17Practical Problems 1
- Comet Tails
- Likely caused by insufficiently rapid immersion
of the slides in the succinic anhydride blocking
solution.
18Practical Problems 2
19Practical Problems 3
- High Background
- 2 likely causes
- Insufficient blocking.
- Precipitation of the labeled probe.
- Weak Signals
20Practical Problems 4
Spot overlap. Likely cause: too much
rehydration during post-processing.
21Practical Problems 5
Dust
22Steps in Image Processing
1. Addressing: locate the spot centers.
2. Segmentation: classification of pixels as either
signal or background (e.g. using seeded region
growing).
3. Information extraction: for each spot of the
array, calculate signal intensity pairs,
background and quality measures.
23Steps in Image Processing
3. Information Extraction
- Spot Intensities
- mean (pixel intensities).
- median (pixel intensities).
- Pixel variation (IQR of log(pixel intensities)).
- Background values
- Local
- Morphological opening
- Constant (global)
- None
- Quality Information
Signal
Background
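The spot summaries listed on this slide can be sketched for a single spot as below. This is only an illustration under stated assumptions: the function name `spot_summary`, the use of the median of nearby background pixels as the "local" background, and the median-minus-background correction are choices made for this example, not a prescribed method.

```python
import math
import statistics

def spot_summary(signal_pixels, background_pixels):
    """Summarise one spot from its segmented pixels.

    signal_pixels: intensities classified as signal (needs at least 2 values).
    background_pixels: intensities from the local background region.
    """
    logs = [math.log2(p) for p in signal_pixels if p > 0]
    q1, _, q3 = statistics.quantiles(logs, n=4)        # quartiles of log intensities
    background = statistics.median(background_pixels)   # assumed local estimate
    median_signal = statistics.median(signal_pixels)
    return {
        "mean": statistics.mean(signal_pixels),
        "median": median_signal,
        "log_iqr": q3 - q1,                 # pixel variation within the spot
        "background": background,
        "corrected": median_signal - background,
    }

summary = spot_summary([100, 200, 400, 800, 1600], [50, 60, 70])
print(summary["median"], summary["background"], summary["corrected"])
```

The mean is sensitive to a few very bright pixels (dust, for example), which is one reason the median is listed alongside it.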
24Addressing
- This is the process of assigning coordinates
to each of the spots.
- Automating this part of the procedure permits
high throughput analysis.
4 by 4 grids, 19 by 21 spots per grid
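As a rough sketch of what addressing produces, the nominal spot centers for the layout above can be generated from the print geometry. Everything numeric here is hypothetical: the pitches, the origin, and the reading of "19 by 21" as 19 rows by 21 columns are assumptions for illustration. A real addressing step would then refine these nominal positions against the image.

```python
def spot_centers(origin=(0.0, 0.0), grid_pitch=(4500.0, 4500.0),
                 spot_pitch=200.0, grids=(4, 4), spots=(19, 21)):
    """Map (grid_row, grid_col, spot_row, spot_col) -> nominal (x, y).

    All distances are in arbitrary units (hypothetical values).
    """
    centers = {}
    for gr in range(grids[0]):
        for gc in range(grids[1]):
            for r in range(spots[0]):
                for c in range(spots[1]):
                    x = origin[0] + gc * grid_pitch[0] + c * spot_pitch
                    y = origin[1] + gr * grid_pitch[1] + r * spot_pitch
                    centers[(gr, gc, r, c)] = (x, y)
    return centers

centers = spot_centers()
print(len(centers))              # 6384 spots: 4 * 4 * 19 * 21
print(centers[(1, 2, 3, 4)])     # (9800.0, 5100.0)
```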
25Addressing
Registration
26Problems in automatic addressing
- Misregistration of the red and green channels
- Rotation of the array in the image
- Skew in the array
Rotation
27Segmentation methods
- Fixed circles
- Adaptive Circle
- Adaptive Shape
- Edge detection.
- Seeded Region Growing (R. Adams and L. Bischof,
1994): regions grow outwards from the seed
points, preferentially according to the difference
between a pixel's value and the running mean of
values in an adjoining region.
- Histogram Methods
- Adaptive threshold.
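The seeded region growing idea from the list above can be sketched as follows. This is a simplified version, not the published algorithm in full: each candidate pixel's priority is computed once, when it is queued, rather than being updated as the region mean changes, and only 4-connected neighbours are used. The function and label names are illustrative.

```python
import heapq

def seeded_region_growing(image, seeds):
    """image: 2D list of intensities; seeds: {(row, col): label}.

    Returns {(row, col): label} for every pixel reachable from a seed.
    """
    rows, cols = len(image), len(image[0])
    labels = dict(seeds)
    total, count = {}, {}
    for (r, c), lab in seeds.items():
        total[lab] = total.get(lab, 0.0) + image[r][c]
        count[lab] = count.get(lab, 0) + 1
    heap = []

    def push_neighbours(r, c, lab):
        # Queue unlabelled 4-neighbours, keyed by distance to the region mean.
        for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
            if 0 <= nr < rows and 0 <= nc < cols and (nr, nc) not in labels:
                delta = abs(image[nr][nc] - total[lab] / count[lab])
                heapq.heappush(heap, (delta, nr, nc, lab))

    for (r, c), lab in seeds.items():
        push_neighbours(r, c, lab)
    while heap:
        _, r, c, lab = heapq.heappop(heap)
        if (r, c) in labels:
            continue                      # already claimed by some region
        labels[(r, c)] = lab
        total[lab] += image[r][c]         # update the running mean
        count[lab] += 1
        push_neighbours(r, c, lab)
    return labels

image = [[10, 10, 10, 10, 10],
         [10, 90, 95, 10, 10],
         [10, 92, 97, 10, 10],
         [10, 10, 10, 10, 10]]
labels = seeded_region_growing(image, {(0, 0): "background", (1, 1): "spot"})
print(sorted(p for p, lab in labels.items() if lab == "spot"))
# [(1, 1), (1, 2), (2, 1), (2, 2)]
```

Because regions follow pixel similarity rather than a fixed outline, the segmented spot can take any shape.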
28Examples of algorithms and software implementation
29Limitation of fixed circle method
SRG
Fixed Circle
30Limitation of circular segmentation
Results from SRG
31Information Extraction
- Spot Intensities
- mean (pixel intensities).
- median (pixel intensities).
- Background values
- Local
- Morphological opening
- Constant (global)
- None
- Quality Information
Take the average
32Local Backgrounds
33Summary of analysis possibilities
- Determine genes which are differentially
expressed (this task can take many forms
depending on replication, etc.)
- Connect differentially expressed genes to
sequence databases and perhaps carry out further
analyses, e.g. searching for common upstream
motifs
- Overlay differentially expressed genes on pathway
diagrams
- Relate expression levels to other information on
cells, e.g. known tumour types
- Define subclasses (clusters) in sets of samples
(e.g. tumours)
- Identify temporal or spatial trends in gene
expression
- Seek roles for genes on the basis of patterns of
co-expression
- ...much more
- Many challenges: transcriptional regulation
involves redundancy, feedback, amplification, ...
non-linearity
34Microarray Life Cycle
Biological Question -> Sample preparation -> Microarray Reaction -> Microarray Detection -> Data Analysis & Modeling -> (back to) Biological Question
Taken from Schena & Davis
35Oligonucleotide Arrays
36Schadt et al., Journal of Cellular Biochemistry,
2000
37Oligonucleotide Arrays Tech.
- 20 probes per gene, 25 bases each
- Probe size 24x24 micron (contains 10^6 copies of
the probe)
- A probe is either a Perfect Match (PM) or a
Mismatch (MM)
- MM
- the mismatched base is usually at the center of the probe
- Aims to give an estimate of the random hybridization
38Motivation
- Data is noisy, with missing values.
- Each array is scanned separately, in different
settings
- -> To extract biologically meaningful results we
need
- Good expression estimations
- Scale/Normalize across arrays
39What we need
- Image segmentation
- Background/Gradient correction
- Artifact detection
- Allow array to array comparison (scale/normalize)
- Assess gene presence (quantitative Measure)
- Find differentially expressed genes
40Why isn't Normalization Easy?
- No ability to read the mRNA level directly
- Various noise factors -> hard to model exactly.
- Variable biological settings, experiment
dependent.
- Need to differentiate changes caused by
biological signal from noise artifacts.
41Variability Sources
- Real Biology
- Biological noise
- Biological Signal
- Sample preparation related
- Technical dependent
42dChip MBEI
- Based on several papers by Li & Wong (PNAS, 2001,
vol. 98, no. 1, and others)
- Implemented in their freely available dChip
software
- Model-based: the estimation is based on a model
of how the probe intensity values respond to
changes in the expression level of the gene
43dChip Model
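$MM_{ij} = \nu_j + \theta_i \alpha_j + \epsilon$
$PM_{ij} = \nu_j + \theta_i \alpha_j + \theta_i \phi_j + \epsilon$
$i$ is the array index, $j$ is the probe index
$\theta_i$ is the expression level of the gene in array $i$
$\nu_j$ is the baseline response of the probe due to non-specific
hybridization
$\alpha_j$ is the rate of increase of the MM response
$\phi_j$ is the additional rate of increase of the PM
response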
44dChip Reduced Model
$y_{ij} = PM_{ij} - MM_{ij} = \theta_i \phi_j + \epsilon_{ij}$
Basic idea: least squares parameter estimation,
iteratively fitting $\theta$ and $\phi$
45dChip Reduced Model
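For one array, assume that the set $\{\phi_j\}$ has
been learned from a large number of arrays, and is
therefore known and fixed. Given this set, the
linear least squares estimate for $\theta$ is
$\hat{\theta} = \sum_j y_j \phi_j / \sum_j \phi_j^2$, where $y_j = PM_j - MM_j$.
An approximate standard error can be computed for this
estimator.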
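The alternating least squares fit can be sketched as below, assuming the reduced model is fitted to the differences y[i][j] = PM - MM for array i and probe j. This is a minimal sketch and not the dChip implementation: it omits outlier exclusion, uses a fixed iteration count, and fixes the scale with sum(phi^2) = number of probes. The names `fit_reduced_model` and `theta_std_error` are illustrative.

```python
import math

def fit_reduced_model(y, n_iter=50):
    """Fit y[i][j] ~ theta[i] * phi[j] by alternating least squares."""
    n_arrays, n_probes = len(y), len(y[0])
    phi = [1.0] * n_probes                      # satisfies sum(phi^2) = n_probes
    for _ in range(n_iter):
        s = sum(p * p for p in phi)
        # theta given phi: one least squares slope per array
        theta = [sum(y[i][j] * phi[j] for j in range(n_probes)) / s
                 for i in range(n_arrays)]
        t = sum(th * th for th in theta)
        # phi given theta: one least squares slope per probe
        phi = [sum(y[i][j] * theta[i] for i in range(n_arrays)) / t
               for j in range(n_probes)]
        scale = math.sqrt(n_probes / sum(p * p for p in phi))
        phi = [p * scale for p in phi]          # re-impose the constraint
    theta = [sum(y[i][j] * phi[j] for j in range(n_probes)) / n_probes
             for i in range(n_arrays)]
    return theta, phi

def theta_std_error(y_row, theta_i, phi):
    """Approximate standard error of theta for one array, phi treated as known."""
    n_probes = len(phi)
    rss = sum((y_row[j] - theta_i * phi[j]) ** 2 for j in range(n_probes))
    sigma2 = rss / (n_probes - 1)
    return math.sqrt(sigma2 / sum(p * p for p in phi))

phi_true = [x * math.sqrt(4 / 30) for x in (1, 2, 3, 4)]
y = [[t * p for p in phi_true] for t in (100.0, 200.0, 400.0)]
theta, phi = fit_reduced_model(y)
print([round(t, 3) for t in theta])             # close to [100.0, 200.0, 400.0]
```

Arrays or probes with unusually large standard errors are natural candidates for the outlier checks.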
46dChip Reduced Model
- Similarly, we regard the set $\{\theta_i\}$ as known, and
compute a standard error for each $\phi_j$
- We use these estimated standard errors to find outliers and
exclude them from the computation
47dChip Array outlier detection
48dChip Probe outlier detection
49Normalization/Scaling
- We saw how to get the MBEI from dChip, i.e. the measure
quantitation
- We still need to scale the different arrays
- Arrays usually differ in overall image brightness
(they differ in time, place, experimental conditions)
- This is usually done PRIOR to the measure
quantitation manipulations (such as dChip's MBEI, which we
just described).
50Global Normalization/Scaling
- Suppose we have two arrays X, Y with values $x_1 .. x_M$
and $y_1 .. y_M$
- Global normalization (MAS 5): find the constant
$a$ such that $\sum_i y_i = a \sum_i x_i$
- Which means $a = \bar{y} / \bar{x}$
- When we have multiple arrays, we choose Y to
be the average of all arrays, or compute $a$ such that
$\sum_i a x_i$ = constant
Better way: $a(x)$, i.e. make the fit parameter a
function of expression level (as done by dChip)
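Global scaling with a single constant can be sketched as below. This is a simplified illustration rather than the MAS 5 code: the `trim` parameter (fraction of values dropped from each end before averaging) is a hypothetical knob added here, and with `trim=0.0` the factor is simply the ratio of the two array means.

```python
def trimmed_mean(values, trim=0.0):
    """Mean after dropping a fraction `trim` of values from each end."""
    ordered = sorted(values)
    k = int(len(ordered) * trim)
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)

def global_scale(x, y, trim=0.0):
    """Return (a, scaled_x) so that the (trimmed) mean of a*x matches y's."""
    a = trimmed_mean(y, trim) / trimmed_mean(x, trim)
    return a, [a * v for v in x]

x = [100.0, 200.0, 300.0]      # array to be scaled
y = [200.0, 400.0, 600.0]      # reference array (or the average of all arrays)
a, scaled = global_scale(x, y)
print(a, scaled)               # 2.0 [200.0, 400.0, 600.0]
```

A single constant cannot correct brightness differences that depend on intensity, which is the motivation for letting the factor vary with expression level.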
51dChip Normalization/Scaling
- Big question: which genes to use for this
scaling?
- There are various ways to choose the set
- Housekeeping genes (Affy. chips)
- Spiked controls added in various stages of the
experiment, in a range of concentrations
- Both of the above are very good in theory but
(still) not in practice (esp. in Affy chips)
- The result: several approaches have been suggested on how
to use the set of genes tested in the experiments
- We'll review dChip's solution: the Invariant Set
52dChip Invariant Set
- Main idea
- 1. Initialize the set of probes P = all probes
- 2. Order the probes in both arrays by their
expression values
- 3. Give each probe in each array an index according
to its relative expression order
- 4. Find a set of probes P' whose relative order is
similar in both arrays
- 5. Set P = P' and iterate from stage (2) until
convergence
- 6. Use the resulting P to compute a piecewise linear
running median line as the normalization curve
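The iteration above can be sketched as follows. This is a rough sketch, not dChip's actual procedure: the fixed rank-difference tolerance `tol` is a hypothetical stand-in for whatever similarity criterion the software uses, and the curve is built from medians of consecutive bins rather than a true running median.

```python
import statistics

def ranks(values):
    """Rank of each value (0 = smallest)."""
    order = sorted(range(len(values)), key=lambda k: values[k])
    r = [0] * len(values)
    for rank, k in enumerate(order):
        r[k] = rank
    return r

def invariant_set(x, y, tol=0.1, max_iter=20):
    """Indices of probes whose rank is similar in arrays x and y."""
    keep = list(range(len(x)))
    for _ in range(max_iter):
        rx = ranks([x[k] for k in keep])
        ry = ranks([y[k] for k in keep])
        n = len(keep)
        new_keep = [k for k, a, b in zip(keep, rx, ry) if abs(a - b) <= tol * n]
        if len(new_keep) == len(keep):      # converged: nothing removed
            break
        keep = new_keep
    return keep

def median_curve_knots(x, y, keep, window=3):
    """Knots (x, y) of a piecewise linear curve from binned medians."""
    pairs = sorted((x[k], y[k]) for k in keep)
    knots = []
    for start in range(0, len(pairs), window):
        chunk = pairs[start:start + window]
        knots.append((statistics.median(p[0] for p in chunk),
                      statistics.median(p[1] for p in chunk)))
    return knots

x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
y = [2, 4, 6, 8, 10, 12, 14, 16, 18, 1]     # last probe changes rank strongly
keep = invariant_set(x, y, tol=0.2)
print(keep)                                 # [0, 1, 2, 3, 4, 5, 6, 7, 8]
print(median_curve_knots(x, y, keep))       # [(2, 4), (5, 10), (8, 16)]
```

Probes that change rank strongly between arrays are presumed differentially expressed and are kept out of the normalization curve.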
55Normalization Tools Current State
- Commonly Used
- RMA by Speed Lab
- dChip by Li & Wong
- GeneChip MAS5 (Affy. built-in tool)
- The Future
- New chip designs (both Affy. and cDNA) with better
probes, better built-in controls etc.
- New algorithms that make use of probe GC content
(gcRMA), location etc.
- New MAS tool (this year?) is also supposed to
incorporate RMA, dChip etc.
56How to Measure Performance?
- Theoretical Validation: use some theoretical
assumptions and evaluate statistical
characteristics of the method at hand.
- Experimental Validation
- Use public data sets to measure different aspects
of performance
- Evaluate relevant characteristics on your data
set. Design your data set accordingly (if
possible)
57A Benchmark for Affy. Expression Measures
- Main idea: define a universal test set and test
statistics
- Based on 3 publicly available spike-in data sets
- Tests for
- Variability across replicate arrays
- Response of GE measures to change in abundance of
RNA
- Sensitivity of fold change measures to amount of
actual RNA sample
- Accuracy of fold change as a measure of relative
expression
- Usefulness of raw fold change score to detect
differentially expressed genes
Cope et al., Bioinformatics, 03 (Speed's Lab)
58MA Plot
$M = X_1 - X_2$, $A = (X_1 + X_2)/2$, where $X_i$ is the
log2 of the expression measure
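A minimal sketch of computing the MA plot coordinates from two sets of expression measures, one value per probe set. The function name `ma_values` is illustrative.

```python
import math

def ma_values(expr1, expr2):
    """Return a list of (M, A) pairs from two expression measures."""
    out = []
    for e1, e2 in zip(expr1, expr2):
        x1, x2 = math.log2(e1), math.log2(e2)
        out.append((x1 - x2, (x1 + x2) / 2))   # M = log ratio, A = mean log intensity
    return out

print(ma_values([8.0, 4.0], [2.0, 4.0]))       # [(2.0, 2.0), (0.0, 2.0)]
```

For replicate arrays, M should scatter around zero at every value of A.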
59Variance across replicates plot
Test statistics: 1. Median std. 2. Avg. R^2
(squared corr. coef.) between two replicates
60Observed Expression vs. Nominal Expression Plots
Test statistics: fit a linear curve and
compute 1. the linear fit slope (should be 1) 2. R^2
of the linear fit
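The two test statistics can be sketched with an ordinary least squares line, assuming observed and nominal values are both on the log2 scale. The function name `linear_fit` and the sample values are illustrative.

```python
def linear_fit(x, y):
    """Least squares line y = slope * x + intercept, plus R^2."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    syy = sum((b - my) ** 2 for b in y)
    slope = sxy / sxx
    intercept = my - slope * mx
    r2 = (sxy * sxy) / (sxx * syy)      # squared correlation coefficient
    return slope, intercept, r2

nominal = [0.0, 1.0, 2.0, 3.0]          # log2 nominal concentration (made up)
observed = [1.0, 2.0, 3.0, 4.0]         # log2 observed expression (made up)
print(linear_fit(nominal, observed))    # (1.0, 1.0, 1.0)
```

A slope below 1 means the expression measure compresses true concentration differences.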
61ROC Curves
- One of the chief uses of GE arrays is to identify
differentially expressed genes
- ROC (Receiver Operating Characteristic): a
graphical representation of both Sens. and Spec.
as a function of threshold value
- X axis: FPR (1-Spec.)
- Y axis: TPR (Sens.)
- In this case: use fold change as the score,
knowing which probes are spiked or not.
62FC ROC Plots
Here actual TP, FP numbers are used for the
axes. Test statistic: AUC (area under the curve)
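A minimal sketch of building an ROC curve and its AUC from fold change scores when the spiked probes are known. It uses rates rather than raw TP/FP counts, puts the false positive rate on the horizontal axis (the usual convention), and does not treat tied scores specially. The function names are illustrative.

```python
def roc_points(scores, is_spiked):
    """Return (FPR, TPR) points, sweeping the threshold from high to low."""
    ranked = sorted(zip(scores, is_spiked), key=lambda p: -p[0])
    pos = sum(1 for s in is_spiked if s)
    neg = len(is_spiked) - pos
    tp = fp = 0
    points = [(0.0, 0.0)]
    for _, spiked in ranked:
        if spiked:
            tp += 1
        else:
            fp += 1
        points.append((fp / neg, tp / pos))
    return points

def auc(points):
    """Area under the curve by the trapezoid rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))

scores = [3.0, 2.0, 1.0, 0.0]                   # e.g. absolute log fold change
spiked = [True, False, True, False]
print(auc(roc_points(scores, spiked)))          # 0.75
```

An AUC of 1 means every spiked probe scores above every non-spiked one; 0.5 is what random scores would give.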
63FC ROC Plots
Same as before, but only for FC = 2 cases (harder)
64The Benchmark Bottom Line
- 15 parameters used to test performance
- 3 synthetic spike-in data sets
- Automatic submission and evaluation tool;
comparative results at www.biostat.jhsph.edu
65Other Tests
- Evaluate normalization and expression measure
techniques separately (as by Huffman et al.,
Genome Biology, Vol. 3, 2002)
- How do we evaluate performance on our own, very
specific, data? (hint: see next class)