Title: Bioinformatics: Applications
1Bioinformatics Applications
- ZOO 4903
- Fall 2006, MW 1030-1145
- Sutton Hall, Room 312
- Microarrays basic data analysis methods
2Lecture overview
- What we've talked about so far
- Genes & gene expression
- Microarrays: measuring the entire transcriptome
- Overview
- Image processing
- Data normalization
- Statistics
3Image Processing
- 25,000 genes
- 50,000 measurements
- Chips
4Microarray Images
- Resolution
- standard 10 µm (100,000 atoms wide)
- 100 µm spot on chip = 10 pixels in diameter
- Image format
- TIFF 16 bit (64K grey levels)
- 1 cm x 1 cm image at 16 bit = 2 MB (uncompressed)
- Separate image for each fluorescent sample
- channel 1, channel 2
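The figures on this slide can be checked with a few lines of arithmetic. This is a minimal sketch; the function name and its parameters are illustrative, not part of any scanner software.

```python
def image_size_bytes(width_um, height_um, pixel_um=10, bits_per_pixel=16):
    """Uncompressed size in bytes of a scan at the given pixel resolution."""
    pixels_wide = width_um // pixel_um
    pixels_high = height_um // pixel_um
    return pixels_wide * pixels_high * bits_per_pixel // 8

# 1 cm = 10,000 µm, scanned at 10 µm per pixel, 16 bits per pixel
print(image_size_bytes(10_000, 10_000))  # 2000000 bytes, i.e. about 2 MB
print(2 ** 16)                           # 65536 grey levels ("64K")
print(100 // 10)                         # a 100 µm spot is 10 pixels across
```

Each fluorescent channel is stored as its own image, so a two-channel scan of this size needs roughly twice that space.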
5Images are scanned separately and combined
Laser 1
Laser 2
Green channel
Red channel
Overlay images and normalize
Scan and detect with confocal laser system
Image process and analyze
6Image Processing
- Addressing or gridding
- Assigning coordinates to each of the spots
- Segmentation or spot picking
- Classifying pixels either as foreground or as background
- Intensity extraction (for each spot)
- Foreground fluorescence intensity pairs (R, G)
- Background intensities
- Quality measures
7Overview
Raw (combined) image
Gridded
Spots picked & flagged
8Gridding Errors
Spotting errors
Uneven print tip hybridization
Gridding errors
9Spot Picking
- Classification of pixels as foreground or background
- Large selection of methods available; each has strengths & weaknesses
10Spot Picking
- Segmentation/spot picking methods
- Fixed circle segmentation
- Adaptive circle segmentation
- Adaptive shape segmentation
- Histogram segmentation
11Fixed Circle Segmentation
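A minimal sketch of fixed circle segmentation, assuming the spot centre and a single radius are already known from gridding. The function name, the toy 5x5 patch and the pixel values are made up for illustration.

```python
from statistics import mean, median

def fixed_circle_segment(image, cx, cy, radius):
    """Split pixels into foreground (inside the circle) and background."""
    foreground, background = [], []
    for y, row in enumerate(image):
        for x, value in enumerate(row):
            if (x - cx) ** 2 + (y - cy) ** 2 <= radius ** 2:
                foreground.append(value)
            else:
                background.append(value)
    return foreground, background

# Toy patch: a small bright spot on a dim background
image = [
    [10, 10, 10, 10, 10],
    [10, 10, 90, 10, 10],
    [10, 90, 100, 90, 10],
    [10, 10, 90, 10, 10],
    [10, 10, 10, 10, 10],
]
fg, bg = fixed_circle_segment(image, cx=2, cy=2, radius=1)
print(mean(fg), median(fg))  # 92 90
print(median(bg))            # 10
```

The same radius is applied to every spot, which is simple but misassigns pixels whenever a real spot is larger, smaller or off-centre.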
12Adaptive Circle Segmentation
- Circle diameter is estimated separately for each spot
- GenePix finds spots by detecting edges of spots (second derivative)
13Adaptive Circle Segmentation
14Question
- Q: What happens with spot-finding algorithms if the spot on the microarray is irregular (i.e., not a circle)?
15Question
- Q: What happens with spot-finding algorithms if the spot on the microarray is irregular (i.e., not a circle)?
- A: Pixels are misassigned; background is counted as signal and vice versa
16Adaptive Shape Segmentation
Edge detection or seeded region growing: regions grow outwards from the seed points, preferentially according to the difference between a pixel's value and the running mean of values in an adjoining region
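The region-growing idea can be sketched as follows. This is a simplified single-seed variant, not the full algorithm: a neighbouring pixel joins the region when its value is within a hypothetical `tolerance` of the region's running mean. Full seeded region growing uses several seeds (foreground and background) and grows them in order of best match.

```python
from collections import deque

def grow_region(image, seed, tolerance):
    """Grow a region from `seed` (row, col) using 4-connected neighbours."""
    rows, cols = len(image), len(image[0])
    region = {seed}
    total = image[seed[0]][seed[1]]      # running sum of region values
    queue = deque([seed])
    while queue:
        y, x = queue.popleft()
        for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
            if not (0 <= ny < rows and 0 <= nx < cols):
                continue
            if (ny, nx) in region:
                continue
            running_mean = total / len(region)
            if abs(image[ny][nx] - running_mean) <= tolerance:
                region.add((ny, nx))
                total += image[ny][nx]
                queue.append((ny, nx))
    return region

# Toy patch with an irregular (non-circular) spot
image = [
    [10, 10, 10, 10, 10],
    [10, 95, 90, 10, 10],
    [10, 90, 100, 92, 10],
    [10, 10, 10, 94, 10],
    [10, 10, 10, 10, 10],
]
spot = grow_region(image, seed=(2, 2), tolerance=20)
print(sorted(spot))
```

Because the region follows pixel values rather than a fixed outline, the irregular spot is recovered without forcing it into a circle.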
17Information Extraction
- Spot Intensities
- mean (pixel intensities)
- median (pixel intensities)
- Background values
- Local Background
- Morphological opening
- Constant (global)
- Quality Information
Take the average
18Spot morphology does not affect dynamic range
- The red line indicates signal level for non-spiked target.
- Error bars represent one standard deviation for each mean (n = 18) signal
19Spot Intensity
- The total amount of hybridization for a spot is proportional to the total fluorescence at the spot
- Spot intensity = sum of pixel intensities within a spot
- Later calculations are based on ratios between Cy5 and Cy3, so we tally in some way the intensity of the spot
- Can use ratios of medians, means or even modes (if binned)
- Non-specific hybridization is subtracted (estimated from the area outside the spot)
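As a sketch of the ratio-of-medians option, assuming the foreground and background pixels of each channel have already been separated. The function name and the pixel values are invented for illustration.

```python
from statistics import median

def background_corrected_ratio(red_fg, red_bg, green_fg, green_bg):
    """Cy5/Cy3 ratio of medians after subtracting each channel's background."""
    red = median(red_fg) - median(red_bg)
    green = median(green_fg) - median(green_bg)
    return red / green

ratio = background_corrected_ratio(
    red_fg=[400, 420, 380], red_bg=[20, 22, 18],
    green_fg=[210, 200, 220], green_bg=[10, 12, 8],
)
print(ratio)  # (400 - 20) / (210 - 10) = 1.9
```

Swapping `median` for `mean` gives the ratio of means. The median is less affected by a few very bright or very dark pixels.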
20Mean, Median & Mode
Mode
Median
Mean
21Background Intensity
- A spot's measured intensity includes a contribution from non-specific hybridization and other chemicals on the glass
- Fluorescence intensity from regions not occupied by DNA can be different from regions occupied by DNA
22Local Background Detection
- Focuses on small regions around spot mask
- Determine median/mean pixel values in this region
- Most common approach
- By not considering the pixels immediately
surrounding the spots, the background estimate is
less sensitive to the performance of the
segmentation procedure
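The effect of skipping the pixels right next to the spot can be illustrated with a ring (annulus) around the spot mask. This is a sketch on synthetic data: `ring_pixels`, `make_spot_patch` and all pixel values are hypothetical.

```python
from statistics import mean

def ring_pixels(image, cx, cy, inner_radius, outer_radius):
    """Pixels whose distance from (cx, cy) is in (inner_radius, outer_radius]."""
    return [
        value
        for y, row in enumerate(image)
        for x, value in enumerate(row)
        if inner_radius ** 2 < (x - cx) ** 2 + (y - cy) ** 2 <= outer_radius ** 2
    ]

def make_spot_patch(size=7):
    """Toy patch: spot core 100, a halo of bleed-over 40, true background 10."""
    c = size // 2
    patch = []
    for y in range(size):
        row = []
        for x in range(size):
            d2 = (x - c) ** 2 + (y - c) ** 2
            row.append(100 if d2 <= 1 else 40 if d2 <= 4 else 10)
        patch.append(row)
    return patch

patch = make_spot_patch()
# Ring starting right at the spot edge picks up the halo
print(mean(ring_pixels(patch, 3, 3, inner_radius=1, outer_radius=3)))  # 20
# Leaving a gap around the spot gives the true background
print(mean(ring_pixels(patch, 3, 3, inner_radius=2, outer_radius=3)))  # 10
```

With the gap, a segmentation that slightly underestimates the spot size no longer inflates the background estimate.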
23Quality problems
Example spots: irregular spot, comet tail, streaking, high background, low intensity, OK
24Quality Measurements
- Array
- Correlation between duplicate spot intensities
- Percentage of spots with negative signals
- Distribution of actual spot signal area vs. idealized
- Inter-array consistency
- Spot
- Signal / Noise ratio
- Variation in pixel intensities within spots
25Visualizing the expression data
- A pretty picture is not enough
26Log Transformation
[Figure: ch2 intensity for expt A, shown on a linear scale and on a log2 scale]
27Choice of Base is Not Important
log10
ln
28Why Log Transform?
- Makes variation of intensities and ratios of intensities more independent of absolute magnitude
- Evens out highly skewed distributions
- Gives more realistic sense of variation
- Approximates normal distribution
- Treats up- and down-regulated genes symmetrically
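A short sketch of the symmetry point, using made-up ratios: on a linear scale, 4-fold down (0.25) and 4-fold up (4.0) sit at very different distances from 1, but after a log2 transform they are equally far from 0.

```python
import math

ratios = [0.25, 0.5, 1.0, 2.0, 4.0]            # Cy5/Cy3 fold changes
log_ratios = [math.log2(r) for r in ratios]
print(log_ratios)  # [-2.0, -1.0, 0.0, 1.0, 2.0]

# Changing the base only rescales the values by a constant factor
print(math.isclose(math.log10(4.0), math.log2(4.0) / math.log2(10)))  # True
```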
29Log scores are symmetric
Linear: 0.1, 1.0, 10
Log10 (same data): -1, 0, 1
30Log scores better visualize variation in both
directions
31A Microarray Scatter Plot
32Correlation
[Figure: Cy5 (red) intensity vs. Cy3 (green) intensity; a linear relationship compared with a non-linear one showing comet-tailing from non-balanced channels]
33Correlation
Positive correlation, uncorrelated, negative correlation
34Correlation
High correlation
Low correlation
Perfect correlation
35Correlation Coefficient
r = 0.85
r = 0.4
r = 1.0
36Correlation and Outliers
Experimental error or something important?
A single bad point can affect a good
correlation, and the problem with microarrays is
that we are expecting bad points
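To see how much one bad point matters, here is a small sketch that computes the Pearson correlation coefficient with and without a single outlier. The intensity values are invented for illustration.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

cy3 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
cy5_clean = [1.1, 2.0, 2.9, 4.2, 5.0, 6.1]
cy5_outlier = [1.1, 2.0, 2.9, 4.2, 5.0, 0.5]   # last spot is a bad point

r_clean = pearson_r(cy3, cy5_clean)
r_outlier = pearson_r(cy3, cy5_outlier)
print(round(r_clean, 3), round(r_outlier, 3))
```

With only six spots, a single bad point pulls a near-perfect correlation down sharply. Real arrays have many more spots, but also many more bad points.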
37Normal vs. Normal
Normal vs. Tumor
38(R,G) → (M,A) Transformation
Transformed data $\{(M, A)\}_{n = 1..5184}$:
$M = \log_2(R/G)$ (ratio)
$A = \log_2 (RG)^{1/2} = \tfrac{1}{2}\log_2(RG)$ (intensity)
$\Leftrightarrow \; R = (2^{2A+M})^{1/2}, \quad G = (2^{2A-M})^{1/2}$
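The transformation and its inverse can be sketched directly from the definitions on this slide. The function names and the example intensities are illustrative.

```python
import math

def to_ma(r, g):
    """(R, G) -> (M, A): log ratio and average log intensity."""
    m = math.log2(r / g)
    a = 0.5 * math.log2(r * g)
    return m, a

def from_ma(m, a):
    """(M, A) -> (R, G), the inverse of to_ma."""
    r = 2 ** ((2 * a + m) / 2)
    g = 2 ** ((2 * a - m) / 2)
    return r, g

m, a = to_ma(1024, 256)
print(m, a)           # 2.0 9.0
print(from_ma(m, a))  # (1024.0, 256.0)
```

Here red is 4-fold brighter than green, so M = 2, and A = 9 is the midpoint of log2(1024) = 10 and log2(256) = 8.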
39Normalization
Dealing with sources of systematic error
40Sources of Systematic Bias
- Different dye labeling efficiencies
- Scanning (laser and detector, chemistry of the fluorescent label)
- Differences in concentration of DNA on arrays (plate effects)
- Differences in total mRNA in one sample versus another or mRNA degradation
- Printing or tip problems
- Uneven hybridization
41Normalization
- Reduces systematic (multiplicative) differences between the two channels of a single hybridization or differences between hybridizations
- Several methods
- Global mean method
- (Iterative) linear regression method
- Curvilinear methods (e.g. Lowess)
- Variance model methods
Try to get a slope of about 1 and a correlation close to 1
42Example Where Normalization is Needed
43Example Where Normalization is Not Needed
44Normalization to a Global Mean
- Calculate the mean intensity of all spots in channels 1 & 2
- e.g. mean ch2 = 25,000 and mean ch1 = 20,000, so mean ch2 / mean ch1 = 1.25
- On average, spots in ch2 are 1.25X brighter than spots in ch1
- To normalize, multiply spots in ch1 by 1.25
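The global mean method amounts to one scale factor. This is a minimal sketch with three made-up spots per channel, chosen so the channel means match the slide's example (20,000 and 25,000).

```python
from statistics import mean

def global_mean_normalize(ch1, ch2):
    """Scale ch1 so that its mean intensity matches the mean of ch2."""
    factor = mean(ch2) / mean(ch1)
    return factor, [value * factor for value in ch1]

ch1 = [10_000, 20_000, 30_000]   # mean 20,000
ch2 = [12_500, 25_000, 37_500]   # mean 25,000
factor, ch1_scaled = global_mean_normalize(ch1, ch2)
print(factor)      # 1.25
print(ch1_scaled)  # [12500.0, 25000.0, 37500.0]
```

A single factor only corrects a uniform multiplicative difference. It cannot remove a bias that changes with intensity.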
45Normalization by Iterative Linear Regression
- Fit a line (y = mx + b) to the data set
- Set aside outliers (residuals > 2 x SD)
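One way to sketch the iterative fit, assuming ordinary least squares and a 2 x SD cutoff on the residuals. The stopping rule, the `max_rounds` limit and the toy data are assumptions, not taken from the lecture.

```python
from statistics import mean, stdev

def fit_line(points):
    """Least-squares slope and intercept for a list of (x, y) pairs."""
    mean_x = mean(x for x, _ in points)
    mean_y = mean(y for _, y in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def iterative_fit(points, n_sd=2.0, max_rounds=10):
    """Refit after setting aside points with |residual| > n_sd * SD."""
    kept = list(points)
    for _ in range(max_rounds):
        slope, intercept = fit_line(kept)
        residuals = [y - (slope * x + intercept) for x, y in kept]
        cutoff = n_sd * stdev(residuals)
        inliers = [p for p, r in zip(kept, residuals) if abs(r) <= cutoff]
        if len(inliers) == len(kept) or len(inliers) < 3:
            break                      # nothing left to remove
        kept = inliers
    slope, intercept = fit_line(kept)
    return slope, intercept, kept

# Toy data: ch2 is about 1.25 x ch1 with small noise, plus one bad spot at x = 5
points = [(x, 1.25 * x + (0.1 if x % 2 else -0.1)) for x in range(1, 11)]
points[4] = (5, 20.0)
slope, intercept, kept = iterative_fit(points)
print(round(slope, 2), len(kept))
```

The fitted slope then plays the role of the normalization factor, estimated without the spots that were set aside.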
46Background correction or not?
- No background correction necessary
47Prior to Lo(w)ess Normalization
48Global (Loess) Normalization
49A vs. M Plot
[Plot: ratio M = log2(Cy5 / Cy3), centred on 0, vs. average signal A = log2(Cy3 x Cy5)/2]
50Loess Function
Loess function fit line
[Plot: ratio M = log2(Cy5 / Cy3), centred on 0, vs. average signal A = log2(Cy3 x Cy5)/2]
51Data After Normalization
[Plot: ratio M = log2(Cy5 / Cy3), centred on 0, vs. average signal A = log2(Cy3 x Cy5)/2]
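The loess steps shown on the last few slides (fit a smooth curve to M as a function of A, then subtract it) can be sketched in plain Python. This is a simplified local linear fit with tricube weights and no robustness iterations; the `frac` parameter and the synthetic data are assumptions. In practice a statistics package would be used.

```python
def lowess_fit(a_values, m_values, frac=0.5):
    """Local linear regression of M on A, evaluated at each A."""
    n = len(a_values)
    k = max(2, int(frac * n))              # neighbours used per local fit
    fitted = []
    for a0 in a_values:
        dists = sorted(abs(a - a0) for a in a_values)
        h = dists[k - 1] or 1e-12          # bandwidth: k-th nearest distance
        sw = swx = swy = swxx = swxy = 0.0
        for a, m in zip(a_values, m_values):
            u = abs(a - a0) / h
            w = (1 - u ** 3) ** 3 if u < 1 else 0.0   # tricube weight
            sw += w
            swx += w * a
            swy += w * m
            swxx += w * a * a
            swxy += w * a * m
        denom = sw * swxx - swx * swx
        if abs(denom) < 1e-12:             # too few distinct points: use mean
            fitted.append(swy / sw)
        else:
            slope = (sw * swxy - swx * swy) / denom
            intercept = (swy - slope * swx) / sw
            fitted.append(intercept + slope * a0)
    return fitted

def lowess_normalize(a_values, m_values, frac=0.5):
    """Subtract the fitted intensity-dependent trend from each M."""
    trend = lowess_fit(a_values, m_values, frac)
    return [m - t for m, t in zip(m_values, trend)]

# Synthetic intensity-dependent dye bias: M drifts from +0.4 to -0.6
a_values = [6 + 0.5 * i for i in range(21)]
m_values = [1.0 - 0.1 * a for a in a_values]
normalized = lowess_normalize(a_values, m_values)
print(max(abs(m) for m in normalized) < 1e-6)  # True
```

After subtraction the log ratios are centred on 0 across the whole intensity range, which a single global factor cannot achieve.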
52Print-tip Normalization
Print-tip layout
53Scaled Print-tip Normalization
After scaled print-tip normalization
After print-tip normalization
54Non-systematic sources of variability
Noise in the system
55Sources of variability
- Day to day variation
- Organism to organism variation
- Array to array variation
- Cross-hybridization
56The trouble with cross-hybridization
- With cross-hybridization, each probe will signal the presence of multiple sequences other than the one it was designed for
- This skews the observed data away from the expected data.
Observed expression profile vector (cross-hybridized)
Expected expression profile vector (no cross-hybridization)
57Cross-hybridization
Analysis of a cross-hybridization within the
CYP450 superfamily
Xu et al. (2001) Gene
58Xhyb can be computationally predicted (and
avoided)
- Good probe design: avoid spotting genes with areas of significant sequence overlap
- 20 bp of exact sequence identity
- 50 bp of over 90% sequence similarity
- BLAST can be used to check similarity, but keep in mind that its goal is to find optimal local alignments, not the set of areas most likely to xhyb
- Stringent wash to get rid of non-specific binding
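The exact-identity rule can be checked with a longest-common-substring scan. This is a sketch only: it treats the slide's 20 bp figure as a threshold of 20 or more identical bases, compares one strand without reverse complements, and ignores the 50 bp / 90% similarity rule. The sequences are made up.

```python
def longest_shared_run(seq_a, seq_b):
    """Length of the longest stretch of identical bases shared by both."""
    best = 0
    prev = [0] * (len(seq_b) + 1)
    for base_a in seq_a:
        curr = [0] * (len(seq_b) + 1)
        for j, base_b in enumerate(seq_b, start=1):
            if base_a == base_b:
                curr[j] = prev[j - 1] + 1   # extend the run ending here
                best = max(best, curr[j])
        prev = curr
    return best

def may_cross_hybridize(probe, other, min_identical=20):
    """Flag a probe that shares a long exact run with another sequence."""
    return longest_shared_run(probe, other) >= min_identical

shared = "ACGTTGCAAGCTTGACCGTA"            # 20 bases common to both
probe = "TTTT" + shared + "GGGG"
other = "CCCC" + shared + "AAAA"
print(longest_shared_run(probe, other))    # 20
print(may_cross_hybridize(probe, other))   # True
```

For whole-genome probe sets this quadratic scan is too slow, which is where an indexed search tool such as BLAST comes in.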
59Statistics
When does a difference make a difference?
60Statistical significance tests
- Parametric tests
- T-test (P values)
- Significance Analysis of Microarrays (modified t-test, T values)
- ANOVA (>2 arrays, F-value)
- Non-parametric
- Mann-Whitney
- Wilcoxon Rank Sum Test
- Problems? n is always small
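As an illustration of a t-test on a single gene with few replicates, here is a sketch of the t statistic in its unequal-variance (Welch) form. The choice of Welch's variant and the log-ratio values are assumptions. Turning t into a P value needs the t distribution, which a statistics package provides.

```python
from math import sqrt
from statistics import mean, variance

def welch_t(group_a, group_b):
    """t statistic for the difference in means, unequal variances allowed."""
    se_a = variance(group_a) / len(group_a)
    se_b = variance(group_b) / len(group_b)
    return (mean(group_a) - mean(group_b)) / sqrt(se_a + se_b)

control = [1.0, 1.2, 0.8]   # log2 expression, 3 replicate arrays
treated = [2.0, 2.3, 1.7]
print(round(welch_t(control, treated), 2))  # -4.8
```

With n = 3 per group, the variance estimates themselves are very noisy, which is the small-n problem noted above.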
61False Discovery
- Statisticians call a false positive a "type 1 error" or a "false discovery"
- The expected number of false positives is the p-value cutoff of the t-test x the number of genes on the array (assuming no gene truly differs)
- For a p-value of 0.01 x 10,000 genes = 100 falsely "different" genes
- You cannot eliminate FPs, but stringent p-values can keep them manageable (try p < 0.001)
- The number of false positives must be smaller than the number of real differences that you find, which in turn depends on the size of the differences and the variability of the measured expression values
- The False Discovery Rate (FDR) is the expected fraction of the genes called different that are false positives
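The arithmetic on this slide, as a short sketch. The function names are illustrative, and the calculation assumes the worst case in which no gene truly differs between samples.

```python
def expected_false_positives(p_cutoff, n_genes):
    """Expected number of genes passing the cutoff by chance alone."""
    return p_cutoff * n_genes

def false_fraction(expected_fp, n_called):
    """Rough share of the genes called different that are expected to be false."""
    return min(1.0, expected_fp / n_called)

print(round(expected_false_positives(0.01, 10_000)))    # 100
print(round(expected_false_positives(0.001, 10_000)))   # 10
print(false_fraction(10, 200))                          # 0.05
```

If 200 genes pass p < 0.001 and about 10 are expected by chance, roughly 5% of the list is false. If only 12 genes pass, most of the list is probably noise.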
62Bonferroni correction
- To keep an overall p-value of 0.05 when you're essentially taking many, many measurements, you must account for these multiple measurements
- 10,000 genes x p-value of 0.05 = 500 false positives
- The level for statistical significance is divided by the number of measurements
e.g., p
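A sketch of the Bonferroni arithmetic for the 10,000-gene case used on this slide. The function name is illustrative.

```python
import math

def bonferroni_threshold(alpha, n_tests):
    """Per-test significance level that keeps the overall level at alpha."""
    return alpha / n_tests

threshold = bonferroni_threshold(0.05, 10_000)
print(math.isclose(threshold, 5e-6))  # True: each gene must reach p < 0.000005
```

The correction is very conservative: with only a handful of replicates per gene, few real differences will reach such a small p-value.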
63How Many Replicates?
Panels: singletons, duplicates, 3X
- Substantial error when only one array is analyzed; the standard is to use 3 replicates
Lee et al. (2000) PNAS
64What Types of Replicates?
Biological replicates
Technical replicates
Biological replication is most important because
it includes all of the potential sources for error
65Final Result
Highly Expressed      Reduced Expression
Trx   16.8            GPD   0.11
Enh1  13.2            Shn2  0.13
Hin2  11.8            Alp4  0.22
P53    8.4            OncB  0.23
Calm   7.3            Nrd1  0.25
Ned3   5.6            LamR  0.26
P21    5.5            SetH  0.30
Antp   5.4            LinK  0.32
Gad2   5.2            Mrd2  0.32
Gad3   5.1            Mrd3  0.33
Erp3   5.0            TshR  0.34
66Summary
- Data analysis begins with good image processing
- Sources of experimental variation lead to the need to normalize data
- One type of analysis involves clustering similar expression profiles
67For next time