Bioinformatics: Applications - PowerPoint PPT Presentation

Provided by: jonath76

1
Bioinformatics: Applications

ZOO 4903
Fall 2006, MW 10:30-11:45
Sutton Hall, Room 312
Microarrays: basic data analysis methods

2
Lecture overview

What we've talked about so far
Genes and gene expression
Microarrays: measuring the entire transcriptome
Overview
Image processing
Data normalization
Statistics

3
Image Processing

25,000 genes
50,000 measurements
Chips

4
Microarray Images

Resolution
standard 10 µm (100,000 atoms wide)
100 µm spot on chip = 10 pixels in diameter
Image format
TIFF, 16 bit (64K grey levels)
1 cm x 1 cm image at 16 bit = 2 MB (uncompressed)
Separate image for each fluorescent sample
channel 1, channel 2
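
The figures on this slide follow from simple arithmetic. A minimal sketch, assuming square pixels and no file header overhead (the function name is illustrative, not part of any scanner software):

```python
def image_size_bytes(width_cm, height_cm, pixel_um, bits_per_pixel):
    """Uncompressed size in bytes of one scanned channel."""
    pixels_per_cm = 10_000 / pixel_um  # 1 cm = 10,000 micrometres
    n_pixels = (width_cm * pixels_per_cm) * (height_cm * pixels_per_cm)
    return round(n_pixels * bits_per_pixel / 8)


size = image_size_bytes(1, 1, pixel_um=10, bits_per_pixel=16)
print(size)      # 2000000 bytes, about 2 MB per channel
print(2 ** 16)   # 65536 grey levels ("64K")
```

With two channels, one array therefore needs roughly twice that amount of storage.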

5
Images are scanned separately and combined
Laser 1
Laser 2
Green channel
Red channel
Overlay images and normalize
Scan and detect with confocal laser system
Image process and analyze
6
Image Processing

Addressing or gridding
Assigning coordinates to each of the spots
Segmentation or spot picking
Classifying pixels either as foreground or as
background
Intensity extraction (for each spot)
Foreground fluorescence intensity pairs (R, G)
Background intensities
Quality measures

7
Overview
Raw (combined) image
Gridded
Spots picked and flagged
8
Gridding Errors
Spotting errors
Uneven print tip hybridization
Gridding errors
9
Spot Picking

Classification of pixels as foreground or
background
Large selection of methods available, each has
strengths and weaknesses

10
Spot Picking

Segmentation/spot picking methods
Fixed circle segmentation
Adaptive circle segmentation
Adaptive shape segmentation
Histogram segmentation

11
Fixed Circle Segmentation
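
Fixed circle segmentation can be sketched in a few lines: every spot gets a circle of the same radius around its grid position, and pixels inside count as foreground. This is an illustrative toy on a 5x5 patch, not how any particular image analysis package implements it; the function names and pixel values are made up.

```python
def fixed_circle_mask(height, width, center_row, center_col, radius):
    """2-D list of booleans: True = foreground (inside the circle)."""
    return [
        [(r - center_row) ** 2 + (c - center_col) ** 2 <= radius ** 2
         for c in range(width)]
        for r in range(height)
    ]


def split_pixels(image, mask):
    """Separate pixel values into foreground and background lists."""
    fg = [v for row, mrow in zip(image, mask) for v, m in zip(row, mrow) if m]
    bg = [v for row, mrow in zip(image, mask) for v, m in zip(row, mrow) if not m]
    return fg, bg


# Toy patch with a bright spot in the middle (hypothetical values)
image = [
    [10, 10, 10, 10, 10],
    [10, 10, 90, 10, 10],
    [10, 90, 99, 90, 10],
    [10, 10, 90, 10, 10],
    [10, 10, 10, 10, 10],
]
mask = fixed_circle_mask(5, 5, center_row=2, center_col=2, radius=1)
fg, bg = split_pixels(image, mask)
print(len(fg), len(bg))  # 5 20
```

Because the radius is the same for every spot, a spot that is larger, smaller or off-centre will have pixels assigned to the wrong class.
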
12
Adaptive Circle Segmentation

Circle diameter is estimated separately for each
spot
GenePix finds spots by detecting edges of spots
(second derivative)

13
Adaptive Circle Segmentation
14
Question

Q: What happens with spot-finding algorithms if
the spot on the microarray is irregular (i.e.,
not a circle)?

15
Question

Q: What happens with spot-finding algorithms if
the spot on the microarray is irregular (i.e.,
not a circle)?
A: Pixels are misassigned: background is counted
as signal and vice versa

16
Adaptive Shape Segmentation
Edge detection or seeded region growing. Regions
grow outwards from the seed points, preferentially
according to the difference between a pixel's
value and the running mean of values in an
adjoining region
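
The region-growing idea can be sketched as follows. This is a simplified, single-seed version written for illustration: full seeded region growing lets several foreground and background seeds compete, whereas here one region grows until the next-best pixel differs from the running mean by more than a threshold. The name `grow_region`, the `max_diff` parameter and the pixel values are assumptions.

```python
def grow_region(image, seed, max_diff):
    """Grow a region from `seed`, always adding the neighbouring pixel
    whose value is closest to the region's running mean."""
    rows, cols = len(image), len(image[0])
    region = {seed}
    total = image[seed[0]][seed[1]]
    while True:
        mean = total / len(region)
        # Candidate pixels: 4-neighbours of the region not yet in it
        frontier = set()
        for r, c in region:
            for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
                if 0 <= nr < rows and 0 <= nc < cols and (nr, nc) not in region:
                    frontier.add((nr, nc))
        if not frontier:
            break
        best = min(frontier, key=lambda p: (abs(image[p[0]][p[1]] - mean), p))
        if abs(image[best[0]][best[1]] - mean) > max_diff:
            break  # remaining neighbours look like background
        region.add(best)
        total += image[best[0]][best[1]]
    return region


# Irregular (L-shaped) bright spot on a dark background
image = [
    [10, 12, 11, 10],
    [11, 80, 85, 10],
    [10, 82, 12, 11],
    [10, 81, 10, 12],
]
region = grow_region(image, seed=(1, 1), max_diff=20)
print(sorted(region))  # [(1, 1), (1, 2), (2, 1), (3, 1)]
```

Unlike a circle, the grown region follows the actual shape of the spot.
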
17
Information Extraction

Spot Intensities
mean (pixel intensities)
median (pixel intensities)
Background values
Local Background
Morphological opening
Constant (global)
Quality Information

Take the average
18
Spot morphology does not affect dynamic range

The red line indicates signal level for
non-spiked target.
Error bars represent one standard deviation for
each mean (n = 18) signal

19
Spot Intensity

The total amount of hybridization for a spot is
proportional to the total fluorescence at the
spot
Spot intensity = sum of pixel intensities within a spot
Later calculations are based on ratios between
Cy5 and Cy3, so we tally in some way the
intensity of the spot
Can use ratios of medians, means or even modes
(if binned)
Non-specific hybridization subtracted (area
outside the spot)
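
A small sketch of why the choice of summary matters. The pixel values are invented, with one saturated red pixel (for example a dust speck) to show how the mean and the median react; `spot_signal` is an illustrative helper, not a function from any package.

```python
from statistics import mean, median


def spot_signal(fg_pixels, bg_pixels, summary=median):
    """Background-subtracted spot intensity using the chosen summary."""
    return summary(fg_pixels) - summary(bg_pixels)


# Hypothetical foreground / background pixels for one spot
red_fg, red_bg = [900, 1000, 1100, 5000], [100, 100, 120, 80]
green_fg, green_bg = [500, 550, 600, 450], [50, 50, 60, 40]

ratio_of_medians = spot_signal(red_fg, red_bg) / spot_signal(green_fg, green_bg)
ratio_of_means = (spot_signal(red_fg, red_bg, mean)
                  / spot_signal(green_fg, green_bg, mean))
print(ratio_of_medians)  # 2.0
print(ratio_of_means)    # 4.0
```

The single bright pixel doubles the ratio of means but leaves the ratio of medians unchanged.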

20
Mean, Median and Mode
Mode
Median
Mean
21
Background Intensity

A spot's measured intensity includes a
contribution of non-specific hybridization and
other chemicals on the glass
Fluorescence intensity from regions not occupied
by DNA can be different from regions occupied by
DNA

22
Local Background Detection

Focuses on small regions around spot mask
Determine median/mean pixel values in this region
Most common approach

By not considering the pixels immediately
surrounding the spots, the background estimate is
less sensitive to the performance of the
segmentation procedure
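
The effect of skipping the pixels right next to the spot can be sketched with a ring-shaped background region. This is a toy, assuming a circular spot mask; `gap` and `width` are illustrative parameter names and the image is synthetic, with a dim halo of spot signal just outside the mask.

```python
from statistics import median


def local_background(image, center, spot_radius, gap, width):
    """Median of pixels in a ring starting `gap` pixels outside the spot mask."""
    r0, c0 = center
    inner = spot_radius + gap
    outer = inner + width
    ring = [
        value
        for r, row in enumerate(image)
        for c, value in enumerate(row)
        if inner ** 2 < (r - r0) ** 2 + (c - c0) ** 2 <= outer ** 2
    ]
    return median(ring)


def toy_value(d2):
    """Spot core = 100, halo missed by the mask = 60, true background = 20."""
    if d2 <= 1:
        return 100
    if d2 <= 4:
        return 60
    return 20


image = [[toy_value((r - 3) ** 2 + (c - 3) ** 2) for c in range(7)]
         for r in range(7)]
print(local_background(image, (3, 3), spot_radius=1, gap=1, width=1))  # 20
print(local_background(image, (3, 3), spot_radius=1, gap=0, width=1))  # 60
```

Without the gap, the halo that segmentation missed is counted as background and the estimate is inflated.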

23
Quality problems
Irregular spot, comet tail, streaking
High background, low intensity, OK
24
Quality Measurements

Array
Correlation between duplicate spot intensities
Percentage of spots with negative signals
Distribution of actual spot signal area vs.
idealized
Inter-array consistency
Spot
Signal / Noise ratio
Variation in pixel intensities within spots

25
Visualizing the expression data

A pretty picture is not enough

26
Log Transformation
linear scale
log2 scale
expt A
ch2 intensity
27
Choice of Base is Not Important
log10
ln
28
Why Log Transform?

Makes variation of intensities and ratios of
intensities more independent of absolute
magnitude
Evens out highly skewed distributions
Gives more realistic sense of variation
Approximates normal distribution
Treats up- and down- regulated genes
symmetrically
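
A short sketch of the symmetry argument, using made-up expression ratios. It also shows that switching base only rescales the values by a constant.

```python
import math

ratios = [0.1, 0.5, 1.0, 2.0, 10.0]  # hypothetical Cy5/Cy3 ratios
log2_ratios = [math.log2(r) for r in ratios]
log10_ratios = [math.log10(r) for r in ratios]

# 2-fold up and 2-fold down are equally far from "no change" (0)
print(math.log2(2.0), math.log2(0.5))  # 1.0 -1.0

# Changing base multiplies every value by the same constant
scale = math.log2(10)
rescaled = [scale * v for v in log10_ratios]
```

On the linear scale, down-regulated genes are squeezed between 0 and 1 while up-regulated genes range from 1 upwards; after the log transform both directions cover the same distance.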

29
Log scores are symmetric
Linear: 0.1, 1.0, 10
Same data
Log10: -1, 0, 1
30
Log scores better visualize variation in both
directions
31
A Microarray Scatter Plot
32
Correlation
Comet-tailing from non-balanced channels
Cy5 (red) intensity
Cy3 (green) intensity
Linear Non-linear
33
Correlation
Positive (+) correlation, uncorrelated, negative (-) correlation
34
Correlation
High correlation
Low correlation
Perfect correlation
35
Correlation Coefficient
r = 0.85
r = 0.4
r = 1.0
36
Correlation and Outliers
Experimental error or something important?
A single bad point can affect a good
correlation, and the problem with microarrays is
that we are expecting bad points
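
The sensitivity to a single bad point can be sketched with the Pearson correlation coefficient. The intensities below are invented for illustration.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical log intensities for five well-behaved spots
cy3 = [1.0, 2.0, 3.0, 4.0, 5.0]
cy5 = [1.1, 1.9, 3.2, 3.9, 5.1]

r_clean = pearson_r(cy3, cy5)
r_outlier = pearson_r(cy3 + [6.0], cy5 + [0.5])  # add one aberrant spot
print(round(r_clean, 3), round(r_outlier, 3))
```

One added point is enough to pull a near-perfect correlation down to a weak one.
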
37
Normal vs. Normal
Normal vs. Tumor
38
(R,G) → (M,A) Transformation
Transformed data $(M, A)_n$, $n = 1..5184$
$M = \log_2(R/G)$ (ratio)
$A = \log_2 (RG)^{1/2} = \tfrac{1}{2}\log_2(RG)$ (intensity)
$\Leftrightarrow R = (2^{2A+M})^{1/2}$, $G = (2^{2A-M})^{1/2}$
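
The transformation and its inverse can be checked numerically. A minimal sketch with M = log2(R/G) and A = (1/2) log2(RG); the function names and the example intensities are illustrative.

```python
import math


def to_ma(r, g):
    """(R, G) -> (M, A): log ratio and average log intensity."""
    m = math.log2(r / g)
    a = 0.5 * math.log2(r * g)
    return m, a


def from_ma(m, a):
    """(M, A) -> (R, G): R = 2^((2A + M)/2), G = 2^((2A - M)/2)."""
    r = 2 ** ((2 * a + m) / 2)
    g = 2 ** ((2 * a - m) / 2)
    return r, g


m, a = to_ma(4000.0, 1000.0)   # a spot 4x brighter in red than in green
r_back, g_back = from_ma(m, a)
print(m)                       # 2.0
```

Converting back recovers the original red and green intensities up to floating-point rounding.
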
39
Normalization
Dealing with sources of systematic error
40
Sources of Systematic Bias

Different dye labeling efficiencies
Scanning (laser and detector, chemistry of the
fluorescent label)
Differences in concentration of DNA on arrays
(plate effects)
Differences in total mRNA in one sample versus
another or mRNA degradation
Printing or tip problems
Uneven hybridization

41
Normalization

Reduces systematic (multiplicative) differences
between two channels of a single hybridization or
differences between hybridizations
Several Methods
Global mean method
(Iterative) linear regression method
Curvilinear methods (e.g. Lowess)
Variance model methods

Try to get a slope of about 1 and a correlation of about 1
42
Example Where Normalization is Needed
43
Example Where Normalization is Not Needed
44
Normalization to a Global Mean

Calculate mean intensity of all spots in channels
1 and 2
e.g. μch2 = 25,000, μch1 = 20,000, so μch2/μch1 = 1.25
On average, spots in ch2 are 1.25X brighter than
spots in ch1
To normalize, multiply spots in ch1 by 1.25
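
The same calculation as a sketch, using three made-up spots per channel chosen so that the channel means match the slide's example (20,000 and 25,000).

```python
def global_mean_factor(ch1, ch2):
    """Factor that scales ch1 so its mean matches the mean of ch2."""
    return (sum(ch2) / len(ch2)) / (sum(ch1) / len(ch1))


ch1 = [10000, 20000, 30000]   # mean 20,000 (hypothetical spots)
ch2 = [12500, 25000, 37500]   # mean 25,000

factor = global_mean_factor(ch1, ch2)
ch1_normalized = [v * factor for v in ch1]
print(factor)  # 1.25
```

After scaling, both channels have the same mean intensity, so the average ratio between them is 1.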

45
Normalization by Iterative Linear Regression

Fit a line (y = mx + b) to the data set
Set aside outliers (residuals > 2 x SD)
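
One way to read "iterative" here is: fit, drop points whose residual exceeds 2 SD, refit, and repeat until nothing more is dropped. The sketch below implements that reading with ordinary least squares; the stopping rules (`max_iter`, keeping at least 3 points) and the data are assumptions, not taken from the lecture.

```python
import math


def fit_line(points):
    """Ordinary least-squares fit of y = m*x + b."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sxx = sum((x - mx) ** 2 for x, _ in points)
    sxy = sum((x - mx) * (y - my) for x, y in points)
    m = sxy / sxx
    return m, my - m * mx


def iterative_fit(points, n_sd=2.0, max_iter=10):
    """Refit repeatedly, setting aside points with |residual| > n_sd * SD."""
    kept = list(points)
    for _ in range(max_iter):
        m, b = fit_line(kept)
        residuals = [y - (m * x + b) for x, y in kept]
        sd = math.sqrt(sum(e * e for e in residuals) / len(residuals))
        inliers = [p for p, e in zip(kept, residuals) if abs(e) <= n_sd * sd]
        if len(inliers) == len(kept) or len(inliers) < 3:
            break
        kept = inliers
    m, b = fit_line(kept)
    return m, b, kept


# Hypothetical data: ch2 is about 1.25 x ch1 with small noise, plus one bad spot
points = [(float(x), 1.25 * x + (0.1 if x % 2 else -0.1)) for x in range(1, 11)]
points.append((5.5, 20.0))
slope, intercept, kept = iterative_fit(points)
print(round(slope, 2), len(kept))
```

The fitted slope then plays the role of the normalization factor between the two channels.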

46
Background correction or not?

No background correction necessary

47
Prior to Lo(w)ess Normalization
48
Global (Loess) Normalization
49
A vs. M Plot
Axes: ratio, log2(Cy5 / Cy3), centered on 0 (y) vs. average signal, log2(Cy3 x Cy5)/2 (x)
50
Loess Function
Loess function fit line
Axes: ratio, log2(Cy5 / Cy3), centered on 0 (y) vs. average signal, log2(Cy3 x Cy5)/2 (x)
51
Data After Normalization
Axes: ratio, log2(Cy5 / Cy3), centered on 0 (y) vs. average signal, log2(Cy3 x Cy5)/2 (x)
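
Loess normalization fits a smooth curve to the log ratio as a function of average signal and subtracts it, so the normalized ratios are centered on 0 at every intensity. The sketch below is a bare-bones local linear regression with tricube weights, written for illustration only: it omits the robustness iterations of real lowess, runs in quadratic time, and the names `loess_fit`, `loess_normalize` and `frac` are assumptions. In practice a statistics package would be used.

```python
def loess_fit(xs, ys, frac=0.5):
    """Local linear regression with tricube weights, evaluated at each x."""
    n = len(xs)
    k = max(3, int(frac * n))          # number of neighbours per local fit
    fitted = []
    for x0 in xs:
        dists = sorted(abs(x - x0) for x in xs)
        h = dists[k - 1] or 1e-12      # bandwidth: distance to k-th nearest point
        w = [(1 - min(abs(x - x0) / h, 1) ** 3) ** 3 for x in xs]
        sw = sum(w)
        mx = sum(wi * x for wi, x in zip(w, xs)) / sw
        my = sum(wi * y for wi, y in zip(w, ys)) / sw
        sxx = sum(wi * (x - mx) ** 2 for wi, x in zip(w, xs))
        sxy = sum(wi * (x - mx) * (y - my) for wi, x, y in zip(w, xs, ys))
        slope = sxy / sxx if sxx > 0 else 0.0
        fitted.append(my + slope * (x0 - mx))
    return fitted


def loess_normalize(a_values, m_values, frac=0.5):
    """Subtract the intensity-dependent trend from the log ratios."""
    trend = loess_fit(a_values, m_values, frac)
    return [m - t for m, t in zip(m_values, trend)]


# Hypothetical array with a purely intensity-dependent dye bias
a_vals = [6 + 0.5 * i for i in range(20)]
m_vals = [0.1 * a - 0.5 for a in a_vals]
m_norm = loess_normalize(a_vals, m_vals)
print(max(abs(v) for v in m_norm) < 1e-9)  # True
```

Here the bias is removed completely because it is the only signal in the toy data; on a real array, genuinely changed genes remain as points away from 0.
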
52
Print-tip Normalization
Print-tip layout
53
Scaled Print-tip Normalization
After scaled print-tip normalization
After print-tip normalization
54
Non-systematic sources of variability
Noise in the system
55
Sources of variability

Day to day variation
Organism to organism variation
Array to array variation
Cross-hybridization

56
The trouble with cross-hybridization

With cross-hybridization, each probe will signal
the presence of multiple sequences other than
the one it was designed for
This skews the observed data away from the expected
data.

Observed expression profile vector
(cross-hybridized)
Expected expression profile vector (no
cross-hybridization)
57
Cross-hybridization
Analysis of a cross-hybridization within the
CYP450 superfamily
Xu et al. (2001) Gene
58
Cross-hybridization (xhyb) can be computationally predicted (and
avoided)

Good probe design: avoid spotting genes with
areas of significant sequence overlap
20 bp of exact sequence identity
50 bp of over 90% sequence similarity
BLAST can be used to check similarity, but keep
in mind that its goal is to find optimal local
alignments, not the set of areas most likely to
cross-hybridize
Stringent wash to get rid of non-specific binding
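
The exact-identity rule lends itself to a simple screen: two sequences are flagged if they share any identical stretch of 20 bases. The sketch below checks only that rule, on the same strand; it does not test the 50 bp / 90% similarity criterion or reverse complements, and the sequences are invented.

```python
def shared_kmers(seq_a, seq_b, k=20):
    """Set of length-k substrings present in both sequences."""
    kmers_a = {seq_a[i:i + k] for i in range(len(seq_a) - k + 1)}
    kmers_b = {seq_b[i:i + k] for i in range(len(seq_b) - k + 1)}
    return kmers_a & kmers_b


def may_cross_hybridize(seq_a, seq_b, k=20):
    """True if the sequences share at least k bp of exact identity."""
    return bool(shared_kmers(seq_a.upper(), seq_b.upper(), k))


# Hypothetical probes sharing a 23-base core
core = "ACGTACGGTCAATGCCTGAAGCT"
probe_1 = "TTTT" + core + "GGGG"
probe_2 = "CCCC" + core + "AAAA"
probe_3 = "A" * 40

print(may_cross_hybridize(probe_1, probe_2))  # True
print(may_cross_hybridize(probe_1, probe_3))  # False
```

A flagged pair is a candidate for redesign or for closer inspection with an alignment tool.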

59
Statistics
When does a difference make a difference?
60
Statistical significance tests

Parametric tests
T-test (P values)
Significance Analysis of Microarrays (modified
t-test, T values)
ANOVA (2 arrays, F-value)
Non-parametric tests
Mann-Whitney
Wilcoxon Rank Sum Test
Problems? n is always small
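
As an illustration of the parametric case, here is a sketch of the unequal-variance (Welch) form of the two-sample t statistic for one gene measured on three arrays per condition. The expression values are invented, and turning the statistic into a p-value (which needs the t distribution) is left to a statistics package.

```python
import math
from statistics import mean, variance


def welch_t(a, b):
    """Two-sample t statistic without assuming equal variances."""
    va = variance(a) / len(a)   # squared standard error of mean(a)
    vb = variance(b) / len(b)
    return (mean(a) - mean(b)) / math.sqrt(va + vb)


# Hypothetical log2 expression of one gene, 3 replicates per condition
tumor = [8.1, 8.3, 8.2]
normal = [6.0, 6.2, 6.1]
t_stat = welch_t(tumor, normal)
print(round(t_stat, 1))
```

With only three replicates, the variance estimates in the denominator are themselves very noisy, which is the "n is always small" problem.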

61
False Discovery

Statisticians call a false positive a "type 1
error" or a "false discovery"
The expected number of false discoveries is equal to the
p-value of the t-test x the number of genes in
the array (the False Discovery Rate, FDR, is this
number as a fraction of all genes called different)
For a p-value of 0.01 x 10,000 genes = 100
genes falsely called different
You cannot eliminate FPs, but stringent p-values
can keep them manageable (try p < 0.001)
The number of false discoveries must be smaller than
the number of real differences that you find, which
in turn depends on the size of the differences and
the variability of the measured expression values
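
The arithmetic on this slide as a sketch. It assumes that most genes on the array are not truly changed, so nearly every test is a chance to produce a false positive; the figure of 400 called genes is hypothetical.

```python
def expected_false_positives(p_threshold, n_genes):
    """Expected number of unchanged genes that pass the threshold by chance."""
    return p_threshold * n_genes


def false_discovery_fraction(p_threshold, n_genes, n_called):
    """Expected share of the called genes that are false positives."""
    return expected_false_positives(p_threshold, n_genes) / n_called


print(round(expected_false_positives(0.01, 10_000)))    # 100
print(round(expected_false_positives(0.001, 10_000)))   # 10

# If 400 genes pass p < 0.01, about a quarter of them are expected to be false
print(round(false_discovery_fraction(0.01, 10_000, n_called=400), 2))  # 0.25
```

The list of called genes is only useful when it is clearly longer than the number of false positives expected by chance.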

62
Bonferroni correction

To get a p-value of 0.05 when you're essentially
taking many, many measurements, you must account
for these multiple measurements
10,000 genes x p-value of 0.05 = 500
false positives
The level for statistical significance is divided
by the number of measurements

e.g., p
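
A sketch of the correction for a 10,000-gene array at an overall significance level of 0.05. The gene names and p-values are invented.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test significance level after Bonferroni correction."""
    return alpha / n_tests


threshold = bonferroni_threshold(0.05, 10_000)   # about 0.000005

# Hypothetical per-gene p-values from a t-test
p_values = {"geneA": 1e-7, "geneB": 4e-6, "geneC": 2e-4, "geneD": 0.03}
significant = [gene for gene, p in p_values.items() if p < threshold]
print(significant)  # ['geneA', 'geneB']
```

The correction is conservative: geneC and geneD would pass an uncorrected 0.05 cutoff but not the corrected one.
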
63
How Many Replicates?
Singletons Duplicates
3X

Substantial error when only one array is analyzed;
the standard is to use 3 replicates

Lee et al. (2000) PNAS
64
What Types of Replicates?
Biological replicates
Technical replicates
Biological replication is most important because
it includes all of the potential sources of error
65
Final Result
Highly Expressed      Reduced Expression
Trx   16.8            GPD   0.11
Enh1  13.2            Shn2  0.13
Hin2  11.8            Alp4  0.22
P53    8.4            OncB  0.23
Calm   7.3            Nrd1  0.25
Ned3   5.6            LamR  0.26
P21    5.5            SetH  0.30
Antp   5.4            LinK  0.32
Gad2   5.2            Mrd2  0.32
Gad3   5.1            Mrd3  0.33
Erp3   5.0            TshR  0.34
66
Summary