Bioinformatics: Applications - PowerPoint PPT Presentation

Provided by: jonath76
Transcript and Presenter's Notes

Title: Bioinformatics: Applications


1
Bioinformatics: Applications
  • ZOO 4903
  • Fall 2006, MW 10:30-11:45
  • Sutton Hall, Room 312
  • Microarrays: basic data analysis methods

2
Lecture overview
  • What we've talked about so far
  • Genes & gene expression
  • Microarrays: measuring the entire transcriptome
  • Overview
  • Image processing
  • Data normalization
  • Statistics

3
Image Processing
  • 25,000 genes
  • 50,000 measurements
  • Chips

4
Microarray Images
  • Resolution
  • standard 10 µm (100,000 atoms wide)
  • 100 µm spot on chip = 10 pixels in diameter
  • Image format
  • TIFF 16 bit (64K grey levels)
  • 1 cm x 1 cm image at 16 bit = 2 MB (uncompressed)
  • Separate image for each fluorescent sample
  • channel 1, channel 2
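
The 2 MB figure follows from the resolution and bit depth on this slide. A minimal sketch of the arithmetic, assuming square 10 µm pixels and ignoring any file header overhead (the function name is illustrative, not from the slides):

```python
def image_size_bytes(width_um, height_um, pixel_um=10, bits_per_pixel=16):
    """Uncompressed size of one scanned channel, ignoring header overhead."""
    pixels_wide = width_um // pixel_um
    pixels_high = height_um // pixel_um
    return pixels_wide * pixels_high * bits_per_pixel // 8

# 1 cm = 10,000 um, so a 1 cm x 1 cm scan is 1000 x 1000 pixels.
print(image_size_bytes(10_000, 10_000))  # 2000000 bytes, about 2 MB
print(2 ** 16)                           # 65536 grey levels ("64K")
print(100 // 10)                         # a 100 um spot spans 10 pixels
```

Because each fluorescent sample is scanned into its own image, a two-channel array needs twice this amount.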

5
Images are scanned separately and combined
Laser 1
Laser 2
Green channel
Red channel
Overlay images and normalize
Scan and detect with confocal laser system
Image process and analyze
6
Image Processing
  • Addressing or gridding
  • Assigning coordinates to each of the spots
  • Segmentation or spot picking
  • Classifying pixels either as foreground or as
    background
  • Intensity extraction (for each spot)
  • Foreground fluorescence intensity pairs (R, G)
  • Background intensities
  • Quality measures

7
Overview
Raw (combined) image
Gridded
Spots picked & flagged
8
Gridding Errors
Spotting errors
Uneven print tip hybridization
Gridding errors
9
Spot Picking
  • Classification of pixels as foreground or
    background
  • Large selection of methods available, each has
    strengths & weaknesses

10
Spot Picking
  • Segmentation/spot picking methods
  • Fixed circle segmentation
  • Adaptive circle segmentation
  • Adaptive shape segmentation
  • Histogram segmentation
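
As a rough illustration of the simplest method in this list, fixed circle segmentation labels every pixel inside a circle of one preset radius as foreground. The sketch below assumes the gridding step has already supplied the spot centre; the function names and the toy image are hypothetical.

```python
def fixed_circle_mask(height, width, center, radius):
    """2D list of booleans: True means the pixel lies inside the circle."""
    cr, cc = center
    return [[(r - cr) ** 2 + (c - cc) ** 2 <= radius ** 2 for c in range(width)]
            for r in range(height)]

def split_pixels(image, mask):
    """Separate pixel values into foreground and background lists."""
    foreground, background = [], []
    for row, mask_row in zip(image, mask):
        for value, is_spot in zip(row, mask_row):
            (foreground if is_spot else background).append(value)
    return foreground, background

image = [
    [100, 100, 100, 100, 100],
    [100, 100, 900, 100, 100],
    [100, 900, 900, 900, 100],
    [100, 100, 900, 100, 100],
    [100, 100, 100, 100, 100],
]
mask = fixed_circle_mask(5, 5, center=(2, 2), radius=1)
foreground, background = split_pixels(image, mask)
print(len(foreground), len(background))  # 5 foreground and 20 background pixels
```

The same radius is used for every spot, which is why spots that are larger, smaller or off-centre end up with misassigned pixels.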

11
Fixed Circle Segmentation
12
Adaptive Circle Segmentation
  • Circle diameter is estimated separately for each
    spot
  • GenePix finds spots by detecting edges of spots
    (second derivative)

13
Adaptive Circle Segmentation
14
Question
  • Q: What happens with spot finding algorithms if
    the spot on the microarray is irregular (i.e.,
    not a circle)?

15
Question
  • Q: What happens with spot finding algorithms if
    the spot on the microarray is irregular (i.e.,
    not a circle)?
  • A: Pixels are misassigned: background is counted
    as signal and vice versa

16
Adaptive Shape Segmentation
Edge detection or Seeded Region Growing: regions
grow outwards from the seed points, preferentially
according to the difference between a pixel's
value and the running mean of values in an
adjoining region
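
The region-growing idea can be sketched with a priority queue: the unlabelled pixel whose value is closest to the running mean of a neighbouring region is claimed first. This is a simplified illustration, not the algorithm of any particular package. It assumes one seed per region and 4-connected neighbours, and it ranks candidates by the region mean at the time they were queued.

```python
import heapq

def seeded_region_growing(image, seeds):
    """seeds maps a label to a (row, col) seed. Returns a 2D list of labels."""
    rows, cols = len(image), len(image[0])
    labels = [[None] * cols for _ in range(rows)]
    total, count = {}, {}   # running sum and pixel count per region
    heap = []               # entries: (difference from region mean, row, col, label)

    def push_neighbours(r, c, label):
        mean = total[label] / count[label]
        for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
            if 0 <= nr < rows and 0 <= nc < cols and labels[nr][nc] is None:
                heapq.heappush(heap, (abs(image[nr][nc] - mean), nr, nc, label))

    for label, (r, c) in seeds.items():
        labels[r][c] = label
        total[label], count[label] = image[r][c], 1
    for label, (r, c) in seeds.items():
        push_neighbours(r, c, label)

    while heap:
        _, r, c, label = heapq.heappop(heap)
        if labels[r][c] is not None:
            continue        # already claimed by a better-matching region
        labels[r][c] = label
        total[label] += image[r][c]
        count[label] += 1
        push_neighbours(r, c, label)
    return labels

image = [
    [10, 10, 10, 10, 10],
    [10, 90, 95, 10, 10],
    [10, 92, 99, 97, 10],
    [10, 10, 94, 10, 10],
    [10, 10, 10, 10, 10],
]
labels = seeded_region_growing(image, {"spot": (2, 2), "background": (0, 0)})
spot_pixels = sum(row.count("spot") for row in labels)
print(spot_pixels)  # the six bright pixels form the irregular spot
```

Unlike the circle-based methods, the resulting mask follows the actual outline of the spot.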
17
Information Extraction
  • Spot Intensities
  • mean (pixel intensities)
  • median (pixel intensities)
  • Background values
  • Local Background
  • Morphological opening
  • Constant (global)
  • Quality Information

Take the average
18
Spot morphology does not affect dynamic range
  • The red line indicates signal level for
    non-spiked target.
  • Error bars represent one standard deviation for
    each mean (n = 18) signal

19
Spot Intensity
  • The total amount of hybridization for a spot is
    proportional to the total fluorescence at the
    spot
  • Spot intensity = sum of pixel intensities within a spot
  • Later calculations are based on ratios between
    Cy5 and Cy3, so we tally in some way the
    intensity of the spot
  • Can use ratios of medians, means or even modes
    (if binned)
  • Non-specific hybridization subtracted (area
    outside the spot)

20
Mean, Median & Mode
Mode
Median
Mean
21
Background Intensity
  • A spot's measured intensity includes a
    contribution of non-specific hybridization and
    other chemicals on the glass
  • Fluorescence intensity from regions not occupied
    by DNA can be different from regions occupied by
    DNA

22
Local Background Detection
  • Focuses on small regions around spot mask
  • Determine median/mean pixel values in this region
  • Most common approach
  • By not considering the pixels immediately
    surrounding the spots, the background estimate is
    less sensitive to the performance of the
    segmentation procedure
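
A minimal sketch of local background estimation, assuming a circular spot mask and a ring-shaped background region separated from the spot by a small gap. The parameter names and the synthetic image are illustrative only.

```python
from statistics import median

def spot_and_local_background(image, center, spot_radius, gap, ring_width):
    """Median foreground and median local background for one spot.

    Background pixels come from a ring that starts `gap` pixels outside
    the spot mask, so pixels immediately surrounding the spot are ignored.
    """
    cr, cc = center
    inner = spot_radius + gap
    outer = inner + ring_width
    foreground, background = [], []
    for r, row in enumerate(image):
        for c, value in enumerate(row):
            d2 = (r - cr) ** 2 + (c - cc) ** 2
            if d2 <= spot_radius ** 2:
                foreground.append(value)
            elif inner ** 2 < d2 <= outer ** 2:
                background.append(value)
    return median(foreground), median(background)

# Synthetic spot: bright core (1000), a dimmer halo that the circular
# mask misses (300), and a flat background (100).
image = [[1000 if (r - 5) ** 2 + (c - 5) ** 2 <= 4
          else 300 if (r - 5) ** 2 + (c - 5) ** 2 <= 9
          else 100
          for c in range(11)] for r in range(11)]
fg, bg = spot_and_local_background(image, (5, 5), spot_radius=2, gap=1, ring_width=2)
print(fg - bg)  # background-corrected signal: 1000 - 100
```

Because the halo falls inside the gap, it does not inflate the background estimate, which is the point made in the last bullet.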

23
Quality problems
Irregular spot, comet tail, streaking
High background, low intensity, OK
24
Quality Measurements
  • Array
  • Correlation between duplicate spot intensities
  • Percentage of spots with negative signals
  • Distribution of actual spot signal area vs.
    idealized
  • Inter-array consistency
  • Spot
  • Signal / Noise ratio
  • Variation in pixel intensities within spots

25
Visualizing the expression data
  • A pretty picture is not enough

26
Log Transformation
linear scale
log2 scale
expt A
ch2 intensity
27
Choice of Base is Not Important
log10
ln
28
Why Log Transform?
  • Makes variation of intensities and ratios of
    intensities more independent of absolute
    magnitude
  • Evens out highly skewed distributions
  • Gives more realistic sense of variation
  • Approximates normal distribution
  • Treats up- and down- regulated genes
    symmetrically
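
The symmetry claim is easy to check numerically. In this short sketch (the function name is illustrative), a two-fold increase and a two-fold decrease sit at very different distances from 1 on the linear scale but at equal distances from 0 on the log2 scale:

```python
from math import log2

def log_ratio(red, green):
    """log2 expression ratio for one spot."""
    return log2(red / green)

print(200 / 100, 100 / 200)                        # linear ratios: 2.0 and 0.5
print(log_ratio(200, 100), log_ratio(100, 200))    # log2 ratios: 1.0 and -1.0
```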

29
Log scores are symmetric
Linear: 0.1, 1.0, 10
Same data, Log10: -1, 0, 1
30
Log scores better visualize variation in both
directions
31
A Microarray Scatter Plot
32
Correlation
Comet-tailing from non-balanced channels
Axes: Cy5 (red) intensity vs. Cy3 (green) intensity
Panels: Linear, Non-linear
33
Correlation
+ correlation, Uncorrelated, - correlation
34
Correlation
High correlation
Low correlation
Perfect correlation
35
Correlation Coefficient
r = 0.85
r = 0.4
r = 1.0
36
Correlation and Outliers
Experimental error or something important?
A single bad point can affect a good
correlation, and the problem with microarrays is
that we are expecting bad points
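
To illustrate how strongly a single bad point can pull down a correlation coefficient, here is a small sketch using the Pearson r formula. The intensity values are made up for illustration.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

green = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
red = [1.1, 1.9, 3.2, 3.9, 5.1, 6.0]
r_clean = pearson_r(green, red)
r_outlier = pearson_r(green + [7.0], red + [0.5])   # one bad spot added
print(round(r_clean, 3), round(r_outlier, 3))
```

With these numbers the correlation drops from nearly 1 to well below 0.5 after adding one point.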
37
Normal vs. Normal
Normal vs. Tumor
38
(R,G) → (M,A) Transformation
Transformed data $\{(M, A)\}_{n = 1..5184}$:
$M = \log_2(R/G)$ (ratio)
$A = \log_2 (RG)^{1/2} = \tfrac{1}{2} \log_2(RG)$ (intensity)
⇔ $R = (2^{2A + M})^{1/2}$, $G = (2^{2A - M})^{1/2}$
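
The transformation and its inverse can be written out directly. This is a small sketch with illustrative function names, where M is the log2 ratio and A is the mean log2 intensity as defined on this slide:

```python
from math import log2, sqrt

def to_ma(r, g):
    """(R, G) -> (M, A): M = log2(R/G), A = 0.5 * log2(R*G)."""
    return log2(r / g), 0.5 * log2(r * g)

def from_ma(m, a):
    """(M, A) -> (R, G): R = sqrt(2^(2A+M)), G = sqrt(2^(2A-M))."""
    return sqrt(2 ** (2 * a + m)), sqrt(2 ** (2 * a - m))

m, a = to_ma(1024, 256)
print(m, a)            # 2.0 9.0
print(from_ma(m, a))   # recovers R = 1024 and G = 256
```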
39
Normalization
Dealing with sources of systematic error
40
Sources of Systematic Bias
  • Different dye labeling efficiencies
  • Scanning (laser and detector, chemistry of the
    fluorescent label)
  • Differences in concentration of DNA on arrays
    (plate effects)
  • Differences in total mRNA in one sample versus
    another or mRNA degradation
  • Printing or tip problems
  • Uneven hybridization

41
Normalization
  • Reduces systematic (multiplicative) differences
    between two channels of a single hybridization or
    differences between hybridizations
  • Several Methods
  • Global mean method
  • (Iterative) linear regression method
  • Curvilinear methods (e.g. Lowess)
  • Variance model methods

Try to get a slope of about 1 and a correlation of about 1
42
Example Where Normalization is Needed
43
Example Where Normalization is Not Needed
44
Normalization to a Global Mean
  • Calculate mean intensity of all spots in channels
    1 & 2
  • e.g. mean ch2 = 25,000, mean ch1 = 20,000,
    so mean ch2 / mean ch1 = 1.25
  • On average, spots in ch2 are 1.25X brighter than
    spots in ch1
  • To normalize, multiply spots in ch1 by 1.25
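
The global mean method amounts to one scaling factor per array. A minimal sketch, with made-up intensities chosen to reproduce the 1.25 factor from this slide:

```python
def global_mean_normalize(ch1, ch2):
    """Scale ch1 so that its mean matches the mean of ch2."""
    factor = (sum(ch2) / len(ch2)) / (sum(ch1) / len(ch1))
    return [value * factor for value in ch1], factor

ch1 = [10_000, 20_000, 30_000]   # mean 20,000
ch2 = [15_000, 25_000, 35_000]   # mean 25,000
ch1_scaled, factor = global_mean_normalize(ch1, ch2)
print(factor)                              # 1.25
print(sum(ch1_scaled) / len(ch1_scaled))   # 25000.0, now equal to the ch2 mean
```

A single factor corrects only an overall brightness difference. It cannot remove bias that changes with intensity, which is what the curvilinear methods address.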

45
Normalization by Iterative Linear Regression
  • Fit a line (y = mx + b) to the data set
  • set aside outliers (residuals > 2 x SD)
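
One way to read the two steps on this slide is as a loop: fit a line, drop points whose residuals exceed 2 SD, and refit until no more points are dropped. The sketch below assumes ordinary least squares and the population standard deviation of the residuals. The stopping rule and the toy data are assumptions, not taken from the slides.

```python
from statistics import pstdev

def fit_line(x, y):
    """Ordinary least-squares slope m and intercept b."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    m = sxy / sxx
    return m, my - m * mx

def iterative_fit(x, y, n_sd=2.0, max_iter=10):
    """Refit after setting aside points with |residual| > n_sd * SD."""
    keep = list(zip(x, y))
    for _ in range(max_iter):
        m, b = fit_line(*zip(*keep))
        residuals = [yi - (m * xi + b) for xi, yi in keep]
        sd = pstdev(residuals)
        inliers = [p for p, res in zip(keep, residuals) if abs(res) <= n_sd * sd]
        if len(inliers) == len(keep):
            break               # nothing left to set aside
        keep = inliers
    return m, b, keep

x = list(range(10))
noise = [0.1, -0.2, 0.15, -0.1, 0.05, 0.0, -0.15, 0.2, -0.05, 0.1]
y = [2 * xi + 1 + ni for xi, ni in zip(x, noise)]
y[5] = 40.0                     # one aberrant spot
m, b, kept = iterative_fit(x, y)
print(round(m, 2), round(b, 2), len(kept))
```

In this example the aberrant point is set aside in the first pass and the refit recovers a slope close to 2.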

46
Background correction or not?
  • No background correction necessary

47
Prior to Lo(w)ess Normalization
48
Global (Loess) Normalization
49
A vs. M Plot
[Plot: y-axis = ratio, log2(Cy5 / Cy3), centered on 0; x-axis = average signal, log2(Cy3 x Cy5)/2]
50
Loess Function
Loess function fit line
[Plot: y-axis = ratio, log2(Cy5 / Cy3), centered on 0; x-axis = average signal, log2(Cy3 x Cy5)/2]
51
Data After Normalization
[Plot: y-axis = ratio, log2(Cy5 / Cy3), centered on 0; x-axis = average signal, log2(Cy3 x Cy5)/2]
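
The loess step can be sketched as follows: fit a smooth curve to M (log ratio) as a function of A (average signal), then subtract that curve from M. This is a bare-bones local linear regression with tricube weights written for illustration. Real analyses would normally use an established loess implementation, and the span `frac` and the synthetic data are assumptions.

```python
def loess_fit(x, y, frac=0.5):
    """Local linear regression with tricube weights, evaluated at each x."""
    n = len(x)
    k = min(n, max(3, int(frac * n)))      # points per neighbourhood
    fitted = []
    for x0 in x:
        dists = sorted(abs(xi - x0) for xi in x)
        h = dists[k - 1] or 1e-12          # bandwidth: k-th nearest distance
        w = [(1 - min(abs(xi - x0) / h, 1.0) ** 3) ** 3 for xi in x]
        sw = sum(w)
        mx = sum(wi * xi for wi, xi in zip(w, x)) / sw
        my = sum(wi * yi for wi, yi in zip(w, y)) / sw
        sxx = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x))
        sxy = sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, x, y))
        slope = sxy / sxx if sxx > 0 else 0.0
        fitted.append(my + slope * (x0 - mx))
    return fitted

def loess_normalize(m_values, a_values, frac=0.5):
    """Subtract the intensity-dependent trend from the log ratios."""
    trend = loess_fit(a_values, m_values, frac)
    return [m - t for m, t in zip(m_values, trend)]

a = [6 + 0.5 * i for i in range(19)]
m = [0.02 * (ai - 10) ** 2 for ai in a]    # curved, intensity-dependent bias
normalized = loess_normalize(m, a)
print(max(abs(v) for v in m), max(abs(v) for v in normalized))
```

After normalization the log ratios are centred much closer to 0 across the whole intensity range.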
52
Print-tip Normalization
Print-tip layout
53
Scaled Print-tip Normalization
After scaled print-tip normalization
After print-tip normalization
54
Non-systematic sources of variability
Noise in the system
55
Sources of variability
  • Day to day variation
  • Organism to organism variation
  • Array to array variation
  • Cross-hybridization

56
The trouble with cross-hybridization
  • With cross-hybridization, each probe will signal
    the presence of multiple sequences other than
    the one it was designed for
  • This skews the observed data from the expected
    data.



Observed expression profile vector
(cross-hybridized)
Expected expression profile vector (no
cross-hybridization)
57
Cross-hybridization
Analysis of a cross-hybridization within the
CYP450 superfamily
Xu et al. (2001) Gene
58
Xhyb can be computationally predicted (and
avoided)
  • Good probe design: avoid spotting genes with
    areas of significant sequence overlap
  • 20 bp of exact sequence identity
  • 50 bp of over 90% sequence similarity
  • BLAST can be used to check similarity, but keep
    in mind that its goal is to find optimal local
    alignments not a set of areas most likely to
    xhyb
  • Stringent wash to get rid of non-specific binding
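
The first rule of thumb above (a stretch of 20 bp of exact identity) can be checked with a simple shared-substring test. This sketch is not a substitute for BLAST or a proper probe-design tool. It ignores reverse complements and the 90% similarity rule, and the sequences are invented.

```python
def shares_exact_run(seq_a, seq_b, min_len=20):
    """True if both sequences contain an identical stretch of min_len bases."""
    kmers = {seq_a[i:i + min_len] for i in range(len(seq_a) - min_len + 1)}
    return any(seq_b[j:j + min_len] in kmers
               for j in range(len(seq_b) - min_len + 1))

shared = "ACGTACGTTAGCTAGCTAGG"            # 20 bp present in both probes
probe_a = "TTTT" + shared + "CCCC"
probe_b = "GGGA" + shared + "ATAT"
probe_c = "A" * 40
print(shares_exact_run(probe_a, probe_b))  # True: likely to cross-hybridize
print(shares_exact_run(probe_a, probe_c))  # False
```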

59
Statistics
When does a difference make a difference?
60
Statistical significance tests
  • Parametric tests
  • T-test (P values)
  • Significance Analysis of Microarrays (modified
    t-test, T values)
  • ANOVA (>2 arrays, F-value)
  • Non-parametric
  • Mann-Whitney
  • Wilcoxon Rank Sum Test
  • Problems? n is always small
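
As a concrete example of a per-gene test statistic, the sketch below computes a two-sample t statistic for one gene measured on replicate arrays. It uses the Welch (unequal variance) form, which is an assumption since the slide does not say which t-test is meant. Turning t into a p-value requires the t distribution and is not shown. The expression values are invented.

```python
from math import sqrt
from statistics import mean, variance

def welch_t(a, b):
    """Welch's two-sample t statistic (unequal variances)."""
    standard_error = sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / standard_error

tumor = [8.1, 8.3, 7.9]      # log2 expression, 3 replicates
normal = [6.0, 6.2, 5.8]
t = welch_t(tumor, normal)
print(round(t, 2))
```

With only three replicates per group the variance estimates are themselves very noisy, which is the problem flagged in the last bullet.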

61
False Discovery
  • Statisticians call false positives a "type I
    error" or a "False Discovery"
  • The expected number of false discoveries is
    equal to the p-value of the t-test x the number
    of genes in the array; the False Discovery Rate
    (FDR) is that number divided by the number of
    genes called different
  • For a p-value of 0.01 x 10,000 genes = 100
    falsely "different" genes
  • You cannot eliminate FPs, but stringent p-values
    can keep them manageable (try p < 0.001)
  • The number of false discoveries must be smaller
    than the number of real differences that you
    find - which in turn depends on the size of the
    differences and variability of the measured
    expression values
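
The arithmetic on this slide, as a short sketch (the function name is illustrative). It assumes every gene is tested at the same p-value cut-off and that none is truly different:

```python
def expected_false_positives(p_threshold, n_genes):
    """Expected number of genes passing the cut-off by chance alone."""
    return p_threshold * n_genes

print(expected_false_positives(0.01, 10_000))    # about 100
print(expected_false_positives(0.001, 10_000))   # about 10
```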

62
Bonferroni correction
  • To get a p-value of 0.05 when you're essentially
    taking many, many measurements, you must account
    for these multiple measurements
  • 10,000 genes x p-value of 0.05 = 500
    false positives
  • The level for statistical significance is divided
    by the number of measurements

e.g., p
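
A minimal sketch of the Bonferroni rule described on this slide: divide the desired significance level by the number of tests and compare each gene's p-value against the result. The function names and the p-values are made up.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test significance level after Bonferroni correction."""
    return alpha / n_tests

def bonferroni_significant(p_values, n_tests, alpha=0.05):
    """Indices of tests that remain significant after correction."""
    cutoff = bonferroni_threshold(alpha, n_tests)
    return [i for i, p in enumerate(p_values) if p < cutoff]

cutoff = bonferroni_threshold(0.05, 10_000)
print(cutoff)                                              # about 5e-06
print(bonferroni_significant([1e-7, 0.001, 0.04], 10_000)) # only the first gene
```

The correction is very conservative for arrays with thousands of genes, so genuinely changed genes with modest p-values will be missed.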
63
How Many Replicates?
Singletons, duplicates, triplicates (3X)
  • Substantial error when only one array is analyzed;
    the standard is to use 3 replicates

Lee et al. (2000) PNAS
64
What Types of Replicates?
Biological replicates
Technical replicates
Biological replication is most important because
it includes all of the potential sources for error
65
Final Result
Highly Expressed      Reduced Expression
Trx    16.8           GPD    0.11
Enh1   13.2           Shn2   0.13
Hin2   11.8           Alp4   0.22
P53     8.4           OncB   0.23
Calm    7.3           Nrd1   0.25
Ned3    5.6           LamR   0.26
P21     5.5           SetH   0.30
Antp    5.4           LinK   0.32
Gad2    5.2           Mrd2   0.32
Gad3    5.1           Mrd3   0.33
Erp3    5.0           TshR   0.34
66
Summary
  • Data analysis begins with good image processing
  • Sources of experimental variation lead to the
    need to normalize data
  • One type of analysis involves clustering similar
    expression profiles

67
For next time
  • Nothing this time :)