Title: Canadian Bioinformatics Workshops
Slide 3: Lecture 7: Microarrays I: Data Pre-Processing
MBP1010, Dr. Paul C. Boutros, Winter 2015
Aegeus, King of Athens, consulting the Delphic
Oracle. High Classical (430 BCE)
DEPARTMENT OF MEDICAL BIOPHYSICS
This workshop includes material originally
developed by Drs. Raphael Gottardo, Sohrab
Shah, Boris Steipe and others
Slide 4: Let's start off with a question
What do expression microarrays actually measure?
Slide 5: What is a Microarray?
- A DNA microarray is a multiplex technology
consisting of thousands of oligonucleotide spots,
each containing picomoles of a specific DNA
sequence.
- Used to quantitate mRNA or DNA
- Many applications
  - mRNA or DNA levels
  - SNP identification
  - ChIP-on-Chip
Slide 6: Hypotheses
- Microarrays are usually hypothesis-generating
- They highlight specific genes or features that
are particularly interesting for follow-up
experiments
- There are many interesting exceptions
  - Biomarkers
  - Pathway analyses
- This does not reduce the importance of
experimental design
- The low statistical power of array studies makes
good design even more important and very
challenging
Slide 7: Input Samples
The nature of the sample is critical:
- Unfrozen vs. frozen vs. FFPE
- Total RNA vs. poly-A RNA vs. other subsets
Slide 8: Microarray Basics
- Imagine a one-spot microarray
Target DNA is labeled, hybridized and washed.
Finally, scan the chip.
[Diagram labels: Target, Chip, Feature, Probe]
Slide 9: These Are Spotted Arrays
Printed onto a series of glass slides
using a robot with needle-heads.
Produce a characteristic gridding pattern and
almost always use two samples simultaneously
(two-colour).
Slide 10: Other Types of Arrays
- Inkjet Arrays
- Photolithographically generated arrays
- Bead arrays
- Protein/cell/lipid-arrays
  - More niche applications
  - Not discussed here
Slide 11: Inkjet Arrays
In 1999, HP spun off its life-science and
measurement division into Agilent Technologies.
The new company wanted to determine if printer
technology could be harnessed to generate
microarrays.
Slide 12: Inkjet Array Manufacture Involves Sequential
Nucleotide Addition
Slide 13: Photolithographic Arrays
- Produced using the techniques developed for the
production of transistors.
- Mostly pioneered by the company Affymetrix,
although other suppliers exist (e.g. Nimblegen)
- We will be working with Affymetrix data later, so
we will walk through the platform in significant
detail
Slide 14: The Glass Matrix
Addition of linker molecule
Slide 15: Photolithographic Synthesis
Slide 16: Deprotection
Slides 17-19: Nucleotide Addition
Slide 20: Capping Agents
Slide 21: Final Chip
[Diagram labels: Wafer, Feature, Chip]
Slide 22: (No Transcript)
Slides 23-24: RNA Wash
Slide 25: An Affymetrix Microarray
Slide 26: Self-Assembling Bead-Arrays
- Produced by Illumina
- 3 µm silica beads, randomly placed
- Coated with 10^5 identical 25 bp probes
- Probes have identifying barcode (address)
sequences
[Diagram labels: labeled cDNA, bead, address, probe]
Slide 27: Comparing Array Platforms
Columns: Platform, Oligos, Price, Data Quality, Bioinformatics Research
(only the Oligos values survive in this transcript)
Platform        Oligos
Spotted cDNA    variable
Affymetrix      25 bp
Inkjet          70 bp
Bead Arrays     25 bp
I do not endorse specific platforms; they all
have their strengths and weaknesses.
Slide 28: Each Spot is a Probe
A) Remove Noise
B) Extract Data
[Diagram labels: Quantitation, ?]
Slide 29: Step 1: Image Quantitation
- Why? Quantitative vs. Qualitative
- How? Image Segmentation
- Difficulty?
- Research?
Slide 30: Image Segmentation 101: Find Grids
1. Find Grids
2. Find Spots
3. Spot Outline
Slide 31: Image Segmentation 101: Find Spots
Key step: integrate signal across the array
Slide 32: Image Segmentation 101: Challenges
Problems:
- Stray signal
- Missing spots
- Gross deformities
- Manual validation
Slide 33: Research?
- Surprisingly, not much investigation
- This is probably a source of error in all studies
- Manual checking of spot-detection remains the
norm
- Problematic as studies and arrays get larger
Slide 34: Quantitation
?
Slide 35: Step 2: Background Correction
- Why? Remove stray signal
- How? Model-based
- Difficulty?
- Research?
Slide 36: Spot Segmentation
Signal
???
Background
Slide 37: So what do we get?
Background Intensity: BG
Foreground Intensity: FG
Isn't it simple? Signal = FG - BG
NO! If BG > FG, then the signal is negative.
This happens for 0.1-2% of spots.
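A minimal sketch of the naive subtraction, to make the problem concrete. The intensities are made-up numbers and `subtract_background` is a hypothetical helper, not part of any array package. Spots where BG exceeds FG come out negative, so they cannot be log-transformed.

```python
# Hypothetical foreground (FG) and background (BG) intensities for five spots.
fg = [1200.0, 310.0, 95.0, 5400.0, 88.0]
bg = [100.0, 120.0, 110.0, 130.0, 90.0]


def subtract_background(fg, bg):
    """Naive correction: Signal = FG - BG, spot by spot."""
    return [f - b for f, b in zip(fg, bg)]


signal = subtract_background(fg, bg)

# Indices of spots whose corrected signal is zero or negative.
# These spots have no defined log2 intensity.
negative = [i for i, s in enumerate(signal) if s <= 0]

print(signal)
print(negative)
```

Here two of the five spots end up negative, which is far more than the 0.1-2% seen on real arrays; the toy data only exaggerate the effect.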
Slide 38: Why Might This Happen?
In 2001, two papers showed that empty spots have
less signal than background.
Unbound spots correspond to low-expression genes.
[Diagram labels: Background Intensity (BG), Foreground Intensity (FG)]
Thus unbound spots are particularly prone to
problems.
Slide 39: So What to Do?
- Heavy-duty mathematical tools employed
- Three major models developed
  - Edwards: log-linear
  - Smyth: normexp
  - Kooperberg: Bayesian
The math is extremely advanced, so we'll skip
that for now. Let's summarize the methods instead.
Slide 40: Comparison
Method        Speed       Accuracy
Edwards       Fast        Good
NormExp       Slow        Better
Kooperberg    Very Slow   Best
No strong criteria for selecting between these
algorithms.
Slide 41: Quantitation
?
Slide 42: Step 3: Spot Quality
- Why? Identify artefacts
- How? Unknown
- Difficulty?
- Research?
Slide 43: Spot-Weighting
- A perfect spot is used normally in analysis
  - Weight = 1
- A poor spot is given less consideration
  - 0 < Weight < 1
Problem: How the heck do we calculate weights?
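A small sketch of what spot weights do once they exist. It does not answer the hard question of how to calculate them: the weights and log-ratios are invented for illustration, and `weighted_mean` is a hypothetical helper.

```python
def weighted_mean(values, weights):
    """Average replicate spots, letting low-quality spots count for less."""
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("all spots have zero weight")
    return sum(v * w for v, w in zip(values, weights)) / total_weight


# Three replicate spots for one gene (hypothetical log-ratios).
# The third spot is an outlier that a quality metric has down-weighted.
log_ratios = [1.0, 1.2, 4.0]
weights = [1.0, 1.0, 0.1]

unweighted = sum(log_ratios) / len(log_ratios)
weighted = weighted_mean(log_ratios, weights)

print(unweighted, weighted)
```

The weighted estimate sits much closer to the two good spots than the plain mean does.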
Slide 44: A Few Approaches
- Mean-Median Correlation
- Composite q-metrics
- → Improve homotypic signal:noise
- But both fail sometimes, seemingly randomly.
Do we really need this?
Slide 45: All from one good-quality array!
Slide 46: But I use Affymetrix! (Or Agilent) (Or
Nimblegen) (Or Other Commercial Supplier)
Slide 47: Okay, Let's See Some Affy Data
Slides 48-50: (No Transcript)
Slide 51: Those Three Were From A Spike-In Experiment Done
by Affymetrix Themselves!
Slides 52-54: (No Transcript)
Slide 55: Spot Quality is An Issue, Regardless of Platform
Slide 56: Manual Flagging?
- Two studies show error rates of 5-20%
Spot quality is a huge, unsolved problem. Most
investigators ignore it. Most bioinformaticians
struggle with it.
Then we ignore it too.
Slide 57: Quantitation
?
Slide 58: Step 4: Intra-Array Normalization
- Why? Balance channels; remove spatial artifacts
- How? Multiple robust algorithms
- Difficulty?
- Research?
Slide 59: Within-Array Normalization
1. Spatial gradients
2. Channel-balancing
3. Intensity bias
Are red and green equal in our starting sample?
Slide 60: We Can Handle This!
- Spatial Effects
  - Gaussian Spatial Smoothing
- Intensity Effects
  - Loess Smoothing
- Combination Effects
  - Robust Splines
All methods well-established
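A minimal sketch of channel balancing on a two-colour array, using made-up red and green intensities. Note the substitution: this is the simplest global version (median-centring the log-ratios), not loess or robust splines. Those methods fit the shift as a smooth function of average intensity A instead of using one constant for the whole array. The function names are hypothetical.

```python
import math
from statistics import median


def ma_values(red, green):
    """Return per-spot log-ratio M and average log-intensity A."""
    m = [math.log2(r / g) for r, g in zip(red, green)]
    a = [0.5 * math.log2(r * g) for r, g in zip(red, green)]
    return m, a


def median_center(m):
    """Shift log-ratios so that the typical spot has M = 0."""
    shift = median(m)
    return [x - shift for x in m]


# Hypothetical array where the red channel is about 2x brighter overall.
red = [200.0, 800.0, 1600.0, 6400.0, 400.0]
green = [100.0, 400.0, 800.0, 3200.0, 800.0]

m, a = ma_values(red, green)
m_norm = median_center(m)

print(m)
print(m_norm)
```

After centring, most spots show no change and only the last spot stands out, which is the behaviour expected if most genes are not differentially expressed.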
Slide 61: Quantitation
?
Slide 62: Step 5: Inter-Array Normalization
- Why? Balance arrays
- How? Multiple robust algorithms
- Difficulty?
- Research?
Slide 63: Balancing Arrays
- Problem
  - Pipette error can lead to differential loading
  of sample between arrays
- Solution
  - Scale arrays
Extremely easy to handle
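One simple way to scale arrays, sketched under the assumption that every array should share the same median intensity. The data are invented, `scale_to_common_median` is a hypothetical helper, and other scaling targets (mean, a fixed constant, quantiles) are equally possible.

```python
from statistics import median


def scale_to_common_median(arrays):
    """Multiply each array by a constant so all arrays share one median."""
    medians = [median(arr) for arr in arrays]
    target = median(medians)
    return [[x * target / m for x in arr] for arr, m in zip(arrays, medians)]


# Three hypothetical arrays of the same sample with different loading.
arrays = [
    [100.0, 200.0, 400.0],
    [150.0, 300.0, 600.0],  # 1.5x more sample loaded
    [50.0, 100.0, 200.0],   # half as much sample loaded
]

scaled = scale_to_common_median(arrays)

for arr in scaled:
    print(arr)
```

After scaling, the three arrays become directly comparable because the loading differences have been divided out.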
Slide 64: Scaling Has a Major Effect
[Figure: distribution p(I) of intensity, before and after scaling]
Slide 65: Quantitation
?
Slide 66: Significance Testing
- Why? Find spots that change
- How? Statistical tests
- Difficulty?
- Research?
Slide 67: Significance Testing Questions
1. Are these two groups different?
2. Do these two things synergize?
3. Does treatment affect patient outcome?
4. Can we predict clinical features?
In your assignment we will focus on 1, and a
little on 4.
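For question 1, a sketch of a per-gene two-group comparison using Welch's t-statistic. The choice of test is an assumption made here for illustration; the slides do not name one. The expression values are made up and `welch_t` is a hypothetical helper.

```python
import math
from statistics import mean, variance


def welch_t(group_a, group_b):
    """Welch's t-statistic and approximate degrees of freedom."""
    na, nb = len(group_a), len(group_b)
    # Squared standard error of each group mean.
    va = variance(group_a) / na
    vb = variance(group_b) / nb
    t = (mean(group_a) - mean(group_b)) / math.sqrt(va + vb)
    # Welch-Satterthwaite approximation to the degrees of freedom.
    df = (va + vb) ** 2 / (va ** 2 / (na - 1) + vb ** 2 / (nb - 1))
    return t, df


# Hypothetical log2 expression of one gene in two groups of four arrays.
control = [7.1, 6.9, 7.3, 7.0]
treated = [8.2, 8.0, 8.5, 8.1]

t, df = welch_t(treated, control)
print(t, df)
```

Turning `t` and `df` into a p-value needs the t-distribution, which is not in the Python standard library. On a real array the same test is repeated for every gene, which is why multiple-testing correction matters.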
Slide 68: Quantitation
?
Slide 69: Clustering
- Why? Finding patterns in the data
- How? Unsupervised machine-learning
- Difficulty?
- Research?
Slide 70: Why is Clustering Used?
- Data visualization
- To predict class assignment
- To identify co-regulation
- Quality Control
Slide 71: Example: Predicting Gene Function
- Most genes have NO functional annotation
  - 1,500 / 7,000 yeast genes
  - 12,000 / 20,000 human genes
- Can we automatically estimate their function
based on their patterns of expression?
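A toy sketch of the idea behind this question: give an unannotated gene the annotation of the annotated gene whose expression profile it correlates with best. Gene names, annotations and profiles are all invented, and a real analysis would cluster thousands of profiles instead of taking a single nearest neighbour.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length profiles (non-constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


# Hypothetical annotated genes: name -> (function, expression profile).
annotated = {
    "geneA": ("ribosome biogenesis", [1.0, 2.0, 3.0, 4.0]),
    "geneB": ("stress response", [4.0, 3.0, 1.0, 0.5]),
}

# Hypothetical unannotated gene measured in the same four conditions.
unknown_profile = [0.9, 2.1, 2.9, 4.2]


def best_match(profile, annotated):
    """Name of the annotated gene with the most similar profile."""
    return max(annotated, key=lambda g: pearson(profile, annotated[g][1]))


match = best_match(unknown_profile, annotated)
print(match, annotated[match][0])
```

The unknown gene tracks `geneA`, so its function would be guessed as ribosome biogenesis; this is a hypothesis for follow-up, not a proof.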
Slide 72: Solution: Clustering of Expression Profiles
Hughes et al., Cell 2000
[Figure label: Tissues]
Slide 73: Abuses of Clustering?
- Clustering pre-selected data
  - Clustering after significance analysis is only
  for visualization
- Detecting differential expression
  - Clustering cannot replace significance-testing
- No assessment of chance
  - How likely is a given pattern to be observed by
  chance alone? Statistics exist to test this!
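One way to assess chance, sketched as a permutation test on the correlation between two expression profiles. This is an illustrative choice, not the specific statistic the lecture has in mind. The profiles are invented, and `permutation_p_value`, the number of permutations and the seed are all assumptions.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation of two equal-length profiles (non-constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def permutation_p_value(x, y, n_perm=2000, seed=0):
    """Fraction of shuffles with |r| at least as large as observed."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    observed = abs(pearson(x, y))
    y_shuffled = list(y)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(y_shuffled)
        if abs(pearson(x, y_shuffled)) >= observed:
            hits += 1
    # Add one to both counts so the estimate is never exactly zero.
    return (hits + 1) / (n_perm + 1)


# Two hypothetical genes that appear co-regulated across eight arrays.
gene_x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
gene_y = [1.1, 2.3, 2.8, 4.2, 5.1, 5.8, 7.2, 7.9]

p = permutation_p_value(gene_x, gene_y)
print(p)
```

A small value says the pattern is unlikely to arise from shuffled data; a pattern picked out by eye from a cluster diagram carries no such guarantee.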
Slide 74: Course Overview
- Lecture 1: What is Statistics? Introduction to R
- Lecture 2: Univariate Analyses I: continuous
- Lecture 3: Univariate Analyses II: discrete
- Lecture 4: Multivariate Analyses I: specialized models
- Lecture 5: Multivariate Analyses II: general models
- Lecture 6: Sequence Analysis
- Lecture 7: Microarray Analysis I: Pre-Processing
- Lecture 8: Microarray Analysis II: Multiple-Testing
- Lecture 9: Machine-Learning
- Final Exam (written)