Canadian Bioinformatics Workshops - PowerPoint PPT Presentation

Transcript and Presenter's Notes

Title: Canadian Bioinformatics Workshops

1
Canadian Bioinformatics Workshops

www.bioinformatics.ca

2
Module Title of Module
3
Lecture 7: Microarrays I: Data Pre-Processing
MBP1010, Dr. Paul C. Boutros, Winter 2015

Aegeus, King of Athens, consulting the Delphic
Oracle. High Classical (430 BCE)
DEPARTMENT OF MEDICAL BIOPHYSICS
This workshop includes material originally
developed by Drs. Raphael Gottardo, Sohrab
Shah, Boris Steipe and others

4
Let's start off with a question
What do expression microarrays actually measure?
5
What is a Microarray?

A DNA microarray is a multiplex technology
consisting of thousands of oligonucleotide spots,
each containing picomoles of a specific DNA
sequence.
Used to quantitate mRNA or DNA
Many applications
mRNA or DNA levels
SNP identification
ChIP-on-Chip

6
Hypotheses

Microarrays are usually hypothesis-generating
They highlight specific genes or features that
are particularly interesting for follow-up
experiments
There are many interesting exceptions
Biomarkers
Pathway analyses
This does not reduce the importance of
experimental design
The low statistical power of array studies makes
good design even more important and very
challenging

7
Input Samples
The nature of the sample is critical
Unfrozen vs. Frozen vs. FFPE
Total RNA vs. poly-A RNA vs. other subsets
8
Microarray Basics

Imagine a one-spot microarray

Target DNA
is labeled
and hybridized
and washed.
Finally, scan the chip.
Target
Chip
Feature
Probe
9
These Are Spotted Arrays
Printed onto a series of glass slides
using a robot with needle-heads.
They produce a characteristic gridding pattern and
almost always use two samples simultaneously
(two-colour).
10
Other Types of Arrays

Inkjet Arrays
Photolithographically generated arrays
Bead arrays
Protein/cell/lipid-arrays
More niche applications
Not discussed here

11
InkJet Arrays
In 1999, HP spun off its life-science and
measurement division into Agilent Technologies.
The new company wanted to determine if printer
technology could be harnessed to generate
microarrays.
12
Inkjet Array Manufacture Involves Sequential
Nucleotide Addition
13
Photolithographic Arrays

Produced using the techniques developed for
manufacturing transistors.
Mostly pioneered by the company Affymetrix,
although other suppliers exist (e.g. Nimblegen)
We will be working with Affymetrix data later, so
we will walk through the platform in significant
detail

14
The Glass Matrix

Silanization

Addition of Linker molecule
15
Photolithographic Synthesis

Photolithographic mask

16
Deprotection
17
Nucleotide Addition
18
Nucleotide Addition
19
Nucleotide Addition
20
Capping Agents
21
Final Chip
Wafer
Feature
Chip
22
(No Transcript)
23
RNA Wash
24
RNA Wash
25
An Affymetrix Microarray
26
Self-Assembling Bead-Arrays

Produced by Illumina
3 µm silicon beads, randomly placed
coated with 10^5 identical 25 bp probes
probes have identifying barcode (address)
sequences

Labeled cDNA
bead
address
probe
27
Comparing Array Platforms
Platform        Oligos      Price   Data Quality   Bioinformatics Research
Spotted cDNA    variable
Affymetrix      25 bp
Inkjet          70 bp
Bead Arrays     25 bp

I do not endorse specific platforms; they all
have their strengths and weaknesses
28
Each Spot is a Probe
A) Remove Noise
Quantitation
B) Extract Data
?
29
Step 1: Image Quantitation

Why? Quantitative vs. Qualitative
How? Image Segmentation
Difficulty?
Research?

30
Image Segmentation 101: Find Grids
1. Find Grids
2. Find Spots
3. Spot Outline
31
Image Segmentation 101: Find Spots
Key Step: Integrate Signal Across Array
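One plausible reading of "integrate signal across the array" is to sum pixel intensities along every row and every column: rows and columns that contain spots then show up as peaks in these projection profiles. The sketch below illustrates that idea only. The function names and the toy image are assumptions made for this example, and real gridding software is considerably more robust.

```python
def projection_profiles(image):
    """Sum pixel intensities along rows and along columns of a 2-D image."""
    row_sums = [sum(row) for row in image]
    col_sums = [sum(col) for col in zip(*image)]
    return row_sums, col_sums


def find_peaks(profile):
    """Return indices that are local maxima and lie above the profile mean."""
    mean = sum(profile) / len(profile)
    peaks = []
    for i, value in enumerate(profile):
        left = profile[i - 1] if i > 0 else float("-inf")
        right = profile[i + 1] if i < len(profile) - 1 else float("-inf")
        if value > mean and value > left and value >= right:
            peaks.append(i)
    return peaks


# Toy 6x6 image: four bright spots on a dim background (made-up values).
image = [[1] * 6 for _ in range(6)]
for r in (1, 4):
    for c in (1, 4):
        image[r][c] = 50

row_sums, col_sums = projection_profiles(image)
spot_rows = find_peaks(row_sums)
spot_cols = find_peaks(col_sums)
print(spot_rows, spot_cols)
```

The peak rows and columns together define a candidate grid; each intersection is then a place to look for a spot outline.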
32
Image Segmentation 101: Challenges
Problems: Stray Signal, Missing Spots, Gross Deformities
Manual Validation
33
Research?

Surprisingly, not much investigation
This is probably a source of error in all studies
Manual checking of spot-detection remains the
norm
Problematic as studies and arrays get larger

34
Quantitation
?
35
Step 2: Background Correction

Why? Remove Stray Signal
How? Model-based
Difficulty?
Research?

36
Spot Segmentation
Signal
???
Background
37
So what do we get?
Background Intensity: BG
Foreground Intensity: FG
Isn't it simple? Signal = FG - BG
NO!
If BG > FG then -ve Signal
0.1-2% of spots
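The problem with plain subtraction can be seen in a few lines of Python. This is a minimal sketch with invented intensities; the function name is an assumption for illustration.

```python
def subtract_background(foreground, background):
    """Naive correction: signal = FG - BG for each spot."""
    return [fg - bg for fg, bg in zip(foreground, background)]


# Made-up foreground and background intensities for five spots.
fg = [1200.0, 310.0, 95.0, 80.0, 4500.0]
bg = [100.0, 120.0, 110.0, 85.0, 130.0]

signal = subtract_background(fg, bg)
negative = [i for i, s in enumerate(signal) if s <= 0]

print(signal)
print("spots with non-positive signal:", negative)
# Non-positive values cannot be log-transformed, which is why
# simple subtraction is not enough for low-intensity spots.
```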
38
Why Might This happen?
In 2001 two papers showed that empty spots have
less signal than background
Unbound spots correspond to low-expression genes
Background Intensity BG
Foreground Intensity FG
Thus unbound spots are particularly prone to
problems
39
So What to Do?

Heavy-duty mathematical tools employed
Three major models developed
Edwards: log-linear
Smyth: normexp
Kooperberg: Bayesian

The math is extremely advanced, so we'll skip
that for now. Let's summarize the methods instead.
40
Comparison
Method        Speed       Accuracy
Edwards       Fast        Good
NormExp       Slow        Better
Kooperberg    Very Slow   Best
No strong criteria for selecting between these
algorithms.
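To give a flavour of what a model-based correction does, the sketch below computes the expected true signal under a normal-plus-exponential convolution model, which is the idea behind normexp: observed intensity = exponentially distributed signal + normally distributed background. It assumes the three model parameters (background mean `mu`, background standard deviation `sigma`, mean signal `alpha`) are already known. Real implementations estimate them from the data, and the values used here are invented.

```python
import math


def normal_pdf(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def normal_cdf(z):
    return 0.5 * math.erfc(-z / math.sqrt(2.0))


def normexp_signal(x, mu, sigma, alpha):
    """Expected signal E[S | X = x] when X = S + B,
    S ~ Exponential(mean alpha), B ~ Normal(mu, sigma^2).

    Given x, S follows a normal distribution with mean m and
    standard deviation sigma, truncated to S >= 0.
    """
    m = x - mu - sigma * sigma / alpha
    z = m / sigma
    if z < -30.0:
        # Far left tail: pdf and cdf are both tiny, so use the
        # leading-order approximation sigma / |z| instead.
        return sigma / abs(z)
    return m + sigma * normal_pdf(z) / normal_cdf(z)


# Assumed (not estimated) parameters, for illustration only.
mu, sigma, alpha = 100.0, 10.0, 50.0
for x in (80.0, 100.0, 150.0, 1000.0):
    print(x, round(normexp_signal(x, mu, sigma, alpha), 3))
```

Unlike FG - BG, the corrected value stays positive even when the observed intensity is below the background mean, and for bright spots it approaches a simple offset subtraction.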
41
Quantitation
?
42
Step 3: Spot Quality

Why? Identify artefacts
How? Unknown
Difficulty?
Research?

43
Spot-Weighting

A perfect spot is used normally in analysis
Weight = 1
A poor spot is given less consideration
0 < Weight < 1

Problem: How the heck do we calculate weights?
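How weights are chosen is the open question, but how they are used is simple. As a minimal sketch, assuming weights between 0 and 1 have already been assigned somehow, replicate spots can be combined with a weighted mean. The function name and the numbers are invented for illustration.

```python
def weighted_mean(values, weights):
    """Average replicate spot intensities, down-weighting poor spots."""
    if any(w < 0.0 or w > 1.0 for w in weights):
        raise ValueError("weights must lie between 0 and 1")
    total_weight = sum(weights)
    if total_weight == 0.0:
        raise ValueError("at least one spot needs a non-zero weight")
    return sum(v * w for v, w in zip(values, weights)) / total_weight


# Three replicate spots for one gene (log2 intensities, made up).
# The third spot sits on a scratch, so it gets a low weight.
log_intensities = [8.0, 8.2, 12.0]
weights = [1.0, 1.0, 0.1]

print(sum(log_intensities) / 3)                 # unweighted mean
print(weighted_mean(log_intensities, weights))  # weighted mean
```

The weighted estimate stays close to the two good replicates, while the unweighted mean is pulled towards the artefact.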
44
A Few Approaches

Mean-Median Correlation
Composite q-metrics
→ improve homotypic signal:noise
But both fail sometimes, seemingly randomly.

Do we really need this?
45
All from one good-quality array!
46
But I use Affymetrix! (Or Agilent) (Or
Nimblegen) (Or Other Commercial Supplier)
47
Okay, Let's See Some Affy Data
48
(No Transcript)
49
(No Transcript)
50
(No Transcript)
51
Those Three Were From A Spike-In Experiment Done
by Affymetrix Themselves!
52
(No Transcript)
53
(No Transcript)
54
(No Transcript)
55
Spot Quality is An Issue, Regardless of Platform
56
Manual Flagging?

Two studies show error rates of 5-20%

Spot quality is a huge, unsolved problem. Most
investigators ignore it. Most bioinformaticians
struggle with it.
Then we ignore it too.
57
Quantitation
?
58
Step 4: Intra-Array Normalization
Why? Balance channels, remove spatial artifacts
How? Multiple robust algorithms
Difficulty?
Research?
59
Within-Array Normalization

1. Spatial gradients
2. Channel-balancing
3. Intensity bias

Are red and green equal in our starting sample?
60
We Can Handle This!

Spatial Effects: Gaussian Spatial Smoothing
Intensity Effects: Loess Smoothing
Combination Effects: Robust Splines

All methods well-established
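As an illustration of the intensity-effect case, the sketch below fits a loess-style curve (local linear regression with tricube weights) to the log-ratio M = log2(R/G) as a function of average log-intensity A = 0.5 * log2(R * G), then subtracts the fitted trend. It is a bare-bones version written for this example: the function names, the `frac` default and the data are assumptions, and production loess routines add robustness iterations that are omitted here.

```python
import math


def loess_fit(x, y, frac=0.5):
    """Local linear regression with tricube weights, evaluated at each x."""
    n = len(x)
    k = max(2, math.ceil(frac * n))  # number of neighbours in each window
    fitted = []
    for x0 in x:
        distances = sorted(abs(xi - x0) for xi in x)
        h = distances[k - 1] or 1e-12  # window half-width
        w = [(1.0 - min(abs(xi - x0) / h, 1.0) ** 3) ** 3 for xi in x]
        sw = sum(w)
        mx = sum(wi * xi for wi, xi in zip(w, x)) / sw
        my = sum(wi * yi for wi, yi in zip(w, y)) / sw
        sxx = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x))
        sxy = sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, x, y))
        slope = sxy / sxx if sxx > 0.0 else 0.0
        fitted.append(my + slope * (x0 - mx))
    return fitted


def loess_normalize(red, green, frac=0.5):
    """Return A values and loess-corrected log-ratios M for one array."""
    m = [math.log2(r / g) for r, g in zip(red, green)]
    a = [0.5 * math.log2(r * g) for r, g in zip(red, green)]
    trend = loess_fit(a, m, frac)
    return a, [mi - ti for mi, ti in zip(m, trend)]


# Made-up two-colour data: red is uniformly twice as bright as green,
# i.e. a pure dye bias with no real differential expression.
green = [float(2 ** i) for i in range(4, 14)]
red = [2.0 * g for g in green]
a_values, m_normalized = loess_normalize(red, green)
print([round(v, 6) for v in m_normalized])
```

After normalization the log-ratios are centred on zero, which answers the channel-balance question under the assumption that most genes do not change.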
61
Quantitation
?
62
Step 5: Inter-Array Normalization
Why? Balance arrays
How? Multiple robust algorithms
Difficulty?
Research?
63
Balancing Arrays

Problem
Pipette error can lead to differential loading
of sample between arrays
Solution
Scale arrays

Extremely easy to handle
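A minimal sketch of scaling, assuming each array is brought to a common median intensity. The choice of the median and of the mean-of-medians target are assumptions for this example; other reference points work the same way.

```python
import statistics


def scale_arrays(arrays, target=None):
    """Scale each array so that its median intensity equals a common target."""
    medians = [statistics.median(a) for a in arrays]
    if target is None:
        target = statistics.mean(medians)  # assumed choice of reference
    return [[v * target / med for v in a] for a, med in zip(arrays, medians)]


# Two made-up arrays; the second was loaded with twice as much sample.
array_1 = [100.0, 200.0, 400.0, 800.0, 1600.0]
array_2 = [2.0 * v for v in array_1]

scaled_1, scaled_2 = scale_arrays([array_1, array_2])
print(statistics.median(scaled_1), statistics.median(scaled_2))
```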
64
Scaling Has a Major Effect
[Figure: intensity distributions, p(I) against Intensity, before and after scaling]
65
Quantitation
?
66
Significance Testing
Why? Find spots that change
How? Statistical tests
Difficulty?
Research?
67
Significance Testing Questions

1. Are these two groups different?
2. Do these two things synergize?
3. Does treatment affect patient outcome?
4. Can we predict clinical features?

In your assignment we will focus on 1, and a
little 4
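For the first question, "are these two groups different?", one simple option is an exact permutation test on the difference in group means. The lecture does not name a specific test, so this is a stand-in chosen because it needs nothing beyond the standard library. The function name and the data are invented, and full enumeration is only practical for small groups.

```python
from itertools import combinations


def permutation_p_value(group_a, group_b):
    """Exact two-sided permutation test on the difference in group means."""
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    total = sum(pooled)

    def mean_difference(sum_a):
        return sum_a / n_a - (total - sum_a) / (len(pooled) - n_a)

    observed = abs(mean_difference(sum(group_a)))
    extreme = 0
    count = 0
    # Enumerate every way of relabelling n_a of the pooled samples as group A.
    for indices in combinations(range(len(pooled)), n_a):
        diff = abs(mean_difference(sum(pooled[i] for i in indices)))
        if diff >= observed - 1e-12:
            extreme += 1
        count += 1
    return extreme / count


# Made-up log2 expression values for one gene in two groups.
control = [7.1, 7.4, 6.9, 7.2]
treated = [8.3, 8.0, 8.6, 8.1]
print(permutation_p_value(control, treated))
```

On an array this is repeated for every gene, so the resulting p-values need a multiple-testing adjustment before any are called significant.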
68
Quantitation
?
69
Clustering
Why? Finding patterns in the data
How? Unsupervised machine-learning
Difficulty?
Research?
70
Why is Clustering Used?

Data visualization
To predict class assignment
To identify co-regulation
Quality Control

71
Example: Predicting Gene Function

Most genes have NO functional annotation
1,500 / 7,000 yeast genes
12,000 / 20,000 human genes
Can we automatically estimate their function
based on their patterns of expression?
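The simplest version of this idea is "guilt by association": give an unannotated gene the annotation of the annotated gene whose expression profile it correlates with best. The sketch below shows only that nearest-neighbour step, not a full clustering. Gene names, profiles and annotations are hypothetical, and profiles with zero variance are not handled.

```python
import math


def pearson(x, y):
    """Pearson correlation between two equally long expression profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def guess_function(profile, annotated):
    """Borrow the annotation of the most correlated annotated gene."""
    best_gene = max(annotated, key=lambda g: pearson(profile, annotated[g][0]))
    best_profile, function = annotated[best_gene]
    return best_gene, function, pearson(profile, best_profile)


# Hypothetical genes, profiles (four conditions) and annotations.
annotated = {
    "geneA": ([1.0, 2.0, 3.0, 4.0], "ribosome biogenesis"),
    "geneB": ([4.0, 3.0, 2.0, 1.0], "stress response"),
    "geneC": ([1.0, 3.0, 1.0, 3.0], "cell cycle"),
}
unknown_profile = [2.1, 3.9, 6.2, 8.0]

print(guess_function(unknown_profile, annotated))
```

A high correlation is only a hint about function; it still has to be judged against what would be expected by chance.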

72
Solution: Clustering of Expression Profiles
Hughes et al., Cell 2000
Tissues
73
Abuses of Clustering?

Clustering pre-selected data
Clustering after significance analysis is only
for visualization
Detecting differential expression
Clustering cannot replace significance-testing
No assessment of chance
How likely is a given pattern to be observed by
chance alone? Statistics exist to test this!

74
Course Overview