Canadian Bioinformatics Workshops - PowerPoint PPT Presentation

Provided by: Michael3711
1
Canadian Bioinformatics Workshops
  • www.bioinformatics.ca

2
Module Title of Module
3
Lecture 7: Microarrays I: Data Pre-Processing
MBP1010, Dr. Paul C. Boutros, Winter 2015

Aegeus, King of Athens, consulting the Delphic
Oracle. High Classical (430 BCE)
DEPARTMENT OF MEDICAL BIOPHYSICS
This workshop includes material originally
developed by Drs. Raphael Gottardo, Sohrab
Shah, Boris Steipe and others

4
Let's start off with a question
What do expression microarrays actually measure?
5
What is a Microarray?
  • A DNA microarray is a multiplex technology
    consisting of thousands of oligonucleotide spots,
    each containing picomoles of a specific DNA
    sequence.
  • Used to quantitate mRNA or DNA
  • Many applications
      • mRNA or DNA levels
      • SNP identification
      • ChIP-on-Chip

6
Hypotheses
  • Microarrays are usually hypothesis-generating
  • They highlight specific genes or features that
    are particularly interesting for follow-up
    experiments
  • There are many interesting exceptions
      • Biomarkers
      • Pathway analyses
  • This does not reduce the importance of
    experimental design
      • The low statistical power of array studies makes
        good design even more important and very
        challenging

7
Input Samples
The nature of the sample is critical:
  • Unfrozen vs. Frozen vs. FFPE
  • Total RNA vs. poly-A RNA vs. other subsets
8
Microarray Basics
  • Imagine a one-spot microarray

Target DNA is labeled, hybridized, and washed.
Finally, scan the chip.
[Figure labels: Target, Chip, Feature, Probe]
9
These Are Spotted Arrays
Printed onto a series of glass slides by a robot
with needle-heads.
They produce a characteristic gridding pattern and
almost always use two samples simultaneously
(two-colour).
10
Other Types of Arrays
  • Inkjet Arrays
  • Photolithographically generated arrays
  • Bead arrays
  • Protein/cell/lipid-arrays
  • More niche applications
  • Not discussed here

11
InkJet Arrays
In 1999, HP spun off its life-science and
measurement division into Agilent Technologies.
The new company wanted to determine if printer
technology could be harnessed to generate
microarrays.
12
Inkjet Array Manufacture Involves Sequential
Nucleotide Addition
13
Photolithographic Arrays
  • Produced using the techniques developed for
    manufacturing transistors.
  • Mostly pioneered by the company Affymetrix,
    although other suppliers exist (e.g. Nimblegen)
  • We will be working with Affymetrix data later, so
    we will walk through the platform in significant
    detail

14
The Glass Matrix
  • Silanization

Addition of Linker molecule
15
Photolithographic Synthesis
  • Photolithographic mask

16
Deprotection
17
Nucleotide Addition
18
Nucleotide Addition
19
Nucleotide Addition
20
Capping Agents
21
Final Chip
[Figure labels: Wafer, Feature, Chip]
22
(No Transcript)
23
RNA Wash
24
RNA Wash
25
An Affymetrix Microarray
26
Self-Assembling Bead-Arrays
  • Produced by Illumina
  • 3 µm silica beads, randomly placed
  • coated with 10^5 identical 25 bp probes
  • probes have identifying barcode (address)
    sequences

[Figure labels: Labeled cDNA, bead, address, probe]
27
Comparing Array Platforms
Platform        Oligos
Spotted cDNA    variable
Affymetrix      25 bp
Inkjet          70 bp
Bead Arrays     25 bp
[Columns Price, Data Quality and Bioinformatics Research: graphical ratings, no transcript]
I do not endorse specific platforms: they all
have their strengths and weaknesses.
28
Each Spot is a Probe
A) Remove Noise
Quantitation
B) Extract Data
?
29
Step 1: Image Quantitation
  • Why? Quantitative vs. Qualitative
  • How? Image Segmentation
  • Difficulty?
  • Research?

30
Image Segmentation 101: Find Grids
1. Find Grids
2. Find Spots
3. Spot Outline
31
Image Segmentation 101: Find Spots
Key Step: Integrate Signal Across Array
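
As a rough illustration of integrating signal for one spot, the sketch below averages the pixels inside a circular spot outline (foreground) and the pixels outside it (background). The function name, the circular outline, and the tiny 5x5 image are assumptions made for illustration; real image-analysis packages use more elaborate outlines.

```python
from statistics import mean

def spot_intensities(image, centre, radius):
    """Return (foreground mean, background mean) for one circular spot.

    image  : list of rows of pixel intensities
    centre : (row, column) of the spot centre (hypothetical input)
    radius : spot radius in pixels (hypothetical input)
    """
    cy, cx = centre
    fg, bg = [], []
    for y, row in enumerate(image):
        for x, value in enumerate(row):
            # Pixels within the circle count as foreground.
            if (y - cy) ** 2 + (x - cx) ** 2 <= radius ** 2:
                fg.append(value)
            else:
                bg.append(value)
    return mean(fg), mean(bg)

# Illustrative 5x5 image: a bright plus-shaped spot on a dim background.
image = [
    [10.0, 10.0, 10.0, 10.0, 10.0],
    [10.0, 10.0, 100.0, 10.0, 10.0],
    [10.0, 100.0, 100.0, 100.0, 10.0],
    [10.0, 10.0, 100.0, 10.0, 10.0],
    [10.0, 10.0, 10.0, 10.0, 10.0],
]
fg_mean, bg_mean = spot_intensities(image, centre=(2, 2), radius=1)
print(fg_mean, bg_mean)
```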
32
Image Segmentation 101: Challenges
Problems:
  • Stray Signal
  • Missing Spots
  • Gross Deformities
  • Manual Validation
33
Research?
  • Surprisingly, not much investigation
  • This is probably a source of error in all studies
  • Manual checking of spot-detection remains the
    norm
  • Problematic as studies and arrays get larger

34
Quantitation
?
35
Step 2: Background Correction
  • Why? Remove Stray Signal
  • How? Model-based
  • Difficulty?
  • Research?

36
Spot Segmentation
Signal
???
Background
37
So what do we get?
Background Intensity: BG
Foreground Intensity: FG
Isn't it simple? Signal = FG - BG
NO! If BG > FG, then the signal is negative.
This happens for 0.1-2% of spots.
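
A minimal sketch of the naive subtraction described on this slide, showing how spots with background above foreground end up with a negative signal. The function names and the intensity values are hypothetical, chosen only to illustrate the problem.

```python
def subtract_background(fg, bg):
    """Naive per-spot signal: foreground minus background."""
    return [f - b for f, b in zip(fg, bg)]

def fraction_negative(signal):
    """Fraction of spots whose naive signal is below zero."""
    return sum(1 for s in signal if s < 0) / len(signal)

# Hypothetical intensities for five spots (illustrative values only).
fg = [1200.0, 310.0, 95.0, 4020.0, 88.0]
bg = [100.0, 120.0, 110.0, 130.0, 90.0]

signal = subtract_background(fg, bg)
print(signal)                     # two of the five values are negative
print(fraction_negative(signal))
```

Negative values cannot be log-transformed, which is one reason simple subtraction is not enough on its own.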
38
Why Might This Happen?
In 2001, two papers showed that empty spots have
less signal than background.
Unbound spots correspond to low-expression genes.
Thus unbound spots are particularly prone to
problems.
[Figure labels: Background Intensity (BG), Foreground Intensity (FG)]
39
So What to Do?
  • Heavy-duty mathematical tools employed
  • Three major models developed
      • Edwards: log-linear
      • Smyth: normexp
      • Kooperberg: Bayesian

The math is extremely advanced, so we'll skip
that for now. Let's summarize the methods instead.
40
Comparison
Method        Speed        Accuracy
Edwards       Fast         Good
NormExp       Slow         Better
Kooperberg    Very Slow    Best
No strong criteria for selecting between these
algorithms.
41
Quantitation
?
42
Step 3: Spot Quality
  • Why? Identify artefacts
  • How? Unknown
  • Difficulty?
  • Research?

43
Spot-Weighting
  • A perfect spot is used normally in analysis
      • Weight = 1
  • A poor spot is given less consideration
      • 0 < Weight < 1

Problem: How the heck do we calculate weights?
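
To show how weights would be used once they exist, here is a small sketch of a weighted mean over replicate spots. It does not address the open question of how to calculate the weights; the function name, the replicate values, and the weights are hypothetical.

```python
def weighted_mean(values, weights):
    """Average replicate spot values, down-weighting poor spots.

    Each weight is expected to lie between 0 (ignore) and 1 (perfect spot).
    """
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    if any(w < 0 or w > 1 for w in weights):
        raise ValueError("weights must lie between 0 and 1")
    total = sum(weights)
    if total == 0:
        raise ValueError("at least one weight must be positive")
    return sum(v * w for v, w in zip(values, weights)) / total

# Three hypothetical replicate spots (log2 scale); the third looks like an artefact.
values = [8.0, 8.2, 12.0]
weights = [1.0, 1.0, 0.1]

print(weighted_mean(values, weights))          # pulled only slightly by the poor spot
print(weighted_mean(values, [1.0, 1.0, 1.0]))  # plain mean for comparison
```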
44
A Few Approaches
  • Mean-Median Correlation
  • Composite q-metrics
  • Goal: improve homotypic signal:noise
  • But both fail sometimes, seemingly randomly.

Do we really need this?
45
All from one good-quality array!
46
But I use Affymetrix! (Or Agilent) (Or
Nimblegen) (Or Other Commercial Supplier)
47
Okay, Let's See Some Affy Data
48
(No Transcript)
49
(No Transcript)
50
(No Transcript)
51
Those Three Were From A Spike-In Experiment Done
by Affymetrix Themselves!
52
(No Transcript)
53
(No Transcript)
54
(No Transcript)
55
Spot Quality is An Issue, Regardless of Platform
56
Manual Flagging?
  • Two studies show error rates of 5-20%

Spot quality is a huge, unsolved problem. Most
investigators ignore it. Most bioinformaticians
struggle with it.
Then we ignore it too.
57
Quantitation
?
58
Step 4: Intra-Array Normalization
  • Why? Balance channels, remove spatial artifacts
  • How? Multiple robust algorithms
  • Difficulty?
  • Research?
59
Within-Array Normalization
  • 1. Spatial gradients
  • 2. Channel-balancing
  • 3. Intensity bias

Are red and green equal in our starting sample?
60
We Can Handle This!
  • Spatial Effects
      • Gaussian Spatial Smoothing
  • Intensity Effects
      • Loess Smoothing
  • Combination Effects
      • Robust Splines

All methods well-established
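
The intensity-effect correction can be sketched on the usual M (log-ratio) and A (average log-intensity) scale. The slides name loess smoothing; the sketch below deliberately swaps in a much cruder technique, subtracting the median M within bins of A, only to show the idea of an intensity-dependent correction. All function names and intensity values are assumptions.

```python
import math
from statistics import median

def ma_values(red, green):
    """Return M = log2(R/G) and A = 0.5 * log2(R*G) for each spot."""
    m = [math.log2(r / g) for r, g in zip(red, green)]
    a = [0.5 * math.log2(r * g) for r, g in zip(red, green)]
    return m, a

def median_center(m):
    """Global channel balancing: shift M so that its median is zero."""
    centre = median(m)
    return [x - centre for x in m]

def binned_median_center(m, a, n_bins=2):
    """Crude stand-in for loess: subtract the median M within bins of A."""
    order = sorted(range(len(a)), key=lambda i: a[i])
    size = math.ceil(len(order) / n_bins)
    out = [0.0] * len(m)
    for start in range(0, len(order), size):
        idx = order[start:start + size]
        centre = median(m[i] for i in idx)
        for i in idx:
            out[i] = m[i] - centre
    return out

# Hypothetical two-colour intensities: red is twice green at low intensity only.
red = [200.0, 400.0, 800.0, 1600.0, 3200.0, 6400.0]
green = [100.0, 200.0, 400.0, 1600.0, 3200.0, 6400.0]

m, a = ma_values(red, green)
print(median_center(m))            # global shift leaves an intensity-dependent bias
print(binned_median_center(m, a))  # bias removed within each intensity bin
```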
61
Quantitation
?
62
Step 5: Inter-Array Normalization
  • Why? Balance arrays
  • How? Multiple robust algorithms
  • Difficulty?
  • Research?
63
Balancing Arrays
  • Problem
      • Pipette error can lead to differential loading
        of sample between arrays
  • Solution
      • Scale arrays

Extremely easy to handle
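
One simple way to scale arrays is to multiply each array by a constant so that all arrays share the same median intensity. This is a sketch of that idea only; the slides do not say which scaling rule is used, and the function name and intensities are hypothetical.

```python
from statistics import median

def scale_to_common_median(arrays):
    """Rescale each array so that all arrays share the same median."""
    medians = [median(a) for a in arrays]
    target = sum(medians) / len(medians)  # common target: mean of the medians
    return [[x * target / m for x in a] for a, m in zip(arrays, medians)]

# Two hypothetical arrays; the second was loaded with twice as much sample.
array1 = [100.0, 200.0, 300.0]
array2 = [200.0, 400.0, 600.0]

scaled = scale_to_common_median([array1, array2])
print(scaled)  # both arrays now have identical intensities
```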
64
Scaling Has a Major Effect
[Figure: intensity distributions p(I) vs. Intensity, Before and After scaling]
65
Quantitation
?
66
Significance Testing
  • Why? Find spots that change
  • How? Statistical tests
  • Difficulty?
  • Research?
67
Significance Testing Questions
  1. Are these two groups different?
  2. Do these two things synergize?
  3. Does treatment affect patient outcome?
  4. Can we predict clinical features?

In your assignment we will focus on 1, and a
little 4
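
For question 1, a per-gene two-group comparison can be sketched with a t-statistic. The slides do not name a specific test, so Welch's unequal-variance t-statistic is an assumption here, and the expression values are hypothetical. Converting the statistic to a p-value needs the t distribution, which is not in the Python standard library, so that step is left out.

```python
import math
from statistics import mean, variance

def welch_t(x, y):
    """Welch's t-statistic for two independent groups."""
    se_x = variance(x) / len(x)  # squared standard error of mean(x)
    se_y = variance(y) / len(y)  # squared standard error of mean(y)
    return (mean(x) - mean(y)) / math.sqrt(se_x + se_y)

# Hypothetical log2 expression of one gene in two groups of three arrays.
treated = [5.0, 6.0, 7.0]
control = [1.0, 2.0, 3.0]

t_stat = welch_t(treated, control)
print(t_stat)
```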
68
Quantitation
?
69
Clustering
  • Why? Finding patterns in the data
  • How? Unsupervised machine-learning
  • Difficulty?
  • Research?
70
Why is Clustering Used?
  1. Data visualization
  2. To predict class assignment
  3. To identify co-regulation
  4. Quality Control

71
Example: Predicting Gene Function
  • Most genes have NO functional annotation
  • 1,500 / 7,000 yeast genes
  • 12,000 / 20,000 human genes
  • Can we automatically estimate their function
    based on their patterns of expression?

72
Solution: Clustering of Expression Profiles
Hughes et al., Cell 2000
[Figure label: Tissues]
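
The idea of inferring function from expression patterns can be sketched as a nearest-neighbour lookup: assign an unannotated gene the label of the annotated gene whose profile correlates best with it. This is a simplification and not the clustering procedure of the cited paper; the gene names, labels, and profiles are hypothetical.

```python
import math

def pearson(x, y):
    """Pearson correlation between two expression profiles."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def nearest_annotated(profile, annotated):
    """Return (gene, label) of the annotated gene most correlated with profile.

    annotated maps gene name -> (function label, expression profile).
    """
    best = max(annotated, key=lambda g: pearson(profile, annotated[g][1]))
    return best, annotated[best][0]

# Hypothetical annotated genes measured across four conditions.
annotated = {
    "geneA": ("ribosome biogenesis", [1.0, 2.0, 3.0, 4.0]),
    "geneB": ("stress response", [4.0, 3.0, 2.0, 1.0]),
}
unknown_profile = [2.0, 4.1, 5.9, 8.0]

guess = nearest_annotated(unknown_profile, annotated)
print(guess)
```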
73
Abuses of Clustering?
  • Clustering pre-selected data
      • Clustering after significance analysis is only
        for visualization
  • Detecting differential expression
      • Clustering cannot replace significance-testing
  • No assessment of chance
      • How likely is a given pattern to be observed by
        chance alone? Statistics exist to test this!
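
One way to ask how likely a pattern is by chance alone is a permutation test. The sketch below is an exact version for a small two-group comparison: it enumerates every relabelling of the samples and counts how often the difference in means is at least as large as the observed one. The choice of test, the function name, and the data are assumptions for illustration.

```python
from itertools import combinations
from statistics import mean

def exact_permutation_p(x, y):
    """Two-sided exact permutation p-value for a difference in means."""
    pooled = list(x) + list(y)
    observed = abs(mean(x) - mean(y))
    n = len(pooled)
    hits = total = 0
    # Enumerate every way of choosing which samples get the first label.
    for idx in combinations(range(n), len(x)):
        chosen = set(idx)
        a = [pooled[i] for i in idx]
        b = [pooled[i] for i in range(n) if i not in chosen]
        total += 1
        # Small tolerance guards against floating-point rounding.
        if abs(mean(a) - mean(b)) >= observed - 1e-12:
            hits += 1
    return hits / total

# Hypothetical expression values for one gene in two groups.
group1 = [5.0, 6.0, 7.0]
group2 = [1.0, 2.0, 3.0]

p_value = exact_permutation_p(group1, group2)
print(p_value)  # 2 of the 20 relabellings are as extreme as the observed split
```

With only three samples per group the smallest possible two-sided p-value is 0.1, which illustrates the low statistical power of small array studies.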

74
Course Overview
  • Lecture 1: What is Statistics? Introduction to R
  • Lecture 2: Univariate Analyses I: continuous
  • Lecture 3: Univariate Analyses II: discrete
  • Lecture 4: Multivariate Analyses I: specialized
    models
  • Lecture 5: Multivariate Analyses II: general
    models
  • Lecture 6: Sequence Analysis
  • Lecture 7: Microarray Analysis I: Pre-Processing
  • Lecture 8: Microarray Analysis II:
    Multiple-Testing
  • Lecture 9: Machine-Learning
  • Final Exam (written)