Title: Fusing Results from Microarray Experiments

1
Fusing Results from Microarray Experiments

Matt.Boardman_at_dal.ca
http://www.cs.dal.ca/boardman

2
Summary

Primary paper:
Gilks et al, Fusing microarray experiments with
multivariate regression, Bioinformatics, 2005.
The basic idea:
Microarray experiments are subject to noise and
variation
A regression model fuses data from several
microarray tests
Unique visualization!
Application:
Rustici et al, Periodic gene expression program
of the fission yeast cell cycle, Nature
Genetics, 2004.

3
Agenda

Introduction to Microarrays
Regression Model
Experimental Procedures
Visualization of Results
Project Proposal

4
DNA Transcription to mRNA
Orengo et al, Figure 1.1
5
DNA Microarrays

A microarray is a collection of thousands of
small test locations, arranged in a 1" x 3"
array.
Each test location has a small fragment of DNA,
called a probe (about 20-70 bases), which
corresponds to a particular gene.
Fragments of mRNA (recently transcribed messenger
RNA) from a test subject bind to each probe.
We measure the quantity of mRNA that sticks to
each probe, to determine how much mRNA for that
gene is present in the sample.

http://www.agilent.com/about/newsroom/lsca/imagelibrary/index_2003.html
6
DNA Microarrays

Flash demo:
http://www.bio.davidson.edu/Courses/genomics/chip/chipQ.html

7
DNA Microarrays

Slide Manufacturers
Agilent (HP spinoff)
Amersham Codelink
Corning CMT GAPS II
Erie Sciences Gold Seal

Scanner Manufacturers
Affymetrix
Agilent
Applied Precision
Asper Biotech
Axon
Molecular Devices
National Instruments
Vidar
Commercial Software
Axon/GenePix
GeneExplorer
Iobion-Stratagene/GeneTraffic
Rosetta Resolver
Spotfire
SGI/GeneSpring

http://microarrays.ucsd.edu/biogem/resources/images/agilent_scanner.jpg
8
DNA Microarrays

Sources of error
gene-specific dye bias
probe design and manufacturing
heterogeneity in source material (The Fly!)
glass surface abnormalities (warpage, curvature)
variations in glass thickness
slide movement within scanner
slide manufacturing quality
mRNA deterioration
Remedies
daily calibration
dynamic autofocus (Agilent)
software fixes (e.g. normalization)
repeat, repeat, repeat

http://www.moleculardevices.com/pages/instruments/gn_genepix4000.html
9
Multivariate Regression Model

Microarray test repetition
Different laboratories
Different slides / scanners / software
Different procedures for sample preparation
Authors propose a new model to combine data from
multiple microarray tests
No need to infer the causes of error
Automatically filter out noise and artefacts
Iteratively weight each test based on quality of
results
Avoid polluting high-quality results with lower
quality data
Deliver fused and cleaned dataset for further
analysis

10
Multivariate Regression Model

Let
N be the number of microarray tests
m be the number of genes in each microarray
n be the number of hypothetical cell types under
test
Note
Typically m >> N
We don't know n, but we assume n < N

11
Multivariate Regression Model
D = XC + e   (Gilks et al, Equation 1)

Where:
D is the matrix of observations: the actual
microarray tests
X is a matrix of weights, uniquely designed for
each experiment
C is the ideal, perfect microarray test with no
variation or noise
e contains unknown residual errors and noise
So:
D are the warped, noisy observations of the
perfect microarray test C
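
To make the roles of D, X, C and e concrete, here is a minimal simulation sketch. It assumes D is N x m, X is N x n and C is n x m, and it uses made-up random weights and ordinary least squares with a known X; none of this is taken from the paper itself.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: N tests, m genes, n ideal profiles (n < N).
N, m, n = 40, 50, 3

C_true = rng.normal(size=(n, m))        # the ideal, noise-free tests (unknown in practice)
X = rng.dirichlet(np.ones(n), size=N)   # illustrative weights; each row sums to 1
e = 0.05 * rng.normal(size=(N, m))      # residual errors and noise
D = X @ C_true + e                      # the observed, noisy microarray tests

# If X is known, least squares gives an estimate of C for every gene at once.
C_hat, *_ = np.linalg.lstsq(X, D, rcond=None)

max_abs_error = float(np.abs(C_hat - C_true).max())
```

With many more tests than ideal profiles, the noise averages out and C_hat lands close to C_true.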

12
Experiment: Periodic Cell-Cycles in Yeast

Question:
Which genes are involved in cell reproduction in
yeast?
Schizosaccharomyces pombe (fission yeast)
Nine experiments were designed in order to
synchronize the cell cycles in yeast:
centrifugal elutriation
cdc25 block-release
combinations of both methods
Microarrays taken every 15 minutes, for roughly
two cell cycles (about 5.5 hours)

13
Experiment: Periodic Cell-Cycles in Yeast

Goal:
Fuse these nine different experiments into one
ideal
Result will be a set of microarray results for
one cell-cycle
Problem:
Different synchronization methods → different
cell-cycles
Experiments are not exactly in phase with each
other
Experiments result in different cell-cycle
lengths

14
Experiment: Periodic Cell-Cycles in Yeast

These nine experiments produce N = 178 microarray
tests
Each microarray test has m = 407 genes
Selected since they are identified as periodic in
cell-cycle
136 of these show significant changes during
cycle
Define an ideal cell-cycle, divided into n = 10
fusion times
Each microarray test will be at a different
angle in the ideal cycle
The coefficients in X are chosen to weight the
relevance of each microarray test to each of the
fusion times

15
Experiment: Periodic Cell-Cycles in Yeast

How are the coefficients in X chosen?
Suppose microarray test h occurs at angle θ_h in the
cell-cycle
Linear interpolation:
Find the two fusion times on either side of θ_h
Weight each one according to how close it is
to θ_h
The other fusion times for h have a zero weight
How is this done? We don't know θ_h!
The algorithm assumes initial weight values, then
iteratively updates them according to the resulting
generalization error
Authors claim convergence of these weights within
3 or 4 iterations, but continue through 10
iterations in their results for precision

See Gilks et al, Equation 7
See Gilks et al, Equation 6
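
The linear interpolation step can be sketched as follows, assuming the angle of a test is already known. The function name `interpolation_row` and the choice to place the first fusion time at angle 0 are illustrative assumptions, not details from the paper.

```python
import numpy as np

def interpolation_row(theta, n_fusion=10):
    """One row of X: weights of a test at angle `theta` (radians)
    over `n_fusion` evenly spaced fusion times on the cycle."""
    step = 2 * np.pi / n_fusion
    pos = (theta % (2 * np.pi)) / step   # position in units of the fusion-time spacing
    lower = int(np.floor(pos)) % n_fusion
    upper = (lower + 1) % n_fusion       # wraps around, since the cycle is periodic
    frac = pos - np.floor(pos)           # how far the test sits past the lower fusion time
    row = np.zeros(n_fusion)
    row[lower] += 1.0 - frac             # the closer fusion time gets the larger weight
    row[upper] += frac
    return row

# A test halfway between the second and third fusion times.
row = interpolation_row(0.3 * np.pi)
# A test near the end of the cycle, between the last and first fusion times.
wrapped = interpolation_row(1.95 * np.pi)
```

Only two entries per row are non-zero, and each row sums to 1, which matches the "zero weight for the other fusion times" rule above.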
16
Experiment: Periodic Cell-Cycles in Yeast

Now use the D = XC + e model to estimate C
But how do we know the answer we get is correct?
Need a technique to visualize the results!
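
One way to picture estimating C while down-weighting poor tests is a reweighted least-squares loop. This is only a sketch on simulated data: the inverse-residual-variance weights are an assumption chosen for illustration and are not the authors' update rule.

```python
import numpy as np

rng = np.random.default_rng(2)
N, m, n = 60, 80, 4                      # hypothetical sizes

X = rng.dirichlet(np.ones(n), size=N)    # illustrative design weights
C_true = rng.normal(size=(n, m))
noise_sd = np.full(N, 0.05)
noise_sd[:6] = 1.0                       # six deliberately poor tests
D = X @ C_true + noise_sd[:, None] * rng.normal(size=(N, m))

w = np.ones(N)                           # start with every test weighted equally
for _ in range(10):
    sw = np.sqrt(w)[:, None]
    # Weighted least squares: scale each test (row) by the square root of its weight.
    C_hat, *_ = np.linalg.lstsq(sw * X, sw * D, rcond=None)
    # Tests that fit the fused result badly have a large residual variance.
    resid_var = np.mean((D - X @ C_hat) ** 2, axis=1)
    w = 1.0 / resid_var
    w /= w.mean()                        # keep the weights on a comparable scale
```

After a few passes the six noisy tests carry far less weight than the rest, so they no longer pollute the estimate of C.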

17
Singular Value Decomposition (SVD)

A technique in linear algebra
Commonly used to solve systems of linear
equations
Also used for linear least-squares problems, or
curve fitting
The authors use SVD to find the two eigenvectors
of a matrix which exhibit the highest variation
i.e. the most variable components of a matrix
not part of the actual model, just used for
visualization
Similar in purpose to PCA (Principal Components
Analysis), which identifies the components with
highest variance
For more information on SVD and PCA with
bioinformatics applications, see Wall et al.
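
A minimal sketch of using SVD for this kind of two-dimensional view, on synthetic periodic data. The matrix layout (fusion times in rows, genes in columns) and the per-gene centring are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(1)
n_times, n_genes = 10, 40

# Synthetic periodic expression: each gene peaks at its own phase of the cycle.
angles = 2 * np.pi * np.arange(n_times) / n_times
phases = rng.uniform(0, 2 * np.pi, size=n_genes)
C = np.cos(angles[:, None] - phases[None, :]) + 0.05 * rng.normal(size=(n_times, n_genes))

C_centred = C - C.mean(axis=0)           # centre each gene before decomposing
U, s, Vt = np.linalg.svd(C_centred, full_matrices=False)

explained = s**2 / np.sum(s**2)          # share of variation per component
gene_coords = (Vt[:2] * s[:2, None]).T   # each gene as a point in the plane
time_coords = U[:, :2]                   # each fusion time as a direction in the plane
```

For purely periodic genes the first two components carry almost all of the variation, which is why a flat plot can show the whole cycle.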

18
Gilks et al, Figure 1
19
Closeup of experiment cdc25-1
Ten fusion times are evenly spaced at π/5
radian intervals in the cycle.
Gilks et al, Figure 2
20
Peppered Fried Egg Plot
Figure labels: Fusion Times, Specific Genes, Cell-Cycleness, Gene Density
Fusion times are evenly spaced at intervals of
π/5 radians. Longer arrows indicate more
variability in gene expression levels at this
fusion time.
The pepper represents the periodic activation
of particular genes. Larger radius from the
origin indicates more cell-cycle dependence.
The boundary of the yolk represents the average
radius from the origin of all genes, at each
point in the cell cycle.
The boundary of the egg white represents the
average gene density, at each point in the cell
cycle.
Gilks et al, Figure 4
21
Multivariate Regression Model

Possible difficulties with proposed algorithm?
Assumes linear relationships for simplicity of
algorithm
Note the linear interpolation in our choice of X
coefficients
Microarray tests which fail to cohere with the
generality of results will be downweighted
automatically, as part of the algorithm
In other words, the majority wins: what if the
majority of experiments have been conducted
poorly?
Difference in coverage over cell-cycle
Some parts of the cell-cycle have many
contributors, others few
Treatment of missing data: KNN (K Nearest
Neighbors)
However, these imputed data points have the
same weight in the algorithm as the measured data
points
Doesn't address some significant sources of
error, such as gene-specific dye bias
Most microarray experiments use the same dyes,
Cy3 and Cy5
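
The KNN treatment of missing data mentioned above can be sketched like this. The function `knn_impute`, the genes-by-tests layout and the choice of k are illustrative assumptions; the paper's exact procedure may differ.

```python
import numpy as np

def knn_impute(data, k=3):
    """Fill NaNs in a genes x tests matrix with the mean of the
    k most similar fully observed genes."""
    data = np.asarray(data, dtype=float)
    filled = data.copy()
    for i in np.flatnonzero(np.isnan(data).any(axis=1)):
        missing = np.isnan(data[i])
        # Candidate neighbours: other genes with no missing values.
        candidates = [j for j in range(len(data))
                      if j != i and not np.isnan(data[j]).any()]
        # Distance is measured only over the tests observed for gene i.
        dists = [np.sqrt(np.mean((data[j, ~missing] - data[i, ~missing]) ** 2))
                 for j in candidates]
        nearest = [candidates[t] for t in np.argsort(dists)[:k]]
        filled[i, missing] = data[nearest][:, missing].mean(axis=0)
    return filled

data = np.array([
    [1.0, 2.0, 3.0],
    [1.1, 2.1, 3.1],
    [0.9, 1.9, 2.9],
    [10.0, 20.0, 30.0],      # a very different gene, should not be a neighbour
    [1.0, 2.0, np.nan],      # one missing measurement
])
filled = knn_impute(data, k=3)
```

Note that the filled-in value looks exactly like a measurement afterwards, which is the weighting concern raised on this slide.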

22
Project Proposal

Can we use different methods to obtain similar
results?
SVM regression (Support Vector Machines)?
To model the ideal, noise-free microarray test at
any point in cycle
ICA (Independent Components Analysis)?
Identify contributions from n different cell
types and a noise component
Simulated Annealing (a stochastic optimization
method)?
Identify the best cell-cycle synchronization
points
Why SVM regression?
Ability to generalize from a low number of
samples
Detect non-linear relationships (paper assumes
linear!)
Why ICA?
Computationally complex, but requires no
assumptions about underlying data or noise models
(we don't need to know n!)

23
References

Primary Paper
W.R.Gilks, B.D.M.Tom, A.Brazma, Fusing
microarray experiments with multivariate
regression, Bioinformatics, 21(Suppl.
2):137-143, 2005.
Experimental Procedures
G.Rustici, J.Mata, K.Kivinen, P.Lió, C.J.Penkett,
G.Burns, J.Hayles, A.Brazma, P.Nurse, J.Bähler,
Periodic gene expression program of the fission
yeast cell cycle, Nature Genetics,
36(8):809-817, 2004.
Microarrays
Wikipedia Contributors, DNA microarray,
(http://en.wikipedia.org/wiki/CDNA_microarray),
2006.
A.M.Campbell, DNA microarray methodology: Flash
animation, Department of Biology, Davidson
College, Davidson, NC,
(http://www.bio.davidson.edu/Courses/genomics/chip/chipQ.html), 2001.
C.A.Orengo, D.T.Jones, J.M.Thornton,
Bioinformatics: Genes, Proteins and Computers, New
York: Springer-Verlag, pp. 218-228, 2003.
Singular Value Decomposition (SVD)
W.H.Press, S.A.Teukolsky, W.T.Vetterling,
B.P.Flannery, Numerical Recipes in C: The Art of
Scientific Computing, 2nd ed., Cambridge
University Press, pp. 59-70, 1992.
M.E.Wall, A.Rechtsteiner, L.M.Rocha, Singular
value decomposition and principal component
analysis, in A Practical Approach to Microarray
Data Analysis, D.P.Berrar, W.Dubitzky, M.Granzow,
eds., pp. 91-109, Kluwer: Norwell, MA, 2003.
Support Vector Machines (SVM)
K.P.Bennett, C.Campbell, Support vector
machines: Hype or hallelujah?, SIGKDD
Explorations, 2(2):1-13, 2000.
A.J.Smola, B.Schölkopf, A tutorial on support
vector regression, Statistics and Computing,
14(3):199-222, 2004.
V.N.Vapnik, The Nature of Statistical Learning
Theory, 2nd ed., New York: Springer-Verlag, 1999.
Independent Components Analysis (ICA)
A.Hyvärinen, Survey on independent components
analysis, Neural Computing Surveys, 2:94-128,
1999.

24
Support Vector Machines

SVMs use statistical machine learning
Constrained optimization problem
Objective: find a hyperplane which
maximizes the margin
Higher dimensional mappings provide flexibility
Non-separable data: a tradeoff to allow
misclassification of some points in order to improve
generalization performance (cost parameter)
Non-linear SVM (Polynomial, Sigmoid, Gaussian
kernels)

25
Support Vector Machines

The importance of data normalization (centre and
scale)
The importance of free-parameter selection

Dataset from MLDB: Iris Plant Database
26
ε-Tube Support Vector Regression
Bennett et al, Figure 12

Can we use ε-SVR for outlier detection?
i.e. identify contributing samples which are
outside the ε boundary, remove them, and retrain
the model
Missing data: can we include the number of
missing data points as another input variable for
the SVM model?
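
The fit, filter and retrain idea can be sketched without a full SVM. In the example below an ordinary least-squares line stands in for the trained ε-SVR (a deliberate simplification), and the function name `epsilon_tube_filter` and the toy data are assumptions; only the ε-tube test itself is the point.

```python
import numpy as np

def epsilon_tube_filter(x, y, epsilon, degree=1):
    """Fit a curve, keep only the samples inside the epsilon tube, then refit.
    A least-squares polynomial is used here as a stand-in for an SVR model."""
    coeffs = np.polyfit(x, y, degree)
    inside = np.abs(y - np.polyval(coeffs, x)) <= epsilon   # within the tube?
    refit = np.polyfit(x[inside], y[inside], degree)        # retrain without outliers
    return inside, refit

# Toy data: a straight line with one grossly wrong sample in the middle.
x = np.linspace(0.0, 1.0, 21)
y = 2.0 * x + 1.0
y[10] += 5.0

inside, refit = epsilon_tube_filter(x, y, epsilon=1.0)
```

The single bad sample falls outside the tube and is dropped, so the refitted line recovers the slope and intercept of the clean data.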

27
Independent Components Analysis

ICA attempts to find the true underlying signals
from multiple observations of a mix of signals
Finds signals which are as statistically
independent from one another as possible: blind
source separation
Different to PCA, which identifies the measured
signals with highest variance
For example, consider a hypothetical political
debate:
Martin and Harper are speaking at the same time
Two omnidirectional microphones are listening to both
speakers
ICA can isolate each speaker's voice!
For a demo: http://www2.ele.tue.nl/ica99/realworld.html
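
The two-speakers, two-microphones setup can be imitated with synthetic signals. The sketch below is a small symmetric FastICA (one common ICA algorithm, with a tanh nonlinearity); the signals, mixing matrix and function name `fast_ica` are all assumptions made for illustration.

```python
import numpy as np

def fast_ica(X, n_iter=200, seed=0):
    """Symmetric FastICA with a tanh nonlinearity. X has shape (signals, samples)."""
    X = X - X.mean(axis=1, keepdims=True)
    # Whiten: make the observed signals uncorrelated with unit variance.
    eigval, eigvec = np.linalg.eigh(np.cov(X))
    Z = eigvec @ np.diag(eigval ** -0.5) @ eigvec.T @ X
    k, T = Z.shape
    W = np.random.default_rng(seed).normal(size=(k, k))
    for _ in range(n_iter):
        G = np.tanh(W @ Z)
        # Fixed-point update for every unmixing vector (row of W) at once.
        W = (G @ Z.T) / T - np.diag((1 - G**2).mean(axis=1)) @ W
        # Symmetric decorrelation keeps the rows orthonormal.
        u, _, vt = np.linalg.svd(W)
        W = u @ vt
    return W @ Z

# Two "speakers": a sine wave and a square wave.
t = np.linspace(0, 4 * np.pi, 4000, endpoint=False)
S = np.vstack([np.sin(2 * t), np.sign(np.sin(3 * t))])
# Two "microphones", each hearing a different mix of both speakers.
A = np.array([[1.0, 0.5], [0.5, 1.0]])
recovered = fast_ica(A @ S)

# Absolute correlation between each true source and each recovered signal.
corr = np.abs(np.corrcoef(np.vstack([S, recovered]))[:2, 2:])
```

ICA cannot recover the order or sign of the sources, so the comparison uses absolute correlations and the best match for each speaker.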

28
Independent Components Analysis: Test
29
Independent Components Analysis: Samples
30
Independent Components Analysis: Results
31
Cell-cycle for Selected Genes
Gilks et al, Figure 5
