Fusing Results from Microarray Experiments

Provided by: matthewd9
Transcript and Presenter's Notes

1
Fusing Results from Microarray Experiments
  • Matt.Boardman_at_dal.ca
  • http://www.cs.dal.ca/boardman

2
Summary
  • Primary paper
  • Gilks et al, Fusing microarray experiments with
    multivariate regression, Bioinformatics, 2005.
  • The basic idea
  • Microarray experiments are subject to noise and
    variation
  • Regression model fuses data from several
    microarray tests
  • Unique visualization!
  • Application
  • Rustici et al, Periodic gene expression program
    of the fission yeast cell cycle, Nature
    Genetics, 2004.

3
Agenda
  • Introduction to Microarrays
  • Regression Model
  • Experimental Procedures
  • Visualization of Results
  • Project Proposal

4
DNA Transcription to mRNA
Orengo et al, Figure 1.1
5
DNA Microarrays
  • A microarray is a collection of thousands of
    small test locations, arranged in a 1" x 3"
    array.
  • Each test location has a small fragment of DNA,
    called a probe (about 20-70 bases), which
    corresponds to a particular gene.
  • Fragments of mRNA (recently transcribed messenger
    RNA) from a test subject bind to each probe.
  • We measure the quantity of mRNA that sticks to
    each probe, to determine how much mRNA for that
    gene is present in the sample.

http://www.agilent.com/about/newsroom/lsca/imagelibrary/index_2003.html
6
DNA Microarrays
  • Flash demo
  • http://www.bio.davidson.edu/Courses/genomics/chip/chipQ.html

7
DNA Microarrays
  • Slide Manufacturers
  • Agilent (HP spinoff)
  • Amersham Codelink
  • Corning CMT GAPS II
  • Erie Sciences Gold Seal
  • Scanner Manufacturers
  • Affymetrix
  • Agilent
  • Applied Precision
  • Asper Biotech
  • Axon
  • Molecular Devices
  • National Instruments
  • Vidar
  • Commercial Software
  • Axon/GenePix
  • GeneExplorer
  • Iobion-Stratagene/GeneTraffic
  • Rosetta Resolver
  • Spotfire
  • SGI/GeneSpring

http://microarrays.ucsd.edu/biogem/resources/images/agilent_scanner.jpg
8
DNA Microarrays
  • Sources of error
  • gene-specific dye bias
  • probe design and manufacturing
  • heterogeneity in source material (The Fly!)
  • glass surface abnormalities (warpage, curvature)
  • variations in glass thickness
  • slide movement within scanner
  • slide manufacturing quality
  • mRNA deterioration
  • Remedies
  • daily calibration
  • dynamic autofocus (Agilent)
  • software fixes (e.g. normalization)
  • repeat, repeat, repeat

http://www.moleculardevices.com/pages/instruments/gn_genepix4000.html
9
Multivariate Regression Model
  • Microarray test repetition
  • Different laboratories
  • Different slides / scanners / software
  • Different procedures for sample preparation
  • Authors propose a new model to combine data from
    multiple microarray tests
  • No need to infer the causes of error
  • Automatically filter out noise and artefacts
  • Iteratively weight each test based on quality of
    results
  • Avoid polluting high-quality results with lower
    quality data
  • Deliver fused and cleaned dataset for further
    analysis
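
The idea of iteratively weighting each test by the quality of its results can be sketched in a few lines. This is only an illustration under simple assumptions, not the authors' scheme: every test measures the same genes, the fused profile is a weighted mean, and a test's weight is the inverse of its mean squared residual. The function name `fuse_tests`, the `floor` parameter and the data values are all hypothetical.

```python
def fuse_tests(tests, iterations=10, floor=0.01):
    """Fuse replicate tests (lists of per-gene values) into one profile.

    Each pass computes a weighted per-gene mean, then sets each test's
    weight to the inverse of its mean squared residual from that mean.
    `floor` keeps the weights bounded when a test fits almost perfectly.
    """
    n_tests, n_genes = len(tests), len(tests[0])
    weights = [1.0] * n_tests  # start by trusting every test equally
    fused = []
    for _ in range(iterations):
        total = sum(weights)
        fused = [sum(w * t[g] for w, t in zip(weights, tests)) / total
                 for g in range(n_genes)]
        weights = [1.0 / (floor + sum((t[g] - fused[g]) ** 2
                                      for g in range(n_genes)) / n_genes)
                   for t in tests]
    return fused, weights


# Three consistent tests and one badly distorted test (made-up values).
tests = [
    [1.0, 2.0, 3.0],
    [1.1, 1.9, 3.0],
    [0.9, 2.1, 3.1],
    [3.0, 0.0, 5.0],
]
fused, weights = fuse_tests(tests)
```

The distorted fourth test ends up with the smallest weight, so it contributes little to the fused profile. The same mechanism also illustrates the risk of "majority wins": if most tests were distorted in the same way, the good test would be the one downweighted.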

10
Multivariate Regression Model
  • Let
  • N be the number of microarray tests
  • m be the number of genes in each microarray
  • n be the number of hypothetical cell types under
    test
  • Note
  • Typically m ≫ N
  • We don't know n, but we assume n < N

11
Multivariate Regression Model
D = XC + e
Gilks et al, Equation 1
  • Where
  • D is the matrix of observations: the actual
    microarray tests
  • X is a matrix of weights, uniquely designed for
    each experiment
  • C is the ideal, perfect microarray test with no
    variation or noise
  • e contains unknown residual errors and noise
  • So
  • D are the warped, noisy observations of the
    perfect microarray test C
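
To make the shapes in the model D = XC + e concrete, here is a minimal sketch that estimates C by ordinary least squares, C_hat = (X^T X)^(-1) X^T D, with D of size N x m, X of size N x n and C of size n x m. Plain least squares is an assumption made for illustration; the paper's own estimator weights the tests and may differ. The helper names and the toy numbers are hypothetical.

```python
def transpose(a):
    return [list(col) for col in zip(*a)]


def matmul(a, b):
    bt = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in bt] for row in a]


def solve(a, b):
    """Solve a @ x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(a)
    aug = [list(ra) + list(rb) for ra, rb in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [vr - f * vc for vr, vc in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def estimate_c(d, x):
    """Ordinary least squares estimate of C from observations D and weights X."""
    xt = transpose(x)
    return solve(matmul(xt, x), matmul(xt, d))


# Toy case: N = 4 tests, n = 2 ideal profiles, m = 3 genes.
x = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0], [0.25, 0.75]]
c_true = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
d = matmul(x, c_true)      # noise-free observations, i.e. e = 0
c_hat = estimate_c(d, x)   # recovers c_true up to rounding
```

With e = 0 the estimate reproduces C up to rounding; with noisy D it returns the least-squares compromise across all N tests.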

12
Experiment: Periodic Cell-Cycles in Yeast
  • Question
  • Which genes are involved in cell reproduction in
    yeast?
  • Schizosaccharomyces pombe (fission yeast)
  • Nine experiments were designed in order to
    synchronize the cell cycles in yeast
  • centrifugal elutriation
  • cdc25 block-release
  • combinations of both methods
  • Microarrays taken every 15 minutes, for roughly
    two cell cycles (about 5.5 hours)

13
Experiment: Periodic Cell-Cycles in Yeast
  • Goal
  • Fuse these nine different experiments into one
    ideal
  • Result will be a set of microarray results for
    one cell-cycle
  • Problem
  • Different synchronization methods → different
    cell-cycles
  • Experiments are not exactly in phase with each
    other
  • Experiments result in different cell-cycle
    lengths

14
Experiment: Periodic Cell-Cycles in Yeast
  • These nine experiments produce N = 178 microarray
    tests
  • Each microarray test has m = 407 genes
  • Selected since they are identified as periodic in
    cell-cycle
  • 136 of these show significant changes during
    cycle
  • Define an ideal cell-cycle, divided into n = 10
    fusion times
  • Each microarray test will be at a different
    angle in the ideal cycle
  • The coefficients in X are chosen to weight the
    relevance of each microarray test to each of the
    fusion times

15
Experiment: Periodic Cell-Cycles in Yeast
  • How are the coefficients in X chosen?
  • Suppose microarray test h occurs at angle θh in the
    cell-cycle
  • Linear interpolation
  • Find the two fusion times on either side of θh
  • Weight each one according to how close it is
    to θh
  • The other fusion times for h have a zero weight
  • How is this done? We don't know θh!
  • Algorithm assumes initial weight values, then
    iteratively updates according to resulting
    generalization error
  • Authors claim convergence of these weights within
    3 or 4 iterations, but continue through 10
    iterations in their results for precision

See Gilks et al, Equation 7
See Gilks et al, Equation 6
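
The linear interpolation step can be sketched as follows, assuming the phase angle of test h is already known and the fusion times sit at evenly spaced angles starting from 0. The function name `interpolation_row` and that placement of the fusion times are assumptions for illustration; the paper's Equations 6 and 7 define the actual scheme, including how the unknown angles are updated.

```python
import math


def interpolation_row(theta, n_fusion=10):
    """Row of X for a test at phase angle theta (radians).

    Fusion times are assumed to sit at 0, 2*pi/n, 4*pi/n, ... Only the two
    fusion times bracketing theta receive a non-zero weight.
    """
    step = 2 * math.pi / n_fusion
    position = (theta % (2 * math.pi)) / step   # fractional fusion-time index
    lower = int(position) % n_fusion
    upper = (lower + 1) % n_fusion              # wraps around the cycle
    frac = position - int(position)
    row = [0.0] * n_fusion
    row[lower] += 1.0 - frac                    # the closer fusion time gets more weight
    row[upper] += frac
    return row


# A test a quarter of the way from fusion time 3 to fusion time 4.
row = interpolation_row(3.25 * 2 * math.pi / 10)
```

Here `row` has weight of about 0.75 at index 3, about 0.25 at index 4, and zero elsewhere, so each row of X sums to 1.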
16
Experiment: Periodic Cell-Cycles in Yeast
  • Now use the D = XC + e model to estimate C
  • But how do we know the answer we get is correct?
  • Need a technique to visualize the results!

17
Singular Value Decomposition (SVD)
  • A technique in linear algebra
  • Commonly used to solve systems of linear
    equations
  • Also used for linear least-squares problems, or
    curve fitting
  • The authors use SVD to find the two singular
    vectors of a matrix with the largest singular
    values, i.e. the directions of highest variation
  • i.e. the most variable components of a matrix
  • not part of the actual model, just used for
    visualization
  • Similar in purpose to PCA (Principal Components
    Analysis), which identifies the components with
    highest variance
  • For more information on SVD and PCA with
    bioinformatics applications, see Wall et al.
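
As a rough illustration of picking out the two directions of highest variation, the sketch below uses power iteration with deflation on A^T A rather than a full SVD routine. That swap is deliberate, to keep the example dependency-free; a library SVD would normally be used. The function name and the toy matrix are hypothetical, and the data are not mean-centred as PCA would require.

```python
import math


def leading_directions(rows, k=2, iters=200):
    """Approximate the k largest singular values and right singular vectors."""
    n_cols = len(rows[0])
    # Gram matrix A^T A: its eigenvectors are the right singular vectors of A.
    gram = [[sum(r[i] * r[j] for r in rows) for j in range(n_cols)]
            for i in range(n_cols)]
    found = []
    for _component in range(k):
        v = [1.0 / (i + 1) for i in range(n_cols)]   # arbitrary start vector
        for _step in range(iters):
            w = [sum(g * x for g, x in zip(gram_row, v)) for gram_row in gram]
            for _sigma, u in found:                  # deflate directions already found
                dot = sum(a * b for a, b in zip(w, u))
                w = [a - dot * b for a, b in zip(w, u)]
            norm = math.sqrt(sum(a * a for a in w))
            v = [a / norm for a in w]
        # Singular value = length of A @ v.
        sigma = math.sqrt(sum(sum(a * b for a, b in zip(r, v)) ** 2 for r in rows))
        found.append((sigma, v))
    return found


# Toy matrix whose singular values are 3, 2 and 1.
matrix = [[3.0, 0.0, 0.0], [0.0, 2.0, 0.0], [0.0, 0.0, 1.0], [0.0, 0.0, 0.0]]
result = leading_directions(matrix)
```

Projecting each gene onto the two returned vectors gives the 2-D coordinates that a plot of this kind needs.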

18
Gilks et al, Figure 1
19
Closeup of experiment cdc25-1
Ten fusion times are evenly spaced at π/5
radian intervals in the cycle.
Gilks et al, Figure 2
20
Peppered Fried Egg Plot
Fusion Times: fusion times are evenly spaced at intervals of
π/5 radians. Longer arrows indicate more
variability in gene expression levels at this
fusion time.
Specific Genes: the pepper represents the periodic activation
of particular genes. Larger radius from the
origin indicates more cell-cycle dependence.
Cell-Cycleness: the boundary of the yolk represents the average
radius from the origin of all genes, at each
point in the cell cycle.
Gene Density: the boundary of the egg white represents the
average gene density, at each point in the cell
cycle.
Gilks et al, Figure 4
21
Multivariate Regression Model
  • Possible difficulties with proposed algorithm?
  • Assumes linear relationships for simplicity of
    algorithm
  • Note the linear interpolation in our choice of X
    coefficients
  • Microarray tests which fail to cohere with the
    generality of results will be downweighted
    automatically, as part of the algorithm
  • In other words, the majority wins: what if the
    majority of experiments have been conducted
    poorly?
  • Difference in coverage over cell-cycle
  • Some parts of the cell-cycle have many
    contributors, others few
  • Treatment of missing data: KNN (K Nearest
    Neighbors)
  • However, these imputed data points have the
    same weight in the algorithm as the measured data
    points
  • Doesn't address some significant sources of
    error, such as gene-specific dye bias
  • Most microarray experiments use the same dyes,
    Cy3 and Cy5
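
For readers unfamiliar with KNN imputation, a minimal sketch is given below. It assumes missing values are marked with `None`, that similarity between genes is the mean squared difference over columns observed in both, and that each gap has at least one donor row. The function name, the choice of k and the data are hypothetical, and the paper's exact imputation settings are not reproduced here.

```python
def knn_impute(rows, k=2):
    """Fill None entries with the mean of the k most similar rows' values."""
    def distance(a, b):
        shared = [(x - y) ** 2 for x, y in zip(a, b)
                  if x is not None and y is not None]
        return sum(shared) / len(shared) if shared else float("inf")

    filled = [list(r) for r in rows]
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            if value is None:
                # Candidate donors: other rows with an observed value in column j.
                donors = sorted((distance(row, other), other[j])
                                for t, other in enumerate(rows)
                                if t != i and other[j] is not None)
                nearest = [v for _dist, v in donors[:k]]
                filled[i][j] = sum(nearest) / len(nearest)
    return filled


# Made-up expression values; the first gene is missing its third measurement.
genes = [
    [1.0, 2.0, None],
    [1.1, 2.1, 3.0],
    [0.9, 1.9, 3.2],
    [5.0, 6.0, 9.0],
]
imputed = knn_impute(genes)
```

The gap is filled from the two most similar genes and the dissimilar fourth gene is ignored. Once filled, nothing in the matrix marks the value as imputed, which is the concern raised above: it carries the same weight as a measured point unless the model is told otherwise.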

22
Project Proposal
  • Can we use different methods to obtain similar
    results?
  • SVM regression (Support Vector Machines)?
  • To model the ideal, noise-free microarray test at
    any point in cycle
  • ICA (Independent Components Analysis)?
  • Identify contributions from n different cell
    types and a noise component
  • Simulated Annealing (a stochastic optimization
    method)?
  • Identify the best cell-cycle synchronization
    points
  • Why SVM regression?
  • Ability to generalize from a low number of
    samples
  • Detect non-linear relationships (paper assumes
    linear!)
  • Why ICA?
  • Computationally complex, but requires no
    assumptions about underlying data or noise models
    (we don't need to know n!)

23
References
  • Primary Paper
  • W.R.Gilks, B.D.M.Tom, A.Brazma, Fusing
    microarray experiments with multivariate
    regression, Bioinformatics, 21(Suppl.
    2):137-143, 2005.
  • Experimental Procedures
  • G.Rustici, J.Mata, K.Kivinen, P.Lió, C.J.Penkett,
    G.Burns, J.Hayles, A.Brazma, P.Nurse, J.Bähler,
    Periodic gene expression program of the fission
    yeast cell cycle, Nature Genetics,
    36(8):809-817, 2004.
  • Microarrays
  • Wikipedia Contributors, DNA microarray,
    (http://en.wikipedia.org/wiki/CDNA_microarray),
    2006.
  • A.M.Campbell, DNA microarray methodology: Flash
    animation, Department of Biology, Davidson
    College, Davidson, NC,
    (http://www.bio.davidson.edu/Courses/genomics/chip/chipQ.html),
    2001.
  • C.A.Orengo, D.T.Jones, J.M.Thornton,
    Bioinformatics: Genes, Proteins and Computers, New
    York: Springer-Verlag, pp. 218-228, 2003.
  • Singular Value Decomposition (SVD)
  • W.H.Press, S.A.Teukolsky, W.T.Vetterling,
    B.P.Flannery, Numerical Recipes in C: The Art of
    Scientific Computing, 2nd ed., Cambridge
    University Press, pp. 59-70, 1992.
  • M.E.Wall, A.Rechtsteiner, L.M.Rocha, Singular
    value decomposition and principal component
    analysis, in A Practical Approach to Microarray
    Data Analysis, D.P.Berrar, W.Dubitzky, M.Granzow,
    eds., pp. 91-109, Kluwer: Norwell, MA, 2003.
  • Support Vector Machines (SVM)
  • K.P.Bennett, C.Campbell, Support vector
    machines: Hype or hallelujah?, SIGKDD
    Explorations, 2(2):1-13, 2000.
  • A.J.Smola, B.Schölkopf, A tutorial on support
    vector regression, Statistics and Computing,
    14(3):199-222, 2004.
  • V.N.Vapnik, The Nature of Statistical Learning
    Theory, 2nd ed., New York: Springer-Verlag, 1999.
  • Independent Components Analysis (ICA)
  • A.Hyvärinen, Survey on independent components
    analysis, Neural Computing Surveys, 2:94-128,
    1999.

24
Support Vector Machines
  • SVM use statistical machine learning
  • Constrained optimization problem
  • Objective: find a hyperplane which
    maximizes the margin
  • Higher dimensional mappings provide flexibility
  • Non-separable data: a tradeoff that allows
    misclassification of some points in order to improve
    generalization performance (cost parameter)
  • Non-linear SVM (Polynomial, Sigmoid, Gaussian
    kernels)
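
The kernels named above can be written out directly. The forms below are the common textbook definitions; the parameter names `gamma`, `coef0` and `degree` and their default values are assumptions chosen for illustration, not settings taken from the slides.

```python
import math


def linear_kernel(x, y):
    return sum(a * b for a, b in zip(x, y))


def polynomial_kernel(x, y, degree=2, coef0=1.0):
    # (x . y + c)^d
    return (linear_kernel(x, y) + coef0) ** degree


def gaussian_kernel(x, y, gamma=0.5):
    # exp(-gamma * ||x - y||^2); equals 1 when x == y
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)


def sigmoid_kernel(x, y, gamma=0.5, coef0=0.0):
    # tanh(gamma * x . y + c)
    return math.tanh(gamma * linear_kernel(x, y) + coef0)


k_poly = polynomial_kernel([1.0, 2.0], [3.0, 4.0])   # (11 + 1)^2
k_same = gaussian_kernel([1.0, 2.0], [1.0, 2.0])     # identical points
```

Each kernel replaces the plain dot product in the SVM optimization, which is how the higher dimensional mapping is obtained without ever computing it explicitly.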

25
Support Vector Machines
  • The importance of data normalization (centre and
    scale)
  • The importance of free-parameter selection
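
"Centre and scale" usually means subtracting each feature's mean and dividing by its standard deviation. A minimal sketch, assuming the population standard deviation is used and the feature is not constant (the function name and sample values are hypothetical):

```python
import math


def standardize(column):
    """Centre a feature column to mean 0 and scale it to unit variance."""
    mean = sum(column) / len(column)
    variance = sum((v - mean) ** 2 for v in column) / len(column)
    std = math.sqrt(variance)   # assumes the column is not constant
    return [(v - mean) / std for v in column]


# Made-up measurements on very different scales.
lengths_cm = standardize([5.1, 4.9, 6.3, 7.0])
widths_mm = standardize([35.0, 30.0, 33.0, 32.0])
```

After this step both features have the same spread, so distance-based kernels are no longer dominated by whichever feature has the larger units.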

Dataset from MLDB: Iris Plant Database
26
ε-Tube Support Vector Regression
Bennett et al, Figure 12
  • Can we use ε-SVR for outlier detection?
  • i.e. identify contributing samples which are
    outside the ε boundary, remove them, and retrain
    the model
  • Missing data: can we include the number of
    missing data points as another input variable for
    the SVM model?
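
The outlier idea can be sketched without training anything: given some fitted regression function, flag the samples that fall outside the ε-tube. The `predict` callable here is a stand-in for a trained model, and the function names and data are hypothetical.

```python
def epsilon_insensitive_loss(y_true, y_pred, epsilon):
    """Zero inside the tube, linear in the excess distance outside it."""
    return max(0.0, abs(y_true - y_pred) - epsilon)


def outside_tube(samples, predict, epsilon):
    """Return the (x, y) samples lying further than epsilon from the fit."""
    return [(x, y) for x, y in samples if abs(y - predict(x)) > epsilon]


# Made-up samples around the line y = x; one point is far off.
samples = [(0.0, 0.1), (1.0, 1.0), (2.0, 3.5), (3.0, 2.9)]
flagged = outside_tube(samples, predict=lambda x: x, epsilon=0.5)
```

The flagged samples would then be removed and the model retrained, as proposed above. Whether this improves the fit or simply discards genuine variation is the open question.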

27
Independent Components Analysis
  • ICA attempts to find the true underlying signals
    from multiple observations of a mix of signals
  • Finds signals which are as statistically
    independent from one another as possible: blind
    source separation
  • Different to PCA, which identifies the measured
    signals with highest variance
  • For example, consider a hypothetical political
    debate
  • Martin and Harper are speaking at the same time
  • two omnidirectional microphones listening to both
    speakers
  • ICA can isolate each speaker's voice!
  • For a demo: http://www2.ele.tue.nl/ica99/realworld.html
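
The two-microphone example corresponds to the standard linear ICA mixing model, written out here as an assumption since the slide does not state it. The symbols s (the two speakers' signals), x (the two microphone recordings), A (the unknown mixing matrix) and W (the estimated unmixing matrix) are introduced only for this illustration.

```latex
\mathbf{x}(t) = A\,\mathbf{s}(t), \qquad
\hat{\mathbf{s}}(t) = W\,\mathbf{x}(t), \qquad
W \approx A^{-1},
\qquad \mathbf{x}(t),\ \mathbf{s}(t) \in \mathbb{R}^{2}
```

ICA chooses W so that the components of the recovered signals are as statistically independent as possible. The sources can only be recovered up to their order and scale.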

28
Independent Components Analysis: Test
29
Independent Components Analysis: Samples
30
Independent Components Analysis: Results
31
Cell-cycle for Selected Genes
Gilks et al, Figure 5