Title: DNA Microarrays
1. DNA Microarrays
- Patrick Schmid
- CSE 497
- Spring 2004
2. What is a DNA Microarray?
- Also known as a DNA chip
- Allows simultaneous measurement of the level of transcription for every gene in a genome (gene expression)
- Transcription?
- Process of copying DNA into messenger RNA (mRNA)
- Environment dependent!
- Microarray detects mRNA, or rather the more stable cDNA
3. What is a DNA Microarray? (cont.)
Cheung et al., 1999
4. How do we manufacture a microarray?
- Start with individual genes, e.g. the 6,200 genes of the yeast genome
- Amplify all of them using polymerase chain reaction (PCR)
- Spot them on a medium, e.g. an ordinary glass microscope slide
- Each spot is about 100 µm in diameter
- Spotting is done by a robot
- Complex and potentially expensive task
5. How do we manufacture a microarray? (cont.)
Cheung et al., 1999
6. Example
- Remember the flash animation?
- Yeast
- Grow in aerobic and anaerobic environments
- Different genes will be activated in order to adapt to each environment
- Extract mRNA
- Convert mRNA into colored cDNA (fluorescently labeled)
7. Example (cont.)
- Mix cDNA together
- Hybridize cDNA with array
- Each cDNA sequence hybridizes specifically with the corresponding gene sequence in the array
- Wash unhybridized cDNA off
- Read array with laser
- Analyze images
8. Overview of Example
Brown & Botstein, 1999
9. Reading an array
- Laser scans array and produces images
- One laser for each color, e.g. one for green, one for red
- Image analysis, main tasks:
- Noise suppression
- Spot localization and detection, including the extraction of the background intensity, the spot position, and the spot boundary and size
- Data quantification and quality assessment
- Image analysis is a book on its own:
- Kamberova, G. & Shah, S. DNA Array Image Analysis: Nuts & Bolts. DNA Press LLC, 2002
10. Reading an array (cont.)
Block Column Row Gene Name Red Green Red/Green Ratio
1 1 1 tub1 2,345 2,467 0.95
1 1 2 tub2 3,589 2,158 1.66
1 1 3 sec1 4,109 1,469 2.80
1 1 4 sec2 1,500 3,589 0.42
1 1 5 sec3 1,246 1,258 0.99
1 1 6 act1 1,937 2,104 0.92
1 1 7 act2 2,561 1,562 1.64
1 1 8 fus1 2,962 3,012 0.98
1 1 9 idp2 3,585 1,209 2.97
1 1 10 idp1 2,796 1,005 2.78
1 1 11 idh1 2,170 4,245 0.51
1 1 12 idh2 1,896 2,996 0.63
1 1 13 erd1 1,023 3,354 0.31
1 1 14 erd2 1,698 2,896 0.59
Campbell & Heyer, 2003
11. Real DNA Microarray
Campbell & Heyer, 2003
12. Y-fold
- Biologists would rather deal with folds than with ratios
- An X-fold change is nothing other than "X times"
- We express it either as a Y-fold repression or a Y-fold induction
- For repression, the fold is calculated by taking the inverse of the ratio
- Ratio of 0.33 = 3-fold repression
- Ratio of 10 = 10-fold induction
- Fractional ratios can cause problems with techniques for analyzing and comparing gene expression patterns
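As a quick illustration, a minimal Python sketch of the ratio-to-fold conversion (the function name and structure are my own, not from the slides):

```python
def ratio_to_fold(ratio):
    """Convert an expression ratio to a Y-fold change.

    Ratios below 1 are repressions: the fold is the inverse of the ratio.
    Ratios of 1 or above are inductions: the fold is the ratio itself.
    """
    if ratio < 1:
        return 1 / ratio, "repression"
    return ratio, "induction"

print(ratio_to_fold(0.33))  # (3.03..., 'repression'): about a 3-fold repression
print(ratio_to_fold(10))    # (10, 'induction'): a 10-fold induction
```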
13. Color Coding
- Tables are difficult to read
- Data is presented with a color scale
- Coding scheme:
- Green = repressed (less mRNA) gene in experiment
- Red = induced (more mRNA) gene in experiment
- Black = no change (1:1 ratio)
- Or:
- Green = control condition (e.g. aerobic)
- Red = experimental condition (e.g. anaerobic)
- We only use the ratio
Campbell & Heyer, 2003
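A toy Python sketch of the first coding scheme (the mapping and the saturation point are my own choices; real array viewers scale intensities more carefully):

```python
import math

def ratio_to_rgb(ratio, saturate_at=4.0):
    """Map an expression ratio to an (R, G, B) color.

    Induction (ratio > 1) shades red, repression (ratio < 1) shades green,
    and no change (ratio = 1) is black. Colors saturate at |log2| >= 4.
    """
    v = max(-saturate_at, min(saturate_at, math.log2(ratio)))
    intensity = int(255 * abs(v) / saturate_at)
    return (intensity, 0, 0) if v > 0 else (0, intensity, 0)

print(ratio_to_rgb(1.0))     # (0, 0, 0)    black: no change
print(ratio_to_rgb(16))      # (255, 0, 0)  fully saturated red: induced
print(ratio_to_rgb(0.0625))  # (0, 255, 0)  fully saturated green: repressed
```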
14. Logarithmic transformation
- log2 is commonly used
- Sometimes log10 is used
- Example:
- log2(0.0625) = log2(1/16) = log2(1) - log2(16) = -log2(16) = -4
- log2 transformations ease identification of doublings or halvings in ratios
- log10 transformations ease identification of order-of-magnitude changes
- Key attribute: equally sized inductions and repressions receive equal treatment, visually and mathematically
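A short Python sketch of why log2 is convenient (standard library only): a doubling and a halving land the same distance from zero:

```python
import math

# example ratios: strong repression, halving, no change, doubling, strong induction
for ratio in (0.0625, 0.5, 1, 2, 16):
    print(f"ratio {ratio:>7} -> log2 = {math.log2(ratio):+.2f}")

# ratio  0.0625 -> log2 = -4.00   (16-fold repression)
# ratio     0.5 -> log2 = -1.00
# ratio       1 -> log2 = +0.00   (no change)
# ratio       2 -> log2 = +1.00
# ratio      16 -> log2 = +4.00   (16-fold induction)
```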
15. Complication: Time Series
- Biologists care more about the process of adaptation than about the end result
- For example, measure every 2 hours for 10 hours (depletion of oxygen)
- 31,000 gene expression ratios
- Or 6,200 different graphs with five data points each
- Question: Are there any genes that responded in similar ways to the depletion of oxygen?
16. Example data: fold change (ratios)
Name 0 hours 2 hours 4 hours 6 hours 8 hours 10 hours
Gene C 1 8 12 16 12 8
Gene D 1 3 4 4 3 2
Gene E 1 4 8 8 8 8
Gene F 1 1 1 0.25 0.25 0.1
Gene G 1 2 3 4 3 2
Gene H 1 0.5 0.33 0.25 0.33 0.5
Gene I 1 4 8 4 1 0.5
Gene J 1 2 1 2 1 2
Gene K 1 1 1 1 3 3
Gene L 1 2 3 4 3 2
Gene M 1 0.33 0.25 0.25 0.33 0.5
Gene N 1 0.125 0.0833 0.0625 0.0833 0.125
What is the pattern?
Campbell & Heyer, 2003
17. Example data: log2 transformation
Name 0 hours 2 hours 4 hours 6 hours 8 hours 10 hours
Gene C 0 3 3.58 4 3.58 3
Gene D 0 1.58 2 2 1.58 1
Gene E 0 2 3 3 3 3
Gene F 0 0 0 -2 -2 -3.32
Gene G 0 1 1.58 2 1.58 1
Gene H 0 -1 -1.60 -2 -1.60 -1
Gene I 0 2 3 2 0 -1
Gene J 0 1 0 1 0 1
Gene K 0 0 0 0 1.58 1.58
Gene L 0 1 1.58 2 1.58 1
Gene M 0 -1.60 -2 -2 -1.60 -1
Gene N 0 -3 -3.59 -4 -3.59 -3
Campbell & Heyer, 2003
18. Pearson Correlation Coefficient r
- Gene expression over time is a vector, e.g. for gene C: (0, 3, 3.58, 4, 3.58, 3)
- Given two vectors X and Y that contain N elements, we calculate r as follows:
- r = (ΣXY - ΣXΣY/N) / sqrt((ΣX² - (ΣX)²/N)(ΣY² - (ΣY)²/N))
Cho & Won, 2003
19. Pearson Correlation Coefficient r (cont.)
- X = Gene C = (0, 3.00, 3.58, 4, 3.58, 3); Y = Gene D = (0, 1.58, 2.00, 2, 1.58, 1)
- ΣXY = (0)(0) + (3)(1.58) + (3.58)(2) + (4)(2) + (3.58)(1.58) + (3)(1) = 28.5564
- ΣX = 3 + 3.58 + 4 + 3.58 + 3 = 17.16
- ΣX² = 3² + 3.58² + 4² + 3.58² + 3² = 59.6328
- ΣY = 1.58 + 2 + 2 + 1.58 + 1 = 8.16
- ΣY² = 1.58² + 2² + 2² + 1.58² + 1² = 13.9928
- N = 6
- ΣXY - ΣXΣY/N = 28.5564 - (17.16)(8.16)/6 = 5.2188
- ΣX² - (ΣX)²/N = 59.6328 - (17.16)²/6 = 10.5552
- ΣY² - (ΣY)²/N = 13.9928 - (8.16)²/6 = 2.8952
- r = 5.2188 / sqrt((10.5552)(2.8952)) = 0.944
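The whole computation fits in a few lines of Python (a minimal standard-library sketch; it reproduces the r = 0.944 above):

```python
from math import sqrt

def pearson_r(x, y):
    """Computational form of the Pearson correlation coefficient:
    r = (ΣXY - ΣXΣY/N) / sqrt((ΣX² - (ΣX)²/N)(ΣY² - (ΣY)²/N))
    """
    n = len(x)
    sx, sy = sum(x), sum(y)
    sxx = sum(v * v for v in x)
    syy = sum(v * v for v in y)
    sxy = sum(a * b for a, b in zip(x, y))
    return (sxy - sx * sy / n) / sqrt((sxx - sx**2 / n) * (syy - sy**2 / n))

gene_c = [0, 3.00, 3.58, 4, 3.58, 3]
gene_d = [0, 1.58, 2.00, 2, 1.58, 1]
print(round(pearson_r(gene_c, gene_d), 3))  # 0.944
```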
20. Example data: Pearson correlation coefficients
Gene C Gene D Gene E Gene F Gene G Gene H Gene I Gene J Gene K Gene L Gene M Gene N
Gene C 1 0.94 0.96 -0.40 0.95 -0.95 0.41 0.36 0.23 0.95 -0.94 -1
Gene D 0.94 1 0.84 -0.10 0.94 -0.94 0.68 0.24 -0.07 0.94 -1 -0.94
Gene E 0.96 0.84 1 -0.57 0.89 -0.89 0.21 0.30 0.43 0.89 -0.84 -0.96
Gene F -0.40 -0.10 -0.57 1 -0.35 0.35 0.60 -0.43 -0.79 -0.35 0.10 0.40
Gene G 0.95 0.94 0.89 -0.35 1 -1 0.48 0.22 0.11 1 -0.94 -0.95
Gene H -0.95 -0.94 -0.89 0.35 -1 1 -0.48 -0.21 -0.11 -1 0.94 0.95
Gene I 0.41 0.68 0.21 0.60 0.48 -0.48 1 0 -0.75 0.48 -0.68 -0.41
Gene J 0.36 0.24 0.30 -0.43 0.22 -0.21 0 1 0 0.22 -0.24 -0.36
Gene K 0.23 -0.07 0.43 -0.79 0.11 -0.11 -0.75 0 1 0.11 0.07 -0.23
Gene L 0.95 0.94 0.89 -0.35 1 -1 0.48 0.22 0.11 1 -0.94 -0.95
Gene M -0.94 -1 -0.84 0.10 -0.94 0.94 -0.68 -0.24 0.07 -0.94 1 0.94
Gene N -1 -0.94 -0.96 0.40 -0.95 0.95 -0.41 -0.36 -0.23 -0.95 0.94 1
Campbell & Heyer, 2003
21. Example: Reorganization of data
Name 0 hours 2 hours 4 hours 6 hours 8 hours 10 hours
Gene M 1 0.33 0.25 0.25 0.33 0.5
Gene N 1 0.125 0.0833 0.0625 0.0833 0.125
Gene H 1 0.5 0.33 0.25 0.33 0.5
Gene K 1 1 1 1 3 3
Gene J 1 2 1 2 1 2
Gene E 1 4 8 8 8 8
Gene C 1 8 12 16 12 8
Gene L 1 2 3 4 3 2
Gene G 1 2 3 4 3 2
Gene D 1 3 4 4 3 2
Gene I 1 4 8 4 1 0.5
Gene F 1 1 1 0.25 0.25 0.1
Campbell & Heyer, 2003
22. Clustering of example
Campbell & Heyer, 2003
23. Clustering of entire yeast genome
Campbell & Heyer, 2003
24. Hierarchical Clustering
- Algorithm:
- "First, find the two most similar genes in the entire set of genes. Join these together into a cluster. Now join the next two most similar objects (an object can be a gene or a cluster), forming a new cluster. Add the new cluster to the list of available objects, and remove the two objects used to form the new cluster. Continue this process, joining objects in the order of their similarity to one another, until there is only one object on the list: a single cluster containing all genes." (Campbell & Heyer, 2003)
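A minimal Python sketch of the quoted procedure (all names are my own): a similarity table keyed by pairs of objects, averaging the merged pair's similarities to everything else, which matches the hand-traced example on the following slides:

```python
def agglomerate(sim, objects):
    """Hierarchical clustering over sim = {frozenset({a, b}): similarity}.

    a and b are cluster labels (gene names or nested tuples of them).
    Repeatedly joins the most similar pair of objects, averaging the
    merged pair's similarities to every remaining object.
    """
    while len(objects) > 1:
        best = max(sim, key=sim.get)             # most similar pair
        a, b = tuple(best)
        merged = (a, b)                          # label for the new cluster
        objects -= {a, b}
        for other in objects:
            sim[frozenset({merged, other})] = (
                sim.pop(frozenset({a, other})) +
                sim.pop(frozenset({b, other}))) / 2
        del sim[best]
        objects.add(merged)
    return objects.pop()                         # the single remaining cluster

# the 5-gene correlations from the walk-through below (slide 26)
corr = {("C", "D"): 0.94, ("C", "E"): 0.96, ("C", "F"): -0.40,
        ("C", "G"): 0.95, ("D", "E"): 0.84, ("D", "F"): -0.10,
        ("D", "G"): 0.94, ("E", "F"): -0.57, ("E", "G"): 0.89,
        ("F", "G"): -0.35}
sim = {frozenset(pair): s for pair, s in corr.items()}
print(agglomerate(sim, {"C", "D", "E", "F", "G"}))
# ((('C', 'E'), ('D', 'G')), 'F'), up to sibling order
```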
25. Hierarchical Clustering (cont.)
(Upper triangle of the correlation matrix from slide 20; each row's entries begin in the column after the diagonal.)
Gene C Gene D Gene E Gene F Gene G Gene H Gene I Gene J Gene K Gene L Gene M Gene N
Gene C 0.94 0.96 -0.40 0.95 -0.95 0.41 0.36 0.23 0.95 -0.94 -1
Gene D 0.84 -0.10 0.94 -0.94 0.68 0.24 -0.07 0.94 -1 -0.94
Gene E -0.57 0.89 -0.89 0.21 0.30 0.43 0.89 -0.84 -0.96
Gene F -0.35 0.35 0.60 -0.43 -0.79 -0.35 0.10 0.40
Gene G -1 0.48 0.22 0.11 1 -0.94 -0.95
Gene H -0.48 -0.21 -0.11 -1 0.94 0.95
Gene I 0 -0.75 0.48 -0.68 -0.41
Gene J 0 0.22 -0.24 -0.36
Gene K 0.11 0.07 -0.23
Gene L -0.94 -0.95
Gene M 0.94
Gene N
Campbell & Heyer, 2003
26. Hierarchical Clustering (cont.)
       Gene C  Gene D  Gene E  Gene F  Gene G
Gene C         0.94    0.96    -0.40   0.95
Gene D                 0.84    -0.10   0.94
Gene E                         -0.57   0.89
Gene F                                 -0.35
Gene G
- Most similar pair: Gene C and Gene E (0.96); join them into cluster 1
       1       Gene D  Gene F  Gene G
1              0.89    -0.485  0.92
Gene D                 -0.10   0.94
Gene F                         -0.35
Gene G
- Average observations:
- Gene D: (0.94 + 0.84)/2 = 0.89
- Gene F: (-0.40 + (-0.57))/2 = -0.485
- Gene G: (0.95 + 0.89)/2 = 0.92
[Dendrogram: C and E joined under node 1]
27. Hierarchical Clustering (cont.)
       1       Gene D  Gene F  Gene G
1              0.89    -0.485  0.92
Gene D                 -0.10   0.94
Gene F                         -0.35
Gene G
- Most similar pair: Gene D and Gene G (0.94); join them into cluster 2
[Dendrogram: C and E under node 1; G and D under node 2]
28. Hierarchical Clustering (cont.)
       1       2       Gene F
1              0.905   -0.485
2                      -0.225
Gene F
- Most similar pair: clusters 1 and 2 (0.905); join them into cluster 3
[Dendrogram: nodes 1 and 2 joined under node 3; Gene F still separate]
29. Hierarchical Clustering (cont.)
       3       Gene F
3              -0.355
Gene F
- Final join: cluster 3 and Gene F (-0.355) form cluster 4, completing the tree
[Dendrogram: node 4 joins node 3, i.e. ((C, E), (G, D)), with Gene F]
30Hierarchical Clustering (cont.)
Did this algorithm not look familiar?
4
Remember Neighbor-Joining?
3
1
2
F
C
E
G
D
31. Hierarchical Clustering (cont.)
Eisen et al., 1998
32. Hierarchical Clustering (cont.)
- We differentiate hierarchical clustering algorithms by how they agglomerate distances
- Single Linkage
- Shortest link between two clusters
- Complete Linkage
- Longest link between two clusters
- Average Linkage
- Average of distances between all pairs of objects
- Average Group Linkage
- Groups, once formed, are represented by their mean values, and those means are then averaged
- Which one did we use in the previous example?
http://www.resample.com/xlminer/help/HClst/HClst_intro.htm
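These linkage variants are also available off the shelf; a sketch using SciPy (assuming numpy and scipy are installed), where d = 1 - r turns the correlations into distances. With method="average" it reproduces the merge order of the worked example, which answers the question above:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import squareform

# correlation matrix for genes C, D, E, F, G (slide 26)
r = np.array([[ 1.00,  0.94,  0.96, -0.40,  0.95],
              [ 0.94,  1.00,  0.84, -0.10,  0.94],
              [ 0.96,  0.84,  1.00, -0.57,  0.89],
              [-0.40, -0.10, -0.57,  1.00, -0.35],
              [ 0.95,  0.94,  0.89, -0.35,  1.00]])

dist = squareform(1 - r, checks=False)   # condensed distance vector, d = 1 - r
for method in ("single", "complete", "average"):
    print(method)
    print(linkage(dist, method=method))  # one merge per row: ids, height, size
```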
33. Clustering Overview
- Different similarity measures:
- Pearson Correlation Coefficient
- Cosine Coefficient
- Euclidean Distance
- Information Gain
- Mutual Information
- Signal-to-noise ratio
- Simple Matching for Nominals
34. Clustering Overview (cont.)
- Different Clustering Methods:
- Unsupervised:
- Hierarchical Clustering
- k-means Clustering
- Thursday
- Self-organizing map
- Thursday
- Supervised:
- Support vector machine
- Ensemble classifier
- Data Mining
35. Support Vector Machines
- Linear regression:
- x = w0 + w1a1 + w2a2 + ... + wkak
- x is the class, the ai are the attribute values, and the wj are the weights
- Given a distance vector Y with distances ai, to which class x does Y belong?
- What do we mean by a class x?
- Primitive method: Y is in one class if x < 0.5, in the other class if x >= 0.5
36. Support Vector Machines (cont.)
- Multi-response linear regression:
- Set output to 1 for training instances that belong to a class
- Set output to 0 for training instances that do not belong to that class
- Result is a linear expression for each class
- Classification of an unknown example:
- Compute all linear expressions
- Choose the one that gives the largest output value
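A minimal numpy sketch of this scheme (the toy data and all names are my own): fit one least-squares expression per class against 0/1 targets, then let the largest output win:

```python
import numpy as np

# toy training set: rows are instances, columns are attribute values a1, a2
A = np.array([[0.2, 1.1], [0.4, 0.9], [1.8, 0.1], [2.1, 0.3]])
labels = np.array([0, 0, 1, 1])              # class of each training instance

X = np.hstack([np.ones((len(A), 1)), A])     # constant column so w0 is learned

weights = []
for c in (0, 1):
    target = (labels == c).astype(float)     # 1 if instance belongs to class c
    w, *_ = np.linalg.lstsq(X, target, rcond=None)
    weights.append(w)                        # one linear expression per class

def classify(a):
    """Evaluate every class's linear expression; the largest output wins."""
    x = np.concatenate([[1.0], a])
    return int(np.argmax([w @ x for w in weights]))

print(classify(np.array([0.3, 1.0])))        # -> 0
print(classify(np.array([2.0, 0.2])))        # -> 1
```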
37. Support Vector Machines (cont.)
- This means:
- Two classes
- Weight vector for class 1:
- w0(1) + w1(1)a1 + w2(1)a2 + ... + wk(1)ak
- Weight vector for class 2:
- w0(2) + w1(2)a1 + w2(2)a2 + ... + wk(2)ak
- An instance will be assigned to class 1 rather than class 2 if
- w0(1) + w1(1)a1 + w2(1)a2 + ... + wk(1)ak > w0(2) + w1(2)a1 + w2(2)a2 + ... + wk(2)ak
- We can rewrite this as
- (w0(1) - w0(2)) + (w1(1) - w1(2))a1 + ... + (wk(1) - wk(2))ak > 0
- Hyperplane
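A tiny numeric check (the weights are made up) that comparing the two expressions is the same as testing the sign of the difference hyperplane:

```python
w1 = [0.2, 1.5, -0.4]                # (w0, w1, w2) for class 1, hypothetical
w2 = [0.8, 0.3,  0.9]                # (w0, w1, w2) for class 2, hypothetical
a  = [2.0, 0.5]                      # a test instance (a1, a2)

out1 = w1[0] + w1[1] * a[0] + w1[2] * a[1]
out2 = w2[0] + w2[1] * a[0] + w2[2] * a[1]

d = [u - v for u, v in zip(w1, w2)]  # coefficients of the difference hyperplane
side = d[0] + d[1] * a[0] + d[2] * a[1]

print(out1 > out2, side > 0)         # True True: both tests agree
```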
38. Support Vector Machines (cont.)
- We can only represent linear boundaries between classes so far
- Trick: Transform the input using a nonlinear mapping, then construct a linear model in the new space
- Example: Use all products of n factors (2 attributes, n = 3):
- x = w1a1³ + w2a1²a2 + w3a1a2² + w4a2³
- Then use multi-response linear regression
- However, for 10 attributes, including all products with 5 factors, we would need to determine more than 2,000 coefficients
- Linear regression is O(n³) in time
- Problem: Training is infeasible
- Another problem: Overfitting. The resulting model will be too nonlinear, because there are just too many parameters in the model.
39. Support Vector Machines (cont.)
- Convex hull of the points in a class is the tightest enclosing polygon
- Maximum margin hyperplane
- Instances closest to the hyperplane are called support vectors
- Support vectors define the maximum margin hyperplane uniquely
[Figure: two classes, their convex hulls, and the maximum margin hyperplane with the support vectors marked]
Witten & Frank, 2000
40. Support Vector Machines (cont.)
- We only need the set of support vectors; everything else is irrelevant
- A hyperplane separating two classes can then be written as
- x = w0 + w1a1 + w2a2
- Or
- x = b + Σi αi yi (a(i) · a)
- The sum runs over the support vectors i
- yi is the class value of a(i)
- b and the αi are numeric values to be determined
- The vector a represents a test instance
- The a(i) are the support vectors
- Determining b and the αi is a constrained quadratic optimization problem that can be solved with off-the-shelf software packages
- Support Vector Machines do not overfit, because there are usually only a few support vectors
41. Support Vector Machines (cont.)
- Did I not introduce Support Vector Machines by talking about nonlinear class boundaries?
- x = b + Σi αi yi (a(i) · a)ⁿ
- n is the number of factors
- (x · y)ⁿ is called a polynomial kernel
- A good way of choosing n is to start with n = 1 and increment it until the estimated error ceases to improve
- If you want to know more:
- SVMs in general: Witten & Frank, 2000 (lecture material based on this)
- Application to cancer classification: Cho & Won, 2003
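A sketch of the kernelized decision function (the support vectors, αi, yi, and b below are made-up placeholders, not a trained model), just to show the shape of the computation:

```python
def poly_kernel(x, y, n):
    """Polynomial kernel: (x · y)^n."""
    return sum(a * b for a, b in zip(x, y)) ** n

def svm_output(a, svs, alphas, ys, b, n):
    """x = b + Σi αi yi (a(i) · a)^n, summed over the support vectors."""
    return b + sum(alpha * y * poly_kernel(sv, a, n)
                   for sv, alpha, y in zip(svs, alphas, ys))

# hypothetical "trained" model: two support vectors with classes +1 and -1
svs    = [[1.0, 2.0], [2.0, 0.5]]
alphas = [0.7, 0.7]
ys     = [+1, -1]
b      = 0.1

x = svm_output([1.2, 1.8], svs, alphas, ys, b, n=2)
print(x, "-> class", +1 if x > 0 else -1)    # positive output -> class +1
```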
42. Demo: Shneiderman
43. References
- Brown, P. & Botstein, D. Exploring the new world of the genome with DNA microarrays. Nature Genetics Supplement, vol. 21, January 1999.
- Campbell, A. & Heyer, L. Discovering Genomics, Proteomics, & Bioinformatics. Benjamin Cummings, 2003.
- Cheung, V., Morley, M., Aguilar, F., Massimi, A., Kucherlapati, R. & Childs, G. Making and reading microarrays. Nature Genetics Supplement, vol. 21, January 1999.
- Cho, S. & Won, H. Machine Learning in DNA Microarray Analysis for Cancer Classification. Proceedings of the First Asia-Pacific Bioinformatics Conference (APBC 2003), Volume 19, Australian Computer Society Inc., 2003.
- Eisen, M., Spellman, P., Brown, P. & Botstein, D. Cluster analysis and display of genome-wide expression patterns. Proc. Natl. Acad. Sci. USA, vol. 95, pp. 14863-14868, December 1998.
- Seo, J. & Shneiderman, B. Interactively Exploring Hierarchical Clustering Results. IEEE Computer, July 2002.
- Witten, I. & Frank, E. Data Mining. Morgan Kaufmann Publishers, 2000.