L16: Micro-array analysis
1
L16: Micro-array analysis
  • Dimension reduction
  • Unsupervised clustering

2
PCA motivating example
  • Consider the expression values of 2 genes over 6
    samples.
  • Clearly, the expression of g1 is not informative,
    and it suffices to look at g2 values.
  • Dimensionality can be reduced by discarding the
    gene g1.

(Figure: expression values of g1 and g2 across the
six samples)
3
PCA example 2
  • Consider the expression values of 2 genes over 6
    samples.
  • Clearly, the expression of the two genes is
    highly correlated.
  • Projecting all the points onto a single line
    could explain most of the data.

4
PCA
  • Suppose all of the data were to be reduced by
    projecting onto a single line φ through the
    mean m.
  • How do we select the line φ?

(Figure: a candidate line φ passing through the
mean m)
5
PCA contd
  • Let each point xk map to x'k = m + ak·φ. We want
    to minimize the error Σk ‖xk − x'k‖²
  • Observation 1: Each point xk maps to
    x'k = m + φT(xk − m)·φ
  • (i.e., ak = φT(xk − m))

(Figure: a point xk and its projection x'k onto the
line φ through m)
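A minimal numpy sketch of this projection, using made-up two-gene data (the direction φ here is an arbitrary unit vector, not yet the optimal one):

```python
import numpy as np

# Hypothetical data: rows are points x_k, columns are two genes.
X = np.array([[2.0, 1.0], [3.0, 2.1], [4.0, 2.9],
              [5.0, 4.2], [6.0, 5.1], [7.0, 5.9]])

m = X.mean(axis=0)                 # mean point m
phi = np.array([1.0, 1.0])
phi = phi / np.linalg.norm(phi)    # a candidate unit direction phi

a = (X - m) @ phi                  # a_k = phi^T (x_k - m)
X_proj = m + np.outer(a, phi)      # x'_k = m + a_k * phi
error = np.sum((X - X_proj) ** 2)  # total squared reconstruction error
print(error)
```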
6
Proof of Observation 1
The error contributed by point xk is
Ek = ‖xk − m − ak·φ‖².
Differentiating w.r.t. ak and setting to zero gives
dEk/dak = −2φT(xk − m) + 2ak·φTφ = 0.
Since φ is a unit vector (φTφ = 1), ak = φT(xk − m).
7
Minimizing PCA Error
  • To minimize the error, we must maximize φTSφ,
    where S is the covariance matrix of the data.
  • Maximizing φTSφ subject to φTφ = 1 gives
    Sφ = λφ, so λ = φTSφ is an eigenvalue of S and
    φ the corresponding eigenvector.
  • Therefore, we must choose the eigenvector
    corresponding to the largest eigenvalue.
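Spelled out as a standard Lagrange-multiplier step (a reconstruction, not taken verbatim from the slides):

```latex
\max_{\varphi}\ \varphi^{T} S \varphi
\quad \text{subject to} \quad \varphi^{T}\varphi = 1 .
\qquad
\frac{\partial}{\partial \varphi}
\left[ \varphi^{T} S \varphi
       - \lambda \left( \varphi^{T}\varphi - 1 \right) \right]
  = 2S\varphi - 2\lambda\varphi = 0
\;\Longrightarrow\;
S\varphi = \lambda\varphi ,
\quad
\varphi^{T} S \varphi = \lambda .
```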

8
PCA
  • The single best dimension is given by the
    eigenvector of the largest eigenvalue of S.
  • The best k dimensions are given by the
    eigenvectors φ1, φ2, …, φk corresponding to the
    k largest eigenvalues.
  • To obtain the k-dimensional representation, stack
    φ1T, …, φkT as the rows of BT and take BTM, where
    M is the mean-centered data matrix.
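A minimal numpy sketch of the whole procedure, on toy random data (a plain eigendecomposition; production code would typically use an SVD instead):

```python
import numpy as np

def pca(X, k):
    """Project the rows of X onto its top-k principal components."""
    m = X.mean(axis=0)
    Xc = X - m                            # mean-center the data
    S = np.cov(Xc, rowvar=False)          # covariance matrix S
    eigvals, eigvecs = np.linalg.eigh(S)  # eigh since S is symmetric
    order = np.argsort(eigvals)[::-1]     # largest eigenvalues first
    B = eigvecs[:, order[:k]]             # columns are phi_1 ... phi_k
    return Xc @ B                         # per-point coordinates (B^T M, transposed)

X = np.random.default_rng(0).normal(size=(6, 5))  # toy data: 6 samples, 5 genes
print(pca(X, 2).shape)                    # -> (6, 2)
```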
9
Clustering
  • Suppose we are not given any classes.
  • Instead, we are asked to partition the samples
    into clusters that make sense.
  • Alternatively, partition genes into clusters.
  • Clustering is part of unsupervised learning

10
Microarray Data
  • Microarray data are usually transformed into an
    intensity matrix (below).
  • The intensity matrix allows biologists to find
    correlations between different genes (even if
    they are dissimilar) and to understand how gene
    functions might be related.
  • This is where clustering comes into play.



           Time 1   Time i   Time N
  Gene 1     10       8        10
  Gene 2     10       0         9
  Gene 3      4       8.6       3
  Gene 4      7       8         3
  Gene 5      1       2         3

Intensity (expression level) of each gene at each
measured time point
11
Clustering of Microarray Data
  • Plot each gene as a point in N-dimensional space
  • Make a distance matrix for the distance between
    every two gene points in the N-dimensional space
  • Genes with a small distance share the same
    expression characteristics and might be
    functionally related or similar
  • Clustering reveals groups of functionally related
    genes
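As a concrete sketch, the distance matrix for the intensity matrix above can be computed like this (Euclidean distance is assumed here; other base metrics are discussed below):

```python
import numpy as np

# The intensity matrix from the slide: rows are genes, columns time points.
M = np.array([
    [10, 8.0, 10],
    [10, 0.0,  9],
    [ 4, 8.6,  3],
    [ 7, 8.0,  3],
    [ 1, 2.0,  3],
])

# d[i, j] = Euclidean distance between gene i and gene j in N-dim space.
diff = M[:, None, :] - M[None, :, :]
d = np.sqrt((diff ** 2).sum(axis=-1))
print(np.round(d, 1))
```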

12
Graphing the intensity matrix in multi-dimensional
space
13
The Distance Matrix, d
14
Homogeneity and Separation Principles
  • Homogeneity: elements within a cluster are close
    to each other
  • Separation: elements in different clusters are
    far apart from each other
  • Clustering is not an easy task!

Given these points, a clustering algorithm might
make two distinct clusters as follows
15
Bad Clustering
This clustering violates both the homogeneity and
separation principles:
  • points in separate clusters are close together
  • points in the same cluster are far apart
16
Good Clustering
This clustering satisfies both the homogeneity and
separation principles
17
Clustering Techniques
  • Agglomerative: start with every element in its
    own cluster, and iteratively join clusters
    together
  • Divisive: start with one cluster and iteratively
    divide it into smaller clusters
  • Hierarchical: organize elements into a tree.
    Leaves represent genes, and the length of the
    path between two leaves represents the distance
    between those genes. Similar genes lie within
    the same subtrees.

18
Hierarchical Clustering
  • Initially, each element is its own cluster
  • Merge the two closest clusters, and recurse
  • Key question: what does "closest" mean?
  • How do you compute the distance between clusters?

19
Hierarchical Clustering Computing Distances
  • dmin(C, C') = min d(x,y) over all elements
    x in C and y in C'
  • Distance between two clusters is the smallest
    distance between any pair of their elements
  • davg(C, C') = (1 / (|C|·|C'|)) Σ d(x,y) over all
    elements x in C and y in C'
  • Distance between two clusters is the average
    distance between all pairs of their elements
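A minimal sketch of agglomerative clustering with these two linkage rules, assuming a precomputed distance matrix d like the one computed earlier (function names are illustrative, and the O(n³) loop is for clarity, not speed):

```python
import numpy as np
from itertools import combinations

def d_min(C1, C2, d):
    """Smallest distance between any pair of elements (single linkage)."""
    return min(d[x, y] for x in C1 for y in C2)

def d_avg(C1, C2, d):
    """Average distance over all pairs of elements (average linkage)."""
    return sum(d[x, y] for x in C1 for y in C2) / (len(C1) * len(C2))

def agglomerate(d, n_clusters, linkage=d_min):
    """Repeatedly merge the two closest clusters until n_clusters remain."""
    clusters = [[i] for i in range(d.shape[0])]   # each element starts alone
    while len(clusters) > n_clusters:
        i, j = min(combinations(range(len(clusters)), 2),
                   key=lambda p: linkage(clusters[p[0]], clusters[p[1]], d))
        clusters[i] += clusters.pop(j)            # merge cluster j into i
    return clusters
```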

20
Computing Distances (continued)
  • However, we still need a base distance metric
    for pairs of genes
  • Euclidean distance
  • Manhattan distance
  • Dot Product
  • Mutual information

What are some qualitative differences between
these?
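Before comparing them, here is a sketch of each metric for a pair of expression vectors x and y (the dot product is shown in its normalized, cosine form, and the mutual-information estimate uses simple histogram binning, one common choice; these are illustrative implementations, not the slides' own):

```python
import numpy as np

def euclidean(x, y):
    return np.sqrt(np.sum((x - y) ** 2))

def manhattan(x, y):
    return np.sum(np.abs(x - y))

def dot_product(x, y):
    # normalized dot product (cosine): compares shape, not magnitude
    return np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

def mutual_information(x, y, bins=5):
    # crude histogram-based estimate of I(X;Y), in nats
    pxy, _, _ = np.histogram2d(x, y, bins=bins)
    pxy = pxy / pxy.sum()
    px, py = pxy.sum(axis=1), pxy.sum(axis=0)
    nz = pxy > 0                                  # avoid log(0)
    return np.sum(pxy[nz] * np.log(pxy[nz] / np.outer(px, py)[nz]))
```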
21
Geometrical interpretation of distances
  • The distance measures are all related.
  • In some cases, the magnitude of the vector is
    important, in other cases it is not.

22
Comparison between metrics
  • Euclidean and Manhattan tend to perform similarly
    and emphasize the overall magnitude of
    expression.
  • The dot-product is very useful if the shape of
    the expression vector is more important than its
    magnitude.
  • The above metrics are less useful for identifying
    genes whose expression levels are
    anti-correlated. One might imagine an instance in
    which the same transcription factor causes
    both enhancement and repression of expression.
    In this case, the squared correlation (r²) or
    mutual information is sometimes used.
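A tiny worked example (made-up vectors) of why r² catches anti-correlation where the other metrics do not:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = -x + 6.0                           # perfectly anti-correlated with x

r = np.corrcoef(x, y)[0, 1]
print(r, r ** 2)                       # r = -1.0 but r^2 = 1.0: the
                                       # relationship is fully captured
xc, yc = x - x.mean(), y - y.mean()
print(np.dot(xc, yc))                  # centered dot product is -10.0,
                                       # which a similarity search ranks last
```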

24
But how many orderings can we have?
(Figure: a clustering tree whose leaves, read left
to right, give the ordering 1, 2, 4, 5, 3)
25
  • For n leaves there are n−1 internal nodes
  • Each flip of an internal node creates a new
    linear ordering of the leaves
  • There are therefore 2^(n−1) orderings

E.g., flipping an internal node reverses the order
of the leaves beneath it.

(Figure: the same tree after one flip, giving a
different ordering of leaves 1, 2, 4, 5, 3)
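A small sketch that enumerates these orderings by recursively flipping internal nodes (the tuple tree encoding is an assumption for illustration):

```python
def orderings(tree):
    """All leaf orderings reachable by flipping internal nodes.

    A tree is a leaf label or a (left, right) tuple; for n leaves
    this yields 2^(n-1) orderings.
    """
    if not isinstance(tree, tuple):            # a single leaf
        return [[tree]]
    result = []
    for left in orderings(tree[0]):
        for right in orderings(tree[1]):
            result.append(left + right)        # node kept as-is
            result.append(right + left)        # node flipped
    return result

tree = (((1, 2), (4, 5)), 3)                   # 5 leaves, 4 internal nodes
print(len(orderings(tree)))                    # -> 16 = 2^(5-1)
```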
30
Bar-Joseph et al., Bioinformatics (2001)
31
Computing an Optimal Ordering
  • Define LT(u,v) as the optimum score over all
    orderings of the subtree rooted at T in which
  • u is the leftmost leaf, and
  • v is the rightmost leaf
  • Is it sufficient to compute LT(u,v) for all
    T, u, v?

(Figure: subtree T with leftmost leaf u and
rightmost leaf v)
32
(Figure: T splits into subtrees T1 and T2; u and k
are the leftmost and rightmost leaves of T1's
ordering, m and v the leftmost and rightmost leaves
of T2's)

LT(u,v) = max over k, m of
          LT1(u,k) + LT2(m,v) + C(k,m)

where C(k,m) is the score for making leaves k and m
adjacent.
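A direct, un-memoized sketch of this recursion (tree encoding as in the earlier snippet; C is assumed to be a lookup of adjacency scores on leaf pairs; with memoization and the LCA observation on the next slide this becomes the O(n⁴) algorithm):

```python
def leaves(T):
    """All leaf labels of T (a leaf label or a (left, right) tuple)."""
    return [T] if not isinstance(T, tuple) else leaves(T[0]) + leaves(T[1])

def L(T, u, v, C):
    """Optimal score over orderings of T with leftmost leaf u and
    rightmost leaf v; C[k, m] is the score for making k and m adjacent."""
    if not isinstance(T, tuple):
        return 0.0 if u == v == T else float("-inf")
    best = float("-inf")
    for T1, T2 in ((T[0], T[1]), (T[1], T[0])):  # the node itself may flip
        if u in leaves(T1) and v in leaves(T2):
            for k in leaves(T1):                 # rightmost leaf of T1's part
                for m in leaves(T2):             # leftmost leaf of T2's part
                    s = L(T1, u, k, C) + L(T2, m, v, C) + C[k, m]
                    best = max(best, s)
    return best

# Toy score: adjacent integer labels score 1, everything else 0.
C = {(a, b): 1.0 if abs(a - b) == 1 else 0.0
     for a in range(1, 6) for b in range(1, 6)}
tree = (((1, 2), (4, 5)), 3)
print(L(tree, 1, 3, C))                          # -> 3.0 (ordering 1 2 5 4 3)
```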
33
Time complexity of the algorithm?
  • The recursion LT(u,v) is applied for each T, u, v.
    Each application takes O(n²) time.
  • Each pair of leaves has a unique least common
    ancestor.
  • LT(u,v) only needs to be computed if LCA(u,v) = T.
  • Total time: O(n⁴)

(Figure: subtree T = LCA(u, v) with leaves u and v)
34
Speed Improvements
  • Let C(T1,T2) be the maximum score over any leaf
    of T1 paired with any leaf of T2, and keep the
    candidate leaves sorted in decreasing order of
    their L values. Then:
  • For all m (rightmost leaf of T1), in decreasing
    order of LT1(u,m):
  • If LT1(u,m) + LT2(k0,v) + C(T1,T2) < CurrMax,
    exit the loop (k0 is the best candidate in T2,
    so no remaining m can improve on CurrMax)
  • For all k (leftmost leaf of T2), in decreasing
    order of LT2(k,v):
  • If LT1(u,m) + LT2(k,v) + C(T1,T2) < CurrMax,
    exit the inner loop
  • Else recompute CurrMax
  • In practice, this leads to great speed
    improvements: for 1500 genes, the running time
    drops from about 7 hours to about 7 minutes.
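A sketch of this bound-based early termination, under the assumption that the candidate lists are pre-sorted in decreasing score order (all names are illustrative; in the real algorithm C(T1,T2) would be precomputed rather than scanned here):

```python
def prune_search(cands_T1, cands_T2, sim, curr_max=float("-inf")):
    """Bound-based early termination over (m, k) pairs.

    cands_T1: (LT1(u, m), m) pairs, sorted by decreasing score.
    cands_T2: (LT2(k, v), k) pairs, sorted by decreasing score.
    sim(m, k): actual adjacency score, bounded above by C_max.
    """
    C_max = max(sim(m, k) for _, m in cands_T1 for _, k in cands_T2)
    best_k_score = cands_T2[0][0]                # k0: best candidate in T2
    for s1, m in cands_T1:
        if s1 + best_k_score + C_max < curr_max:
            break                                # no later m can beat curr_max
        for s2, k in cands_T2:
            if s1 + s2 + C_max < curr_max:
                break                            # no later k can beat curr_max
            curr_max = max(curr_max, s1 + s2 + sim(m, k))
    return curr_max

# Toy usage with made-up candidate lists and adjacency scores.
print(prune_search([(5.0, "a"), (3.0, "b")],
                   [(4.0, "x"), (1.0, "y")],
                   sim=lambda m, k: 1.0 if (m, k) == ("a", "x") else 0.0))
```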
