Title: L16: Micro-array analysis
1. L16: Micro-array analysis
- Dimension reduction
- Unsupervised clustering
2. PCA motivating example
- Consider the expression values of 2 genes over 6 samples.
- Clearly, the expression of g1 is not informative, and it suffices to look at g2 values.
- Dimensionality can be reduced by discarding the gene g1.
[Figure: the 6 samples plotted with g1 and g2 as axes]
3. PCA Ex2
- Consider the expression values of 2 genes over 6 samples.
- Clearly, the expression of the two genes is highly correlated.
- Projecting all the points onto a single line could explain most of the variation in the data.
4. PCA
- Suppose all of the data were to be reduced by projecting onto a single line $\beta$ through the mean $m$.
- How do we select the line $\beta$?
[Figure: data points, their mean m, and a candidate line β through m]
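5. PCA contd
- Let each point $x_k$ map to $x'_k = m + a_k \beta$, where $\beta$ is a unit vector. We want to minimize the error $J = \sum_k \lVert x'_k - x_k \rVert^2$.
- Observation 1: Each point $x_k$ maps to $x'_k = m + \beta^T (x_k - m)\,\beta$, i.e. $a_k = \beta^T (x_k - m)$.
[Figure: a point x_k and its projection x'_k onto the line β through m]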
6. Proof of Observation 1
Differentiating the error with respect to $a_k$ and setting the derivative to zero gives $a_k = \beta^T (x_k - m)$.
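The differentiation step can be written out as follows. This is a worked sketch that assumes the error is the summed squared distance $J$ between each point and its image on the line, and that the direction $\beta$ has unit length; neither assumption is spelled out in the transcript.

```latex
\begin{align*}
J(a_1,\dots,a_n,\beta)
  &= \sum_k \lVert (m + a_k\beta) - x_k \rVert^2 \\
  &= \sum_k a_k^2 \lVert\beta\rVert^2
     - 2\sum_k a_k\,\beta^T (x_k - m)
     + \sum_k \lVert x_k - m \rVert^2 , \\
\frac{\partial J}{\partial a_k}
  &= 2a_k - 2\beta^T (x_k - m) = 0
  \quad\Longrightarrow\quad
  a_k = \beta^T (x_k - m), \qquad \text{using } \lVert\beta\rVert = 1 .
\end{align*}
```

In words, the best image of $x_k$ on the line is its orthogonal projection.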
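7. Minimizing PCA Error
- To minimize the error, we must maximize $\beta^T S \beta$ subject to $\lVert\beta\rVert = 1$, where $S = \sum_k (x_k - m)(x_k - m)^T$ is the scatter matrix.
- At the maximum, $S\beta = \lambda\beta$ and hence $\lambda = \beta^T S \beta$; by definition, $\lambda$ is an eigenvalue of $S$ and $\beta$ the corresponding eigenvector.
- Therefore, we must choose the eigenvector corresponding to the largest eigenvalue.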
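The link between the error and the eigenvalue problem can be made explicit. The sketch below writes $\beta$ for the unit direction of the line and $S$ for the scatter matrix (standard definitions, assumed here), and substitutes $a_k = \beta^T(x_k - m)$ from Observation 1 into the squared error $J$.

```latex
\begin{align*}
J(\beta)
  &= \sum_k a_k^2 - 2\sum_k a_k^2 + \sum_k \lVert x_k - m \rVert^2 \\
  &= -\sum_k \beta^T (x_k - m)(x_k - m)^T \beta + \sum_k \lVert x_k - m \rVert^2 \\
  &= -\beta^T S \beta + \sum_k \lVert x_k - m \rVert^2 ,
  \qquad S = \sum_k (x_k - m)(x_k - m)^T . \\[4pt]
u(\beta,\lambda) &= \beta^T S \beta - \lambda(\beta^T\beta - 1), \qquad
\frac{\partial u}{\partial \beta} = 2S\beta - 2\lambda\beta = 0
  \;\Longrightarrow\; S\beta = \lambda\beta, \quad \beta^T S \beta = \lambda .
\end{align*}
```

The second sum does not depend on $\beta$, so minimizing $J$ is the same as maximizing $\beta^T S \beta$, and the maximum value attained is the largest eigenvalue of $S$.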
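8. PCA
- The single best dimension is given by the eigenvector corresponding to the largest eigenvalue of $S$.
- The best $k$ dimensions are given by the eigenvectors $\beta_1, \beta_2, \ldots, \beta_k$ corresponding to the $k$ largest eigenvalues.
- To obtain the $k$-dimensional representation, take $B^T M$, where the rows of $B^T$ are $\beta_1^T, \ldots, \beta_k^T$ and $M$ is the data matrix with one column per data point.
[Figure: the product B^T M, with B^T stacking the rows β_1^T, ..., β_k^T]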
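A minimal sketch of the single-best-dimension case (k = 1) in plain Python follows. The sample values are hypothetical, the data are mean-centered before projection (an assumption about how M is formed), and power iteration is used only as a simple stand-in for a proper eigenvalue routine.

```python
import math

def mean_center(points):
    """Return (mean, centered points) for a list of equal-length vectors."""
    n, d = len(points), len(points[0])
    m = [sum(p[j] for p in points) / n for j in range(d)]
    return m, [[p[j] - m[j] for j in range(d)] for p in points]

def scatter_matrix(centered):
    """S = sum_k (x_k - m)(x_k - m)^T for already-centered points."""
    d = len(centered[0])
    return [[sum(x[i] * x[j] for x in centered) for j in range(d)]
            for i in range(d)]

def top_eigenvector(S, iters=500):
    """Power iteration: approximate the largest eigenvalue of S and its unit
    eigenvector. Assumes the start vector is not orthogonal to that eigenvector."""
    d = len(S)
    b = [1.0 / math.sqrt(d)] * d
    for _ in range(iters):
        nb = [sum(S[i][j] * b[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(v * v for v in nb))
        b = [v / norm for v in nb]
    lam = sum(b[i] * S[i][j] * b[j] for i in range(d) for j in range(d))
    return lam, b

# Hypothetical expression values (g1, g2) in 6 samples, highly correlated.
samples = [[1.0, 1.1], [2.0, 1.9], [3.0, 3.2],
           [4.0, 3.9], [5.0, 5.1], [6.0, 5.8]]
m, centered = mean_center(samples)
S = scatter_matrix(centered)
lam, beta = top_eigenvector(S)

# With k = 1, B^T is the single row beta^T, so B^T M is one coordinate per sample.
coords = [sum(b * x for b, x in zip(beta, xc)) for xc in centered]

# Fraction of the total scatter captured by the single direction beta.
explained = lam / sum(S[i][i] for i in range(len(S)))
print("beta =", beta)
print("fraction of scatter explained =", explained)
```

For these correlated values almost all of the scatter lies along one direction, which is the situation described in the second PCA example.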
9. Clustering
- Suppose we are not given any classes.
- Instead, we are asked to partition the samples into clusters that make sense.
- Alternatively, partition genes into clusters.
- Clustering is part of unsupervised learning.
10. Microarray Data
- Microarray data are usually transformed into an intensity matrix (below).
- The intensity matrix allows biologists to find correlations between different genes (even if they are dissimilar) and to understand how gene functions might be related.
- This is where clustering comes into play.

Intensity (expression level) of each gene at each measured time:

          Time 1   Time i   Time N
Gene 1      10       8        10
Gene 2      10       0         9
Gene 3       4       8.6       3
Gene 4       7       8         3
Gene 5       1       2         3
11. Clustering of Microarray Data
- Plot each gene as a point in N-dimensional space.
- Make a distance matrix for the distance between every two gene points in the N-dimensional space.
- Genes with a small distance share the same expression characteristics and might be functionally related or similar.
- Clustering reveals groups of functionally related genes.
12. Graphing the intensity matrix in multi-dimensional space
13. The Distance Matrix, d
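As an illustration, a distance matrix can be computed directly from the five-gene intensity matrix shown earlier. This sketch assumes Euclidean distance as the base metric; the slides list it as one option among several.

```python
import math

# Intensity matrix from the slides: one row per gene,
# columns are Time 1, Time i, Time N.
intensity = {
    "Gene 1": [10, 8, 10],
    "Gene 2": [10, 0, 9],
    "Gene 3": [4, 8.6, 3],
    "Gene 4": [7, 8, 3],
    "Gene 5": [1, 2, 3],
}

def euclidean(x, y):
    """Straight-line distance between two expression vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

genes = list(intensity)
# d[i][j] is the distance between gene i and gene j.
d = [[euclidean(intensity[g], intensity[h]) for h in genes] for g in genes]

for g, row in zip(genes, d):
    print(g, " ".join(f"{v:5.1f}" for v in row))
```

Under this metric Gene 3 and Gene 4 form the closest pair (distance about 3.1), so they would be the first candidates for sharing a cluster.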
14. Homogeneity and Separation Principles
- Homogeneity: Elements within a cluster are close to each other.
- Separation: Elements in different clusters are further apart from each other.
- Clustering is not an easy task!
Given these points, a clustering algorithm might make two distinct clusters as follows.
15. Bad Clustering
This clustering violates both the Homogeneity and Separation principles:
- Close distances between points in separate clusters
- Far distances between points in the same cluster
16. Good Clustering
This clustering satisfies both the Homogeneity and Separation principles.
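One way to make the two principles concrete is to compare the largest within-cluster distance with the smallest between-cluster distance. The sketch below uses that comparison as an illustrative measure (an assumption, not a definition from the slides), with hypothetical 2-D points.

```python
import math
from itertools import combinations

def homogeneity_separation(clusters):
    """Return (largest within-cluster distance, smallest between-cluster distance)."""
    within = max(
        (math.dist(p, q) for c in clusters for p, q in combinations(c, 2)),
        default=0.0,
    )
    between = min(
        math.dist(p, q)
        for a, b in combinations(clusters, 2)
        for p in a
        for q in b
    )
    return within, between

# Hypothetical points forming two visually obvious groups.
left = [(1, 1), (1, 2), (2, 1)]
right = [(8, 8), (9, 8), (8, 9)]

good = [left, right]
# A bad clustering: each cluster mixes points from both groups.
bad = [[left[0], left[1], right[0]], [left[2], right[1], right[2]]]

print("good:", homogeneity_separation(good))
print("bad: ", homogeneity_separation(bad))
```

For the good clustering the within-cluster value is much smaller than the between-cluster value; for the bad one the relation is reversed.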
17. Clustering Techniques
- Agglomerative: Start with every element in its own cluster, and iteratively join clusters together.
- Divisive: Start with one cluster and iteratively divide it into smaller clusters.
- Hierarchical: Organize elements into a tree; leaves represent genes, and the length of the path between two leaves represents the distance between those genes. Similar genes lie within the same subtrees.
18. Hierarchical Clustering
- Initially, each element is its own cluster.
- Merge the two closest clusters, and recurse.
- Key question: What is closest?
- How do you compute the distance between clusters?
19. Hierarchical Clustering: Computing Distances
- $d_{\min}(C, C^*) = \min d(x, y)$ over all elements $x$ in $C$ and $y$ in $C^*$
  - Distance between two clusters is the smallest distance between any pair of their elements.
- $d_{\text{avg}}(C, C^*) = \frac{1}{|C|\,|C^*|} \sum d(x, y)$ over all elements $x$ in $C$ and $y$ in $C^*$
  - Distance between two clusters is the average distance between all pairs of their elements.
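The merge loop and the two cluster distances can be sketched together. The function names `d_min`, `d_avg`, and `agglomerate`, the stopping parameter `k`, and the sample points are all illustrative choices; the base distance defaults to Euclidean as an assumption.

```python
import math
from itertools import combinations

def d_min(c1, c2, d):
    """Smallest distance between any pair of elements (one from each cluster)."""
    return min(d(x, y) for x in c1 for y in c2)

def d_avg(c1, c2, d):
    """Average distance over all pairs of elements (one from each cluster)."""
    return sum(d(x, y) for x in c1 for y in c2) / (len(c1) * len(c2))

def agglomerate(points, cluster_dist, d=math.dist, k=1):
    """Start with singleton clusters and merge the two closest clusters
    until k clusters remain. Returns (clusters, merges in order)."""
    clusters = [[p] for p in points]
    merges = []
    while len(clusters) > k:
        # Pick the pair of clusters with the smallest cluster distance.
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: cluster_dist(clusters[ij[0]], clusters[ij[1]], d),
        )
        merges.append((clusters[i], clusters[j]))
        merged = clusters[i] + clusters[j]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)]
        clusters.append(merged)
    return clusters, merges

# Hypothetical points: two tight pairs and one outlier.
points = [(0, 0), (0, 1), (5, 5), (5, 6), (10, 0)]
for name, linkage in (("d_min", d_min), ("d_avg", d_avg)):
    clusters, merges = agglomerate(points, linkage, k=3)
    print(name, clusters)
```

Recording the merges in order is what yields the tree: each merge corresponds to one internal node. This version recomputes all cluster distances at every step, which is simple but not efficient.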
20. Computing Distances (continued)
- However, we still need a base distance metric for pairs of genes:
  - Euclidean distance
  - Manhattan distance
  - Dot product
  - Mutual information
What are some qualitative differences between these?
21. Geometrical interpretation of distances
- The distance measures are all related.
- In some cases, the magnitude of the vector is
important, in other cases it is not.
22. Comparison between metrics
- Euclidean and Manhattan distances tend to perform similarly and emphasize the overall magnitude of expression.
- The dot product (on length-normalized vectors) is very useful if the shape of the expression vector is more important than its magnitude.
- The above metrics are less useful for identifying genes whose expression levels are anti-correlated. One might imagine an instance in which the same transcription factor can cause both enhancement and repression of expression. In this case, the squared correlation ($r^2$) or mutual information is sometimes used.
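The qualitative differences show up on a small example. In this sketch the vectors are hypothetical, the dot product is taken on length-normalized vectors (cosine similarity, an assumption about how "shape" is compared), and mutual information is left out because it needs a discretization choice.

```python
import math

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def manhattan(x, y):
    return sum(abs(a - b) for a, b in zip(x, y))

def cosine_similarity(x, y):
    """Dot product of the two vectors after scaling each to unit length."""
    dot = sum(a * b for a, b in zip(x, y))
    return dot / (math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y)))

def pearson_r(x, y):
    """Pearson correlation coefficient; r**2 treats correlation and
    anti-correlation alike."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Hypothetical expression profiles over 4 time points.
x = [1, 2, 3, 4]
y = [10, 20, 30, 40]   # same shape as x, ten times the magnitude
z = [4, 3, 2, 1]       # anti-correlated with x

print("x vs y: euclidean", euclidean(x, y), "manhattan", manhattan(x, y),
      "cosine", cosine_similarity(x, y))
print("x vs z: euclidean", euclidean(x, z), "r", pearson_r(x, z),
      "r^2", pearson_r(x, z) ** 2)
```

Euclidean and Manhattan place x far from y because of the magnitude difference, while the normalized dot product treats them as identical in shape. For x and z the correlation is negative, and squaring it rates the pair as strongly related.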
24. But how many orderings can we have?
[Figure: hierarchical clustering tree with leaves labeled 1, 2, 4, 5, 3]
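25. But how many orderings can we have? (contd)
- For $n$ leaves there are $n-1$ internal nodes.
- Each flip at an internal node creates a new linear ordering of the leaves.
- There are therefore $2^{n-1}$ orderings.
[Figure: the tree with leaves labeled 1, 2, 4, 5, 3 and one internal node marked "E.g., flip this node"]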
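The count can be checked by enumerating every combination of flips on a small tree. The tree shape below is hypothetical (the figure's topology is not recoverable from the transcript); trees are written as nested pairs with leaf labels at the bottom.

```python
def orderings(tree):
    """All leaf orderings reachable by flipping internal nodes of a binary tree.
    A tree is either a leaf label or a pair (left, right)."""
    if not isinstance(tree, tuple):
        return [[tree]]
    left, right = tree
    result = []
    for a in orderings(left):
        for b in orderings(right):
            result.append(a + b)   # children in their original order
            result.append(b + a)   # this internal node flipped
    return result

# Hypothetical binary tree over the 5 leaf labels used on the slide.
tree = (((1, 2), (4, 5)), 3)
all_orders = orderings(tree)

print(len(all_orders))                      # 16, i.e. 2 ** (5 - 1)
print(len({tuple(o) for o in all_orders}))  # 16, all orderings are distinct
```

Each of the 4 internal nodes doubles the number of orderings, which gives 2 ** 4 = 16 for 5 leaves.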
30. Bar-Joseph et al., Bioinformatics (2001)
31. Computing an Optimal Ordering
- Define $L_T(u,v)$ as the optimum score over all orderings of the subtree rooted at $T$ in which
  - $u$ is the leftmost leaf, and
  - $v$ is the rightmost leaf.
- Is it sufficient to compute $L_T(u,v)$ for all $T, u, v$?
[Figure: subtree T with leftmost leaf u and rightmost leaf v]
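32. Computing an Optimal Ordering (contd)
[Figure: subtree T with children T1 and T2; u and k are the leftmost and rightmost leaves of T1, and m and v are the leftmost and rightmost leaves of T2]
$L_T(u,v) = \max_{k,m} \{ L_{T_1}(u,k) + L_{T_2}(m,v) + C(k,m) \}$
where $C(k,m)$ is the similarity between the leaves $k$ and $m$, which become adjacent where the two subtrees meet.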
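The recursion can be sketched directly and checked against brute-force enumeration on a small tree. This is an unoptimized illustration, not the published implementation: it assumes the score of an ordering is the sum of similarities between adjacent leaves, and the tree, the `values`, and the similarity matrix `sim` are hypothetical.

```python
def best_scores(tree, sim):
    """Return {(u, v): best score} over orderings of `tree` whose leftmost
    leaf is u and rightmost leaf is v. A tree is a leaf label or a pair (T1, T2)."""
    if not isinstance(tree, tuple):
        return {(tree, tree): 0.0}
    t1, t2 = tree
    s1, s2 = best_scores(t1, sim), best_scores(t2, sim)
    best = {}
    # Try T1 on the left and T2 on the right, then the flipped arrangement.
    for left, right in ((s1, s2), (s2, s1)):
        for (u, k), a in left.items():
            for (m, v), b in right.items():
                score = a + b + sim[k][m]   # k and m become adjacent
                if score > best.get((u, v), float("-inf")):
                    best[(u, v)] = score
    return best

def orderings(tree):
    """Brute force: every leaf ordering reachable by flipping internal nodes."""
    if not isinstance(tree, tuple):
        return [[tree]]
    left, right = tree
    return [o for a in orderings(left) for b in orderings(right)
            for o in (a + b, b + a)]

def ordering_score(order, sim):
    """Sum of similarities between neighbouring leaves in a linear ordering."""
    return sum(sim[p][q] for p, q in zip(order, order[1:]))

# Hypothetical data: 5 leaves with one expression value each;
# similarity is the negated absolute difference.
values = [0.0, 1.0, 5.0, 6.0, 3.0]
sim = [[-abs(a - b) for b in values] for a in values]
tree = (((0, 1), (2, 3)), 4)

dp_best = max(best_scores(tree, sim).values())
brute_best = max(ordering_score(o, sim) for o in orderings(tree))
print(dp_best, brute_best)
```

The dynamic program only combines per-subtree tables, so it avoids enumerating all $2^{n-1}$ orderings; the brute-force function is included purely as a cross-check on small inputs.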
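33. Time complexity of the algorithm?
- The recursion $L_T(u,w)$ is applied for each $T, u, w$. Each recursion takes $O(n^2)$ time.
- Each pair of nodes has a unique least common ancestor (LCA).
- $L_T(u,w)$ only needs to be computed if $LCA(u,w) = T$.
- There are $O(n^2)$ leaf pairs, each handled once at its LCA, so the total time is $O(n^4)$.
[Figure: subtree T with leaves u and w]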
34. Speed Improvements
- For all m in LT1(u,R):
  - If LT1(u,m) + LT2(k0,w) + C(T1,T2) < CurrMax:
    - Exit loop
  - For all k in LT1(w,L):
    - If LT1(u,m) + LT2(k,w) + C(T1,T2) < CurrMax:
      - Exit loop
    - Else recompute CurrMax.
- In practice, this leads to great speed improvements: for 1500 genes, the running time drops from 7 hrs to 7 min.