L16: Micro-array analysis
1
L16: Micro-array analysis
  • Dimension reduction
  • Unsupervised clustering

2
PCA motivating example
  • Consider the expression values of 2 genes over 6
    samples.
  • Clearly, the expression of g1 is not informative,
    and it suffices to look at g2 values.
  • Dimensionality can be reduced by discarding the
    gene g1.

(Figure: expression values of g1 and g2 across the
six samples)
3
PCA example 2
  • Consider the expression values of 2 genes over 6
    samples.
  • Clearly, the expression of the two genes is
    highly correlated.
  • Projecting all the points onto a single line
    could explain most of the data.

4
PCA
  • Suppose all of the data were to be reduced by
    projecting onto a single line φ through the
    mean m.
  • How do we select the line φ?

(Figure: a candidate line φ passing through the
mean m)
5
PCA contd
  • Let each point xk map to x'k = m + ak·φ. We want
    to minimize the error Σk ‖xk − x'k‖²
  • Observation 1: Each point xk maps to
    x'k = m + φT(xk − m)·φ
  • (i.e., ak = φT(xk − m))

(Figure: a point xk and its projection x'k onto the
line φ through m)
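A minimal numpy sketch of this projection, using made-up two-gene data (the direction φ here is an arbitrary unit vector, not yet the optimal one):

```python
import numpy as np

# Hypothetical data: rows are points x_k, columns are two genes.
X = np.array([[2.0, 1.0], [3.0, 2.1], [4.0, 2.9],
              [5.0, 4.2], [6.0, 5.1], [7.0, 5.9]])

m = X.mean(axis=0)                 # mean point m
phi = np.array([1.0, 1.0])
phi = phi / np.linalg.norm(phi)    # a candidate unit direction phi

a = (X - m) @ phi                  # a_k = phi^T (x_k - m)
X_proj = m + np.outer(a, phi)      # x'_k = m + a_k * phi
error = np.sum((X - X_proj) ** 2)  # total squared reconstruction error
print(error)
```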
6
Proof of Observation 1
The error contributed by point xk is
Ek = ‖xk − m − ak·φ‖².
Differentiating w.r.t. ak and setting to zero gives
dEk/dak = −2φT(xk − m) + 2ak·φTφ = 0.
Since φ is a unit vector (φTφ = 1), ak = φT(xk − m).
7
Minimizing PCA Error
  • To minimize the error, we must maximize φTSφ,
    where S is the covariance matrix of the data.
  • Maximizing φTSφ subject to φTφ = 1 gives
    Sφ = λφ, so λ = φTSφ is an eigenvalue of S and
    φ the corresponding eigenvector.
  • Therefore, we must choose the eigenvector
    corresponding to the largest eigenvalue.
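Spelled out as a standard Lagrange-multiplier step (a reconstruction, not taken verbatim from the slides):

```latex
\max_{\varphi}\ \varphi^{T} S \varphi
\quad \text{subject to} \quad \varphi^{T}\varphi = 1 .
\qquad
\frac{\partial}{\partial \varphi}
\left[ \varphi^{T} S \varphi
       - \lambda \left( \varphi^{T}\varphi - 1 \right) \right]
  = 2S\varphi - 2\lambda\varphi = 0
\;\Longrightarrow\;
S\varphi = \lambda\varphi ,
\quad
\varphi^{T} S \varphi = \lambda .
```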

8
PCA
  • The single best dimension is given by the
    eigenvector of the largest eigenvalue of S.
  • The best k dimensions are given by the
    eigenvectors φ1, φ2, …, φk corresponding to the
    k largest eigenvalues.
  • To obtain the k-dimensional representation, stack
    φ1T, …, φkT as the rows of BT and take BTM, where
    M is the mean-centered data matrix.
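A minimal numpy sketch of the whole procedure, on toy random data (a plain eigendecomposition; production code would typically use an SVD instead):

```python
import numpy as np

def pca(X, k):
    """Project the rows of X onto its top-k principal components."""
    m = X.mean(axis=0)
    Xc = X - m                            # mean-center the data
    S = np.cov(Xc, rowvar=False)          # covariance matrix S
    eigvals, eigvecs = np.linalg.eigh(S)  # eigh since S is symmetric
    order = np.argsort(eigvals)[::-1]     # largest eigenvalues first
    B = eigvecs[:, order[:k]]             # columns are phi_1 ... phi_k
    return Xc @ B                         # per-point coordinates (B^T M, transposed)

X = np.random.default_rng(0).normal(size=(6, 5))  # toy data: 6 samples, 5 genes
print(pca(X, 2).shape)                    # -> (6, 2)
```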
9
Clustering
  • Suppose we are not given any classes.
  • Instead, we are asked to partition the samples
    into clusters that make sense.
  • Alternatively, partition genes into clusters.
  • Clustering is part of unsupervised learning

10
Microarray Data
  • Microarray data are usually transformed into an
    intensity matrix (below).
  • The intensity matrix allows biologists to find
    correlations between different genes (even if
    they are dissimilar) and to understand how gene
    functions might be related.
  • This is where clustering comes into play.



           Time 1   Time i   Time N
  Gene 1     10       8        10
  Gene 2     10       0         9
  Gene 3      4       8.6       3
  Gene 4      7       8         3
  Gene 5      1       2         3

Intensity (expression level) of each gene at each
measured time point
11
Clustering of Microarray Data
  • Plot each gene as a point in N-dimensional space
  • Make a distance matrix for the distance between
    every two gene points in the N-dimensional space
  • Genes with a small distance share the same
    expression characteristics and might be
    functionally related or similar
  • Clustering reveals groups of functionally related
    genes
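As a concrete sketch, the distance matrix for the intensity matrix above can be computed like this (Euclidean distance is assumed here; other base metrics are discussed below):

```python
import numpy as np

# The intensity matrix from the slide: rows are genes, columns time points.
M = np.array([
    [10, 8.0, 10],
    [10, 0.0,  9],
    [ 4, 8.6,  3],
    [ 7, 8.0,  3],
    [ 1, 2.0,  3],
])

# d[i, j] = Euclidean distance between gene i and gene j in N-dim space.
diff = M[:, None, :] - M[None, :, :]
d = np.sqrt((diff ** 2).sum(axis=-1))
print(np.round(d, 1))
```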

12
Graphing the intensity matrix in multi-dimensional
space
13
The Distance Matrix, d
14
Homogeneity and Separation Principles
  • Homogeneity: elements within a cluster are close
    to each other
  • Separation: elements in different clusters are
    far apart from each other
  • Clustering is not an easy task!

Given these points, a clustering algorithm might
make two distinct clusters as follows
15
Bad Clustering
This clustering violates both the homogeneity and
separation principles:
  • points in separate clusters are close together
  • points in the same cluster are far apart
16
Good Clustering
This clustering satisfies both the homogeneity and
separation principles
17
Clustering Techniques
  • Agglomerative: start with every element in its
    own cluster, and iteratively join clusters
    together
  • Divisive: start with one cluster and iteratively
    divide it into smaller clusters
  • Hierarchical: organize elements into a tree.
    Leaves represent genes, and the length of the
    path between two leaves represents the distance
    between those genes. Similar genes lie within
    the same subtrees.

18
Hierarchical Clustering
  • Initially, each element is its own cluster
  • Merge the two closest clusters, and recurse
  • Key question: what does "closest" mean?
  • How do you compute the distance between clusters?

19
Hierarchical Clustering Computing Distances
  • dmin(C, C') = min d(x,y) over all elements
    x in C and y in C'
  • Distance between two clusters is the smallest
    distance between any pair of their elements
  • davg(C, C') = (1 / (|C|·|C'|)) Σ d(x,y) over all
    elements x in C and y in C'
  • Distance between two clusters is the average
    distance between all pairs of their elements
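A minimal sketch of agglomerative clustering with these two linkage rules, assuming a precomputed distance matrix d like the one computed earlier (function names are illustrative, and the O(n³) loop is for clarity, not speed):

```python
import numpy as np
from itertools import combinations

def d_min(C1, C2, d):
    """Smallest distance between any pair of elements (single linkage)."""
    return min(d[x, y] for x in C1 for y in C2)

def d_avg(C1, C2, d):
    """Average distance over all pairs of elements (average linkage)."""
    return sum(d[x, y] for x in C1 for y in C2) / (len(C1) * len(C2))

def agglomerate(d, n_clusters, linkage=d_min):
    """Repeatedly merge the two closest clusters until n_clusters remain."""
    clusters = [[i] for i in range(d.shape[0])]   # each element starts alone
    while len(clusters) > n_clusters:
        i, j = min(combinations(range(len(clusters)), 2),
                   key=lambda p: linkage(clusters[p[0]], clusters[p[1]], d))
        clusters[i] += clusters.pop(j)            # merge cluster j into i
    return clusters
```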

20
Computing Distances (continued)
  • However, we still need a base distance metric
    for pairs of genes
  • Euclidean distance
  • Manhattan distance
  • Dot Product
  • Mutual information

What are some qualitative differences between
these?
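Before comparing them, here is a sketch of each metric for a pair of expression vectors x and y (the dot product is shown in its normalized, cosine form, and the mutual-information estimate uses simple histogram binning, one common choice; these are illustrative implementations, not the slides' own):

```python
import numpy as np

def euclidean(x, y):
    return np.sqrt(np.sum((x - y) ** 2))

def manhattan(x, y):
    return np.sum(np.abs(x - y))

def dot_product(x, y):
    # normalized dot product (cosine): compares shape, not magnitude
    return np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

def mutual_information(x, y, bins=5):
    # crude histogram-based estimate of I(X;Y), in nats
    pxy, _, _ = np.histogram2d(x, y, bins=bins)
    pxy = pxy / pxy.sum()
    px, py = pxy.sum(axis=1), pxy.sum(axis=0)
    nz = pxy > 0                                  # avoid log(0)
    return np.sum(pxy[nz] * np.log(pxy[nz] / np.outer(px, py)[nz]))
```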
21
Geometrical interpretation of distances
  • The distance measures are all related.
  • In some cases, the magnitude of the vector is
    important, in other cases it is not.

22
Comparison between metrics
  • Euclidean and Manhattan tend to perform similarly
    and emphasize the overall magnitude of
    expression.
  • The dot-product is very useful if the shape of
    the expression vector is more important than its
    magnitude.
  • The above metrics are less useful for identifying
    genes whose expression levels are
    anti-correlated. One might imagine an instance in
    which the same transcription factor causes
    both enhancement and repression of expression.
    In this case, the squared correlation (r²) or
    mutual information is sometimes used.
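A tiny worked example (made-up vectors) of why r² catches anti-correlation where the other metrics do not:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = -x + 6.0                           # perfectly anti-correlated with x

r = np.corrcoef(x, y)[0, 1]
print(r, r ** 2)                       # r = -1.0 but r^2 = 1.0: the
                                       # relationship is fully captured
xc, yc = x - x.mean(), y - y.mean()
print(np.dot(xc, yc))                  # centered dot product is -10.0,
                                       # which a similarity search ranks last
```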

24
But how many orderings can we have?
(Figure: a clustering tree whose leaves, read left
to right, give the ordering 1, 2, 4, 5, 3)
25
  • For n leaves there are n−1 internal nodes
  • Each flip of an internal node creates a new
    linear ordering of the leaves
  • There are therefore 2^(n−1) orderings

E.g., flipping an internal node reverses the order
of the leaves beneath it.

(Figure: the same tree after one flip, giving a
different ordering of leaves 1, 2, 4, 5, 3)
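A small sketch that enumerates these orderings by recursively flipping internal nodes (the tuple tree encoding is an assumption for illustration):

```python
def orderings(tree):
    """All leaf orderings reachable by flipping internal nodes.

    A tree is a leaf label or a (left, right) tuple; for n leaves
    this yields 2^(n-1) orderings.
    """
    if not isinstance(tree, tuple):            # a single leaf
        return [[tree]]
    result = []
    for left in orderings(tree[0]):
        for right in orderings(tree[1]):
            result.append(left + right)        # node kept as-is
            result.append(right + left)        # node flipped
    return result

tree = (((1, 2), (4, 5)), 3)                   # 5 leaves, 4 internal nodes
print(len(orderings(tree)))                    # -> 16 = 2^(5-1)
```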
30
Bar-Joseph et al., Bioinformatics (2001)
31
Computing an Optimal Ordering
  • Define LT(u,v) as the optimum score over all
    orderings of the subtree rooted at T in which
  • u is the leftmost leaf, and
  • v is the rightmost leaf
  • Is it sufficient to compute LT(u,v) for all
    T, u, v?

(Figure: subtree T with leftmost leaf u and
rightmost leaf v)
32
(Figure: T splits into subtrees T1 and T2; u and k
are the leftmost and rightmost leaves of T1's
ordering, m and v the leftmost and rightmost leaves
of T2's)

LT(u,v) = max over k, m of
          LT1(u,k) + LT2(m,v) + C(k,m)

where C(k,m) is the score for making leaves k and m
adjacent.
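A direct, un-memoized sketch of this recursion (tree encoding as in the earlier snippet; C is assumed to be a lookup of adjacency scores on leaf pairs; with memoization and the LCA observation on the next slide this becomes the O(n⁴) algorithm):

```python
def leaves(T):
    """All leaf labels of T (a leaf label or a (left, right) tuple)."""
    return [T] if not isinstance(T, tuple) else leaves(T[0]) + leaves(T[1])

def L(T, u, v, C):
    """Optimal score over orderings of T with leftmost leaf u and
    rightmost leaf v; C[k, m] is the score for making k and m adjacent."""
    if not isinstance(T, tuple):
        return 0.0 if u == v == T else float("-inf")
    best = float("-inf")
    for T1, T2 in ((T[0], T[1]), (T[1], T[0])):  # the node itself may flip
        if u in leaves(T1) and v in leaves(T2):
            for k in leaves(T1):                 # rightmost leaf of T1's part
                for m in leaves(T2):             # leftmost leaf of T2's part
                    s = L(T1, u, k, C) + L(T2, m, v, C) + C[k, m]
                    best = max(best, s)
    return best

# Toy score: adjacent integer labels score 1, everything else 0.
C = {(a, b): 1.0 if abs(a - b) == 1 else 0.0
     for a in range(1, 6) for b in range(1, 6)}
tree = (((1, 2), (4, 5)), 3)
print(L(tree, 1, 3, C))                          # -> 3.0 (ordering 1 2 5 4 3)
```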
33
Time complexity of the algorithm?
  • The recursion LT(u,v) is applied for each T, u, v.
    Each application takes O(n²) time.
  • Each pair of leaves has a unique least common
    ancestor.
  • LT(u,v) only needs to be computed if LCA(u,v) = T.
  • Total time: O(n⁴)

(Figure: subtree T = LCA(u, v) with leaves u and v)
34
Speed Improvements
  • Let C(T1,T2) be the maximum score over any leaf
    of T1 paired with any leaf of T2, and keep the
    candidate leaves sorted in decreasing order of
    their L values. Then:
  • For all m (rightmost leaf of T1), in decreasing
    order of LT1(u,m):
  • If LT1(u,m) + LT2(k0,v) + C(T1,T2) < CurrMax,
    exit the loop (k0 is the best candidate in T2,
    so no remaining m can improve on CurrMax)
  • For all k (leftmost leaf of T2), in decreasing
    order of LT2(k,v):
  • If LT1(u,m) + LT2(k,v) + C(T1,T2) < CurrMax,
    exit the inner loop
  • Else recompute CurrMax
  • In practice, this leads to great speed
    improvements: for 1500 genes, the running time
    drops from about 7 hours to about 7 minutes.
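A sketch of this bound-based early termination, under the assumption that the candidate lists are pre-sorted in decreasing score order (all names are illustrative; in the real algorithm C(T1,T2) would be precomputed rather than scanned here):

```python
def prune_search(cands_T1, cands_T2, sim, curr_max=float("-inf")):
    """Bound-based early termination over (m, k) pairs.

    cands_T1: (LT1(u, m), m) pairs, sorted by decreasing score.
    cands_T2: (LT2(k, v), k) pairs, sorted by decreasing score.
    sim(m, k): actual adjacency score, bounded above by C_max.
    """
    C_max = max(sim(m, k) for _, m in cands_T1 for _, k in cands_T2)
    best_k_score = cands_T2[0][0]                # k0: best candidate in T2
    for s1, m in cands_T1:
        if s1 + best_k_score + C_max < curr_max:
            break                                # no later m can beat curr_max
        for s2, k in cands_T2:
            if s1 + s2 + C_max < curr_max:
                break                            # no later k can beat curr_max
            curr_max = max(curr_max, s1 + s2 + sim(m, k))
    return curr_max

# Toy usage with made-up candidate lists and adjacency scores.
print(prune_search([(5.0, "a"), (3.0, "b")],
                   [(4.0, "x"), (1.0, "y")],
                   sim=lambda m, k: 1.0 if (m, k) == ("a", "x") else 0.0))
```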
