Cluster Analysis - PowerPoint PPT Presentation

Slides: 32
Provided by: davidm133

1
Cluster Analysis
  • EPP 245/298
  • Statistical Analysis of
  • Laboratory Data

2
Supervised and Unsupervised Learning
  • Logistic regression and Fisher's LDA and QDA are
    examples of supervised learning.
  • This means that there is a training set which
    contains known classifications into groups that
    can be used to derive a classification rule.
  • This can be then evaluated on a test set, or
    this can be done repeatedly using cross
    validation.

3
Unsupervised Learning
  • Unsupervised learning means (in this instance)
    that we are trying to discover a division of
    objects into classes without any training set of
    known classes, without knowing in advance what
    the classes are, or even how many classes there
    are.
  • It should not have to be said that this is a
    difficult task.

4
Cluster Analysis
  • Cluster analysis, or simply clustering, is a
    collection of methods for unsupervised class
    discovery.
  • These methods are widely used for gene expression
    data, proteomics data, and other omics data types.
  • They are likely more widely used than they should
    be.
  • One can cluster subjects (types of cancer) or
    genes (to find pathways or co-regulation).

5
Distance Measures
  • It turns out that the most crucial decision to
    make in choosing a clustering method is defining
    what it means for two vectors to be close or far.
  • There are other components to the choice, but
    these are all secondary
  • Often the distance measure is implicit in the
    choice of method, but a wise decision maker knows
    what he/she is choosing.

6
  • A true distance, or metric, is a function defined
    on pairs of objects that satisfies a number of
    properties:
  • $D(x,y) = D(y,x)$
  • $D(x,y) \ge 0$
  • $D(x,y) = 0 \iff x = y$
  • $D(x,y) + D(y,z) \ge D(x,z)$ (triangle inequality)
  • The classic example of a metric is Euclidean
    distance. If $x = (x_1, x_2, \ldots, x_p)$ and
    $y = (y_1, y_2, \ldots, y_p)$ are vectors, the
    Euclidean distance is
    $D(x,y) = \sqrt{(x_1-y_1)^2 + \cdots + (x_p-y_p)^2}$
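
The metric properties listed on this slide can be checked numerically. The R sketch below is only an illustration: the helper name euclid and the three test points are made up for this example and are not part of the original slides.

```r
# Hypothetical helper: Euclidean distance between two numeric vectors
euclid <- function(x, y) sqrt(sum((x - y)^2))

# Three arbitrary points chosen for illustration
x <- c(0, 0)
y <- c(3, 4)
z <- c(6, 0)

d_xy <- euclid(x, y)   # 5
d_yz <- euclid(y, z)   # 5
d_xz <- euclid(x, z)   # 6

symmetric   <- isTRUE(all.equal(d_xy, euclid(y, x)))  # D(x,y) = D(y,x)
nonnegative <- all(c(d_xy, d_yz, d_xz) >= 0)          # D(x,y) >= 0
zero_self   <- euclid(x, x) == 0                      # D(x,x) = 0
triangle_ok <- d_xy + d_yz >= d_xz                    # triangle inequality

# The built-in dist() returns the same pairwise distances
d_builtin <- as.matrix(dist(rbind(x, y, z)))
```

A check on three points is of course not a proof; it only shows what each property means in practice.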

7
Euclidean Distance
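[Figure: right triangle with hypotenuse $D(x,y)$ joining the points $x = (x_1, x_2)$ and $y = (y_1, y_2)$, and legs $x_1 - y_1$ and $x_2 - y_2$]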
8
Triangle Inequality
[Figure: triangle with vertices $x$, $y$, $z$ and sides $D(x,y)$, $D(y,z)$, $D(x,z)$]
9
Other Metrics
  • The city block metric is the distance when only
    horizontal and vertical travel is allowed, as in
    walking in a city.
  • It turns out to be $|x_1-y_1| + \cdots + |x_p-y_p|$
    instead of the Euclidean distance
    $\sqrt{(x_1-y_1)^2 + \cdots + (x_p-y_p)^2}$
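
As a small worked example of the two formulas, the following R sketch computes both distances by hand and then through dist(), whose method argument accepts "manhattan" (city block) and "euclidean". The vectors are arbitrary values chosen so the arithmetic is easy to follow.

```r
# Two arbitrary vectors for illustration
x <- c(1, 2, 3)
y <- c(4, 6, 3)

city_block <- sum(abs(x - y))        # |1-4| + |2-6| + |3-3| = 7
euclidean  <- sqrt(sum((x - y)^2))   # sqrt(9 + 16 + 0) = 5

# dist() computes the same quantities for the rows of a matrix
d_man <- dist(rbind(x, y), method = "manhattan")
d_euc <- dist(rbind(x, y), method = "euclidean")
```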

10
Mahalanobis Distance
  • Mahalanobis distance is a kind of weighted
    Euclidean distance
  • It produces distance contours of the same shape
    as a data distribution
  • It is often more appropriate than Euclidean
    distance
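
A minimal sketch of the idea in R, assuming the iris measurements as example data and the sample covariance as the weighting matrix (both are choices made for this illustration). Note that the base function mahalanobis() returns squared distances.

```r
X      <- as.matrix(iris[, 1:4])
center <- colMeans(X)
S      <- cov(X)

# mahalanobis() returns SQUARED distances from each row to 'center'
d2      <- mahalanobis(X, center, S)
d_mahal <- sqrt(d2)

# With the identity matrix as "covariance" the weighting disappears
# and the result is ordinary Euclidean distance to the center
d_euclid <- sqrt(mahalanobis(X, center, diag(4)))
d_check  <- sqrt(rowSums(sweep(X, 2, center)^2))
```

Because the distances are weighted by the inverse covariance, points along a direction of high variance count as closer than points equally far away in a low-variance direction, which is what gives contours shaped like the data.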

11
(No Transcript)
12
(No Transcript)
13
(No Transcript)
14
Non-Metric Measures of Similarity
  • A common measure of similarity used for
    microarray data is the (absolute) correlation.
  • This rates two data vectors as similar if they
    move up and down together, without worrying about
    their absolute magnitudes
  • This is not a metric, since it violates several
    of the required properties
  • We use $1 - |\rho|$ as the distance, where $\rho$
    is the correlation
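
A small R sketch of a correlation-based distance, using made-up toy profiles (g1 to g4 are hypothetical "genes", not real data). It also shows one way the metric properties fail: two different vectors can be at distance zero.

```r
# Toy profiles: rows are genes, columns are arrays (values invented)
g1 <- c(1, 2, 3, 4, 5)
g2 <- c(10, 20, 30, 40, 50)   # same pattern as g1, larger magnitude
g3 <- c(5, 4, 3, 2, 1)        # mirror image of g1
g4 <- c(2, 5, 1, 4, 3)        # only weakly related to g1
expr <- rbind(g1, g2, g3, g4)

# cor() correlates columns, so transpose to correlate genes
rho   <- cor(t(expr))
d_abs <- as.dist(1 - abs(rho))   # distance from absolute correlation
```

Here g1 and g2 differ by a factor of ten and g3 moves in the opposite direction, yet all three are at distance 0 from each other, while g4 is at distance 0.9 from g1. The resulting object can be passed to hclust() like any other dist object.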

15
Agglomerative Hierarchical Clustering
  • We start with all data items as individuals
  • In step 1, we join the two closest individuals
  • In each subsequent step, we join the two closest
    individuals or clusters
  • This requires defining the distance between two
    groups as a number that can be compared to the
    distance between individuals
  • We can use the R commands hclust or agnes

16
Group Distances
  • Complete link clustering defines the distance
    between two groups as the maximum distance
    between any element of one group and any of the
    other
  • Single link clustering defines the distance
    between two groups as the minimum distance
    between any element of one group and any of the
    other
  • Average link clustering defines the distance
    between two groups as the mean distance between
    elements of one group and elements of the other
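
The three group distances correspond to the method argument of hclust(). The sketch below is one possible comparison on the iris measurements; cutting each tree into three groups with cutree() is an assumption made here for illustration, not something the slides prescribe.

```r
iris.d <- dist(iris[, 1:4])   # Euclidean distances between flowers

hc_complete <- hclust(iris.d, method = "complete")  # maximum distance (default)
hc_single   <- hclust(iris.d, method = "single")    # minimum distance
hc_average  <- hclust(iris.d, method = "average")   # mean distance

# Cut each tree into 3 groups; one column of labels per linkage
groups <- sapply(
  list(complete = hc_complete, single = hc_single, average = hc_average),
  cutree, k = 3
)

# Compare one of the groupings with the known species
comparison <- table(groups[, "average"], iris$Species)
```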

17
> iris.d <- dist(iris[,1:4])
> iris.hc <- hclust(iris.d)
> plot(iris.hc)
18
(No Transcript)
19
Divisive Clustering
  • Divisive clustering begins with the whole data
    set as a cluster, and considers dividing it into
    k clusters.
  • Usually this is done to optimize some criterion
    such as the ratio of the within cluster variation
    to the between cluster variation
  • The choice of k is important
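
One way to look at such a criterion in R is through the sums of squares that kmeans() reports. The sketch below is only an illustration: the range of k from 2 to 6, the seed, and nstart = 20 are arbitrary choices.

```r
set.seed(1)   # kmeans() starts from randomly chosen centers
X <- iris[, 1:4]

# Ratio of within-cluster to between-cluster sum of squares for each k
ratio <- sapply(2:6, function(k) {
  km <- kmeans(X, centers = k, nstart = 20)
  km$tot.withinss / km$betweenss
})
names(ratio) <- 2:6

# For any single fit the two parts add up to the total sum of squares
km3 <- kmeans(X, centers = 3, nstart = 20)
parts_add_up <- isTRUE(all.equal(km3$tot.withinss + km3$betweenss, km3$totss))
```

The ratio typically keeps shrinking as k grows, so it cannot simply be minimized over k; this is one reason the choice of k needs separate thought.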

20
  • K-means is a widely used divisive algorithm (R
    command kmeans)
  • Its major weakness is that it uses Euclidean
    distance
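  • Some other routines in R for dividing data into
    k clusters include pam and fanny in the cluster
    package (library(cluster)); diana in the same
    package does divisive hierarchical clustering,
    whereas agnes is agglomerative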

21
> iris.km <- kmeans(iris[,1:4],3)
> plot(prcomp(iris[,1:4])$x,col=iris.km$cluster)
>
> table(iris.km$cluster,iris[,5])

    setosa versicolor virginica
  1      0         48        14
  2      0          2        36
  3     50          0         0
>
22
(No Transcript)
23
  • Model-based clustering methods allow use of more
    flexible shape matrices. One such package is
    mclust, which needs to be downloaded from CRAN
  • Functions in this package include Mclust (simpler
    to use) and EMclust (more flexible)
  • Other excellent software is EMMIX from Geoff
    McLachlan at the University of Queensland.

24
> data(iris)
> mc.obj <- Mclust(iris[,1:4])
> plot.Mclust(mc.obj,iris[,1:4])
make a plot selection (0 to exit):

1: plot: BIC
2: plot: Pairs
3: plot: Classification (2-D projection)
4: plot: Uncertainty (2-D projection)
5: plot: All

Selection: 1
Hit <Return> to see next plot:
25
(No Transcript)
26
(No Transcript)
27
(No Transcript)
28
The following models are compared in 'Mclust':
  • "E" for spherical, equal variance (one-dimensional)
  • "V" for spherical, variable variance (one-dimensional)
  • "EII" spherical, equal volume
  • "VII" spherical, unequal volume
  • "EEI" diagonal, equal volume, equal shape
  • "VVI" diagonal, varying volume, varying shape
  • "EEE" ellipsoidal, equal volume, shape, and orientation
  • "VVV" ellipsoidal, varying volume, shape, and orientation
'Mclust' is intended to combine 'EMclust' and its
'summary' in a simplified one-step model-based
clustering function. The latter provide more
flexibility including choice of models.
29
> names(mc.obj)
 [1] "BIC"            "bic"            "classification" "uncertainty"
 [5] "n"              "d"              "G"              "z"
 [9] "mu"             "sigma"          "pro"            "loglik"
[13] "modelName"
> mc.obj$bic
    VVV,2
-574.0178
> mc.obj$BIC
         EII        VII        EEI        VVI       EEE       VVV
1 -1804.0854 -1804.0854 -1527.1308 -1527.1308 -829.9782 -829.9782
2 -1123.4115 -1012.2352 -1047.9786  -867.5728 -688.0972 -574.0178
3  -878.7652  -853.8133  -818.0635  -759.6687 -632.9658 -580.8400
4  -784.3098  -783.8263  -740.4955  -725.1132 -591.4097 -628.9642
5  -734.3863  -746.9928  -699.4019  -725.9635 -604.9287 -683.8194
6  -715.7147  -705.7813  -698.8104  -726.9666 -621.8183 -711.5716
7  -700.3690  -705.0659  -688.4205  -729.2289 -613.4585 -752.7982
8  -686.0964  -710.5799  -666.0947  -741.9637 -622.4215 -790.7023
9  -694.5239  -703.3490  -683.6092  -772.4925 -638.2063 -824.8824
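
The single value reported by mc.obj$bic is the largest entry of the BIC table (mclust defines BIC so that larger is better), and its name gives the model and the number of components. The base-R sketch below retypes the table by hand into a matrix to show the lookup; the object name bic_tab is made up for this example.

```r
# BIC values copied from the output on this slide:
# rows = number of components, columns = covariance model
bic_tab <- matrix(
  c(-1804.0854, -1804.0854, -1527.1308, -1527.1308, -829.9782, -829.9782,
    -1123.4115, -1012.2352, -1047.9786,  -867.5728, -688.0972, -574.0178,
     -878.7652,  -853.8133,  -818.0635,  -759.6687, -632.9658, -580.8400,
     -784.3098,  -783.8263,  -740.4955,  -725.1132, -591.4097, -628.9642,
     -734.3863,  -746.9928,  -699.4019,  -725.9635, -604.9287, -683.8194,
     -715.7147,  -705.7813,  -698.8104,  -726.9666, -621.8183, -711.5716,
     -700.3690,  -705.0659,  -688.4205,  -729.2289, -613.4585, -752.7982,
     -686.0964,  -710.5799,  -666.0947,  -741.9637, -622.4215, -790.7023,
     -694.5239,  -703.3490,  -683.6092,  -772.4925, -638.2063, -824.8824),
  nrow = 9, byrow = TRUE,
  dimnames = list(as.character(1:9),
                  c("EII", "VII", "EEI", "VVI", "EEE", "VVV"))
)

# Locate the largest BIC and read off model and number of components
best       <- which(bic_tab == max(bic_tab), arr.ind = TRUE)
best_model <- colnames(bic_tab)[best[1, "col"]]   # "VVV"
best_G     <- rownames(bic_tab)[best[1, "row"]]   # "2"
```

So the selected fit has two ellipsoidal components with varying volume, shape, and orientation, even though iris contains three species.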
30
Clustering Genes
  • Clustering genes is relatively easy, in the sense
    that we treat an experiment with 60 arrays and
    9,000 genes as if the sample size were 9,000 and
    the dimension 60
  • Extreme care should be taken in selection of the
    explicit or implicit distance function, so that
    it corresponds to the biological intent
  • This is used to find similar genes, identify
    putative co-regulation, and reduce dimension by
    replacing a group of genes by the average
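
A scaled-down sketch of this workflow in R, using simulated data in place of a real experiment (200 genes and 12 arrays instead of 9,000 and 60; the two expression patterns, the noise level, Euclidean distance, average linkage, and k = 2 are all assumptions made for illustration).

```r
set.seed(42)
n_genes  <- 200
n_arrays <- 12

# Two invented expression patterns across the arrays
pattern_a <- sin(seq(0, 2 * pi, length.out = n_arrays))
pattern_b <- cos(seq(0, 2 * pi, length.out = n_arrays))

# Rows are genes, columns are arrays: genes are the "observations"
expr <- rbind(
  t(replicate(n_genes / 2, pattern_a + rnorm(n_arrays, sd = 0.3))),
  t(replicate(n_genes / 2, pattern_b + rnorm(n_arrays, sd = 0.3)))
)

gene_d  <- dist(expr)                      # distance between gene profiles
gene_hc <- hclust(gene_d, method = "average")
gene_cl <- cutree(gene_hc, k = 2)

# Dimension reduction: replace each gene cluster by its average profile
cluster_means <- rowsum(expr, gene_cl) / as.vector(table(gene_cl))
```

The result has one row per gene cluster instead of one row per gene. With real data the distance function would need to be chosen to match the biological question, as the slide stresses.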

31
Clustering Samples
  • This is much more difficult, since we are using
    the sample size of 60 and dimension of 9,000
  • K-means and hierarchical clustering can work here
  • Model-based clustering requires substantial
    dimension reduction either by gene selection or
    use of PCA or similar methods
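
A rough sketch of PCA-based dimension reduction before clustering samples, again on simulated data (20 samples, 500 genes, two invented groups that differ in 25 genes; keeping 3 components is an arbitrary choice). K-means is used on the reduced scores here; a model-based routine could be given the same scores instead.

```r
set.seed(7)
n_samples <- 20
n_genes   <- 500
group     <- rep(1:2, each = 10)   # hypothetical true groups

# Rows are samples, columns are genes
expr <- matrix(rnorm(n_samples * n_genes), nrow = n_samples)
# Shift 25 genes upward in the second group
expr[group == 2, 1:25] <- expr[group == 2, 1:25] + 2

# PCA on the samples, then keep only the first few components
pca    <- prcomp(expr)
scores <- pca$x[, 1:3]

# Cluster the samples in the reduced space
km <- kmeans(scores, centers = 2, nstart = 20)
agreement <- table(km$cluster, group)
```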