The Elements of Statistical Learning, Hastie et al.
Chapter 14 UNSUPERVISED LEARNING - Episode 2 of 3
14.3 Cluster Analysis
Unsupervised learning
- Association rules: construct simple descriptions of regions of high density, in the special case of high-dimensional binary-valued data - Episode 1, previous Friday
- Cluster analysis: find multiple convex regions that contain modes of P(X) (mixture modeling) - Episode 2, TODAY
- Principal components, multidimensional scaling, self-organizing maps, principal curves: identify low-dimensional manifolds that represent high data density, giving information about the associations among the variables - Episode 3, next (and last!) Friday
- Measure of success: heuristic
14.3 Cluster Analysis - Agenda
3.0 Introduction
Dissimilarity measurement:
3.1 Proximity matrices
3.2 Dissimilarities based on attributes
3.3 Object dissimilarity
Algorithms:
3.4 Clustering algorithms
3.5 Combinatorial algorithms
3.6 K-means
3.7 Gaussian mixtures as soft K-means
3.8 Example
3.9 Vector quantization
3.10 K-medoids
3.11 Practical issues
3.12 Hierarchical clustering
3.0 Introduction
- GOAL: group or segment a collection of objects into subsets (clusters) such that objects within a cluster are more closely related to each other than to objects in different clusters
- This requires a notion of degree of (dis)similarity between individual objects, i.e. a measure of (dis)similarity
3.1 Proximity matrices
- Data: a proximity (similarity) or difference (dissimilarity) matrix D (N x N) for N objects, with d_ii' the "proximity" between objects i and i'
- Most algorithms expect a matrix of DISSIMILARITIES, with d_ii' >= 0 and d_ii = 0 for all i. If SIMILARITIES are given, convert them with a suitable monotone-decreasing function (a sketch follows this slide).
- Symmetric dissimilarity matrices are assumed; if D is not symmetric, replace it by (D + D^T)/2. The triangle inequality d_ii' <= d_ik + d_ki' generally does not hold, so dissimilarities are seldom distances.
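Not from the slides, just a minimal numpy sketch of the two conversions mentioned above: turning a similarity matrix into dissimilarities with a monotone-decreasing map (max(S) - S is one simple choice, not prescribed by the text) and symmetrizing with (D + D^T)/2.

```python
import numpy as np

def to_symmetric_dissimilarity(S):
    """Convert a similarity matrix S into a symmetric dissimilarity matrix."""
    D = S.max() - S            # one possible monotone-decreasing map (assumption)
    D = (D + D.T) / 2.0        # symmetrize as suggested on the slide
    np.fill_diagonal(D, 0.0)   # enforce d_ii = 0
    return D
```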
3.2 Dissimilarities based on attributes
- Often we have measurements x_ij, with i = 1,...,N (objects) and j = 1,...,p (attributes)
- Construct pairwise dissimilarities in two stages: (1) d_j(x_ij, x_i'j), the dissimilarity between i and i' on the jth attribute; (2) D(x_i, x_i'), the overall dissimilarity between i and i'
- Most common choice: d_j(x_ij, x_i'j) = (x_ij - x_i'j)^2 (squared distance); see the sketch below
- Alternatives depend on the variable type
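As an illustration (the function name and toy data are mine), a small numpy sketch of the most common choice above: squared-distance attribute dissimilarities summed over the p attributes.

```python
import numpy as np

def squared_dissimilarity(X):
    """D(x_i, x_i') = sum_j (x_ij - x_i'j)^2 for an N x p data matrix X."""
    diff = X[:, None, :] - X[None, :, :]   # shape (N, N, p): x_ij - x_i'j
    return (diff ** 2).sum(axis=-1)        # sum over the p attributes

X = np.array([[0.0, 1.0], [2.0, 1.0], [0.0, 4.0]])
print(squared_dissimilarity(X))            # symmetric, zero diagonal
```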
Alternatives (1)
- Quantitative variables: d_j(x_ij, x_i'j) = l(|x_ij - x_i'j|), with l a monotone-increasing function. Clustering can also be based on correlation; for standardized inputs, clustering on correlation (a similarity) is equivalent to clustering on squared distance (a dissimilarity).
- Ordinal variables: replace the M ordered values by (i - 1/2)/M, i = 1,...,M, and treat the result as a quantitative variable.
Alternatives (2)
- Categorical variables: the degree of difference between pairs of values must be specified explicitly. With M values, use a symmetric M x M loss matrix L with L_rr' = L_r'r, L_rr = 0 and L_rr' >= 0; the common choice is L_rr' = 1 for r != r'. Unequal losses can be used to emphasize some errors more than others.
3.3 Object dissimilarity
- Now combine the p individual attribute dissimilarities d_j(x_ij, x_i'j) into a single measure d_ii' = D(x_i, x_i')
- Done by a weighted average (convex combination): D(x_i, x_i') = sum_j w_j * d_j(x_ij, x_i'j), with sum_j w_j = 1
- Setting w_j = 1/p does NOT necessarily give equal influence to all attributes
Weights (1)
The influence of X_j on D(x_i, x_i') depends on its relative contribution w_j * d̄_j to the average object dissimilarity
D̄ = (1/N^2) * sum_i sum_i' D(x_i, x_i') = sum_j w_j * d̄_j,
where d̄_j = (1/N^2) * sum_i sum_i' d_j(x_ij, x_i'j) is the average dissimilarity on the jth attribute. The relative influence of the jth attribute is therefore w_j * d̄_j, and all attributes receive equal influence if w_j is proportional to 1/d̄_j.
Weights (2)
With p quantitative variables and squared-error dissimilarity,
d̄_j = (1/N^2) * sum_i sum_i' (x_ij - x_i'j)^2 = 2 * var_j,
where var_j is the sample variance of X_j. With w_j = 1/p, the relative influence of each variable is therefore proportional to its variance over the data set (numerical check below).
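A quick numerical check of the identity above, on toy data of my own (not from the slides): for a single quantitative attribute, the average pairwise squared dissimilarity equals twice its sample variance.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(scale=3.0, size=500)               # one quantitative attribute

d_bar = ((x[:, None] - x[None, :]) ** 2).mean()   # (1/N^2) sum_i sum_i' (x_i - x_i')^2
print(d_bar, 2 * x.var())                         # d_bar_j = 2 * var_j
```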
3.4 Clustering algorithms
- GOAL: partition the observations into clusters such that pairwise dissimilarities between objects in the same cluster are SMALLER than those between objects in different clusters
- 3 types:
- Combinatorial algorithms: work directly on the observed data, with no direct reference to an underlying probability model
- Mixture modeling: the data are assumed to be an i.i.d. sample from a population described by a density function that is a mixture (fitted by maximum likelihood or Bayesian methods)
- Mode seekers: seek modes of the density function (nonparametric); example: PRIM
3.5 Combinatorial algorithms
Directly assign each observation i (i = 1,...,N) to a cluster k (k = 1,...,K), with K < N pre-specified: a "many-to-one mapping" (encoder) k = C(i). We seek the particular encoder C that achieves the goal, based on the dissimilarities d(x_i, x_i'), by minimizing a loss function. The natural loss ("energy") function is the "within-cluster" point scatter:
W(C) = (1/2) * sum_{k=1}^{K} sum_{C(i)=k} sum_{C(i')=k} d(x_i, x_i')
Scatter decomposition
T = W(C) + B(C), where
T = (1/2) * sum_i sum_i' d(x_i, x_i') is the "total" point scatter (constant, independent of C),
W(C) is the "within-cluster" point scatter, and
B(C) = (1/2) * sum_{k=1}^{K} sum_{C(i)=k} sum_{C(i')!=k} d(x_i, x_i') is the "between-cluster" point scatter.
Minimizing W(C) is therefore equivalent to maximizing B(C) (numerical check below). Seeking over all possible assignments is impractical: the number of distinct assignments is the Stirling number S(N, K), e.g. S(10, 4) = 34,105.
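A small numerical check of the decomposition T = W(C) + B(C) for an arbitrary encoder C, with squared-distance dissimilarities on toy data (nothing here beyond the formulas above; the data are made up).

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(10, 2))
C = rng.integers(0, 4, size=10)                          # an arbitrary encoder C(i)

d = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)  # squared-distance dissimilarities
same = C[:, None] == C[None, :]                          # pairs in the same cluster

T = 0.5 * d.sum()                                        # total point scatter
W = 0.5 * d[same].sum()                                  # within-cluster point scatter
B = 0.5 * d[~same].sum()                                 # between-cluster point scatter
print(np.isclose(T, W + B))                              # True: min W(C) <=> max B(C)
```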
3.6 K-means
Because the combinatorial search over all encoders is infeasible, practical algorithms use iterative greedy descent. K-means is one of the most popular: the data are quantitative variables and the dissimilarity is the squared Euclidean distance d(x_i, x_i') = ||x_i - x_i'||^2. The within-cluster point scatter then becomes
W(C) = (1/2) * sum_{k=1}^{K} sum_{C(i)=k} sum_{C(i')=k} ||x_i - x_i'||^2 = sum_{k=1}^{K} N_k * sum_{C(i)=k} ||x_i - x̄_k||^2,
where x̄_k is the mean vector of cluster k and N_k the number of observations in cluster k.
We seek the encoder
C* = argmin_C sum_{k=1}^{K} N_k * sum_{C(i)=k} ||x_i - x̄_k||^2.
This corresponds to the enlarged optimization problem
min over C and {m_1,...,m_K} of sum_{k=1}^{K} N_k * sum_{C(i)=k} ||x_i - m_k||^2,
since, for any set of observations S,
x̄_S = argmin_m sum_{i in S} ||x_i - m||^2.
K-means algorithm
Step 1: for a given assignment C, minimize the criterion with respect to {m_1,...,m_K}; this yields the means of the currently assigned clusters, m_k = x̄_k.
Step 2: given a current set of means m_1,...,m_K, minimize the criterion by assigning each observation to the closest current mean: C(i) = argmin_k ||x_i - m_k||^2.
Iterate steps 1 and 2 until the assignments do not change. Convergence is guaranteed because both steps reduce the value of the criterion. A minimal sketch follows.
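A minimal numpy sketch of the two-step algorithm above. Initialising with K random observations is my choice, not part of the slide, and an empty cluster simply keeps its previous mean.

```python
import numpy as np

def kmeans(X, K, n_iter=100, seed=0):
    """Alternate the two K-means steps until the assignments stop changing."""
    rng = np.random.default_rng(seed)
    m = X[rng.choice(len(X), size=K, replace=False)]  # initial means: K random observations
    C = np.full(len(X), -1)
    for _ in range(n_iter):
        # Step 2: assign each observation to the closest current mean
        d2 = ((X[:, None, :] - m[None, :, :]) ** 2).sum(axis=-1)
        C_new = d2.argmin(axis=1)
        if np.array_equal(C_new, C):                  # assignments unchanged: converged
            break
        C = C_new
        # Step 1: given C, the criterion is minimized by the cluster means
        m = np.array([X[C == k].mean(axis=0) if (C == k).any() else m[k] for k in range(K)])
    return C, m
```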
K-means algorithm - Example (figure)
3.7 Gaussian mixtures as soft K-means
Relation between K-means and the EM algorithm for Gaussian mixtures: take K mixture components, each a Gaussian density with scalar covariance matrix σ²I. The relative density under each mixture component is then a monotone function of the Euclidean distance between the data point and the mixture center, and the E-step produces probabilistic assignments ("responsibilities"). As σ² → 0, the responsibilities become hard 0/1 assignments and the two methods coincide (sketch below).
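A sketch of the E-step responsibilities for such a mixture, assuming equal mixing proportions (my simplification, not stated on the slide); lowering σ² shows the assignments hardening towards K-means.

```python
import numpy as np

def responsibilities(X, centers, sigma2):
    """E-step for a Gaussian mixture with covariance sigma2 * I and equal mixing weights."""
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
    logp = -d2 / (2.0 * sigma2)
    logp -= logp.max(axis=1, keepdims=True)           # for numerical stability
    r = np.exp(logp)
    return r / r.sum(axis=1, keepdims=True)

X = np.array([[0.0, 0.0], [1.0, 0.2], [5.0, 5.0]])
centers = np.array([[0.0, 0.0], [5.0, 5.0]])
print(responsibilities(X, centers, sigma2=1.0))       # soft assignments
print(responsibilities(X, centers, sigma2=0.01))      # nearly hard 0/1: the K-means limit
```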
3.8 Human tumor microarray data
Data: 6830 (genes) x 64 (samples); the samples are breast, melanoma, ... (types of cancer). K-means is run for K = 1 to 10, and the total within-cluster sum of squares is reported for each K (table, cf. p. 464 of the book).
3.9 Vector quantization
K-means is used in image and signal compression. Example: Fisher's picture, 1024 x 1024 pixels, grey scale 0 to 255.
Procedure (sketch below):
1. break the image into small 2 x 2 blocks;
2. regard each of the 512 x 512 blocks of 4 numbers as a vector in R^4;
3. run K-means clustering with K = 200 and with K = 4.
Storage: entire picture about 1 MB; compression 1 (K = 200) about 239 kB; compression 2 (K = 4) 62.5 kB.
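A naive sketch of the procedure: block extraction plus K-means over the blocks. It reuses the `kmeans` sketch from the K-means slide, defaults to K = 4, and is not tuned for a full 1024 x 1024 image; the names are placeholders.

```python
import numpy as np
# reuses the kmeans(X, K) sketch from the K-means slide

def vector_quantize(img, block=2, K=4):
    """Cut a greyscale image into block x block patches, cluster them in R^(block^2),
    and rebuild the image from the K codebook (centroid) vectors."""
    h, w = img.shape
    patches = (img.astype(float)
                  .reshape(h // block, block, w // block, block)
                  .transpose(0, 2, 1, 3)
                  .reshape(-1, block * block))
    code, codebook = kmeans(patches, K)               # each patch -> index of its centroid
    approx = codebook[code]                           # replace each patch by its centroid
    return (approx.reshape(h // block, w // block, block, block)
                  .transpose(0, 2, 1, 3)
                  .reshape(h, w))
```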
3.10 K-medoids
K-means is appropriate when the dissimilarity is the squared Euclidean distance; this requires quantitative variables and lacks robustness (it is sensitive to outliers). Since the squared Euclidean distance enters only through the minimization steps, the procedure can be generalized: K-medoids accepts any dissimilarity (no attribute data needed) and replaces the means by cluster centers restricted to be one of the observations in the cluster (medoids). K-medoids is more computationally intensive than K-means.
K-medoids algorithm
Step 1: for a given assignment C, find in each cluster the observation minimizing the total dissimilarity to the other points of that cluster: i*_k = argmin_{i: C(i)=k} sum_{C(i')=k} d(x_i, x_i'). Then m_k = x_{i*_k}, k = 1,...,K, are the current estimates of the cluster centers.
Step 2: given a current set of centers m_1,...,m_K, assign each observation to the closest current center: C(i) = argmin_k d(x_i, m_k).
Iterate steps 1 and 2 until the assignments do not change. A sketch follows.
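A sketch working directly from an N x N dissimilarity matrix D; random medoid initialisation is my choice.

```python
import numpy as np

def kmedoids(D, K, n_iter=100, seed=0):
    """Alternate the two K-medoids steps on a dissimilarity matrix D until convergence."""
    rng = np.random.default_rng(seed)
    medoids = rng.choice(len(D), size=K, replace=False)   # indices of the current centers
    C = np.full(len(D), -1)
    for _ in range(n_iter):
        # Step 2: assign each observation to the closest current medoid
        C_new = D[:, medoids].argmin(axis=1)
        if np.array_equal(C_new, C):
            break
        C = C_new
        # Step 1: within each cluster, the center is the observation minimizing
        # the total dissimilarity to the other members
        for k in range(K):
            members = np.flatnonzero(C == k)
            if members.size:
                medoids[k] = members[D[np.ix_(members, members)].sum(axis=1).argmin()]
    return C, medoids
```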
K-medoids example: country dissimilarities (3-medoids solution and multidimensional scaling plot)
3.11 Practical issues
- Initial centers or an initial encoder must be chosen before running K-means or K-medoids
- The choice of K depends on the goal (one may or may not have an idea of K in advance)
- Estimating K: examine W_K for different values of K (W_K decreases as K increases, cf. the microarray example)
- Gap statistic for choosing K: compare the curve of log W_K to the curve obtained from data uniformly distributed over a rectangle containing the data; the optimal K is where the gap between the two curves is largest (sketch below)
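A rough sketch of the gap idea, using the K-means criterion as a stand-in for W_K and reusing the `kmeans` sketch from the K-means slide; the number of reference draws B and the uniform box are my choices, not from the slides.

```python
import numpy as np
# reuses the kmeans(X, K) sketch from the K-means slide

def gap_curve(X, K_range, B=20, seed=0):
    """For each K: mean log W_K on uniform reference data minus log W_K on the data."""
    rng = np.random.default_rng(seed)
    lo, hi = X.min(axis=0), X.max(axis=0)

    def log_wk(Z, K):
        C, m = kmeans(Z, K)
        return np.log(((Z - m[C]) ** 2).sum())        # K-means criterion as a W_K proxy

    gaps = []
    for K in K_range:
        ref = np.mean([log_wk(rng.uniform(lo, hi, size=X.shape), K) for _ in range(B)])
        gaps.append(ref - log_wk(X, K))
    return np.array(gaps)                             # choose K where the gap is largest
```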
Gap statistic: simulated example (figure)
Simulated example: original data (figure)
3.12 Hierarchical clustering
The results of K-means and K-medoids depend on the choice of K and on the starting configuration/assignment. Hierarchical clustering does not require those specifications; we only need a measure of dissimilarity between (disjoint) GROUPS of observations, based on the pairwise dissimilarities among the observations in the two groups. This produces hierarchical representations.
Two types of strategies:
- agglomerative (bottom-up): merge the 2 groups with the smallest intergroup dissimilarity
- divisive (top-down): split one group into 2 groups with the largest between-group dissimilarity
Both produce N-1 levels in the hierarchy. When to stop? The gap statistic can be used in the decision. Recursive binary splitting/agglomeration can be represented by a rooted binary tree.
Monotonicity property (all agglomerative methods, some divisive): the dissimilarity between merged clusters is monotone increasing with the level of the merger. The height of each node can therefore be drawn proportional to the intergroup dissimilarity between its two daughters: a dendrogram.
Dendrogram for microarray data (figure)
Agglomerative clustering
START: every observation is a singleton cluster.
EACH STEP (N-1 steps): the two closest clusters are merged, giving one less cluster.
Measure of "closeness" between clusters?
- Single linkage: nearest neighbour
- Complete linkage: furthest neighbour
- Group average
Measure of closeness
Let G and H be 2 clusters; d(G, H) is the dissimilarity between G and H, computed from the d_ii' with i in G and i' in H:
- Single linkage (SL): d_SL(G, H) = min_{i in G, i' in H} d_ii'
- Complete linkage (CL): d_CL(G, H) = max_{i in G, i' in H} d_ii'
- Group average (GA): d_GA(G, H) = (1/(N_G * N_H)) * sum_{i in G} sum_{i' in H} d_ii'
A sketch of the three measures follows.
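The three measures written directly against a pairwise dissimilarity matrix (a sketch; in practice `scipy.cluster.hierarchy.linkage` offers the same choices via its 'single', 'complete' and 'average' methods).

```python
import numpy as np

def group_dissimilarity(D, G, H, linkage="single"):
    """d(G, H) computed from the pairwise d_ii' with i in G and i' in H."""
    block = D[np.ix_(list(G), list(H))]
    if linkage == "single":        # nearest neighbour: min over the block
        return block.min()
    if linkage == "complete":      # furthest neighbour: max over the block
        return block.max()
    return block.mean()            # group average: mean over the block
```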
Dendrograms for microarray data (figure)
If the data d_ii' exhibit a strong clustering tendency, all 3 methods produce similar results.
SL requires only a single dissimilarity to be small; the drawback is that the resulting clusters can violate the "compactness" property and have large diameters.
CL is the opposite extreme: it produces compact clusters with small diameters, but can violate the "closeness" property.
GA is a compromise: it attempts to produce clusters that are relatively compact and relatively far apart. BUT it depends on the dissimilarity scale: if h_ii' = h(d_ii') with h a strictly increasing monotone function, the results for d_GA can change (SL and CL depend only on the ordering of the d_ii' and are unaffected). Moreover, GA has a statistical consistency property that SL and CL lack.
Divisive clustering
START: the entire data set is one cluster.
EACH STEP (N-1 steps): split one cluster into 2 clusters. HOW?
- Recursive application of K-means or K-medoids with K = 2, or:
- 1) Start with all observations of the cluster in a single group G.
  2) Choose the observation whose average dissimilarity from all the other observations is largest.
  3) Put this observation into H, the second group.
  4) Transfer to H the observation with the largest difference between its average distance from H and its average distance from the observations remaining in G.
  5) Stop when this difference becomes negative.
(A sketch of this splinter-group split follows.)
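A sketch of one such split. It follows the usual splinter-group reading of steps 4-5 (an observation moves to H when it is, on average, closer to H than to the rest of G); the helper name and the use of a full dissimilarity matrix are my choices.

```python
import numpy as np

def split_cluster(D, G):
    """Split the cluster with member indices G into two groups, given the N x N matrix D."""
    G = list(G)
    # steps 1-3: seed H with the member whose average dissimilarity to the others is largest
    avg = D[np.ix_(G, G)].sum(axis=1) / (len(G) - 1)
    H = [G.pop(int(avg.argmax()))]
    # steps 4-5: keep transferring observations that are on average closer to H than to
    # the remaining members of G; stop when no such observation is left
    while len(G) > 1:
        to_G = D[np.ix_(G, G)].sum(axis=1) / (len(G) - 1)
        to_H = D[np.ix_(G, H)].mean(axis=1)
        gain = to_G - to_H
        best = int(gain.argmax())
        if gain[best] <= 0:
            break
        H.append(G.pop(best))
    return G, H
```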
Unsupervised learning
- Association rules: Episode 1, previous Friday
- Cluster analysis: Episode 2, TODAY
- Principal components, multidimensional scaling, self-organizing maps, principal curves: Episode 3, next (and last!) Friday