Title: Clustering
1. Clustering
2. Outline
- Introduction
- K-means clustering
- Probabilistic clustering: Gaussian mixture models
- Hierarchical clustering: COBWEB
3. Classification vs. Clustering
Classification (supervised learning): learns a method for predicting the instance class from pre-labeled (classified) instances
4. Clustering
Clustering (unsupervised learning): finds a natural grouping of instances given unlabeled data
5. Examples of Clustering Applications
- Marketing: discover customer groups and use them for targeted marketing and re-organization
- Astronomy: find groups of similar stars and galaxies
- Earthquake studies: observed earthquake epicentres should be clustered along continental faults
- Genomics: find groups of genes with similar patterns of expression
6. Clustering Methods
- Many different methods and algorithms
- For numeric and/or symbolic data
- Deterministic vs. probabilistic (hard vs. soft)
- Exclusive vs. overlapping
- Hierarchical vs. flat
- Top-down vs. bottom-up
7. Clusters: exclusive vs. overlapping
Simple 2-D representation: non-overlapping clusters
Venn diagram: overlapping clusters
8. Clustering Evaluation
- Manual inspection
- Benchmarking on existing labels (though why should unsupervised learning discover groupings based on those particular labels?)
- Cluster quality measures
- distance measures
- high similarity within a cluster, low similarity across clusters
9. The distance function
- Simplest case: one numeric attribute A
- Distance(X,Y) = |A(X) − A(Y)|
- Several numeric attributes
- Distance(X,Y) = Euclidean distance between X and Y
- Nominal attributes: distance is set to 1 if values are different, 0 if they are equal
- Are all attributes equally important?
- Weighting the attributes might be necessary (see the sketch below)
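A minimal Python sketch of such a distance function, assuming instances are plain lists of attribute values and that the caller says which attributes are numeric and supplies optional weights (all names here are illustrative):

```python
import math

def distance(x, y, numeric, weights=None):
    """Mixed-attribute distance: absolute difference for numeric attributes,
    0/1 mismatch for nominal ones, combined Euclidean-style with optional weights."""
    weights = weights or [1.0] * len(x)
    total = 0.0
    for a, b, is_num, w in zip(x, y, numeric, weights):
        d = (a - b) if is_num else (0.0 if a == b else 1.0)
        total += w * d * d
    return math.sqrt(total)

print(distance([3.0], [7.5], numeric=[True]))                           # one numeric attribute: |A(X) - A(Y)| = 4.5
print(distance([1.0, "sunny"], [4.0, "rainy"], numeric=[True, False]))  # mixed numeric and nominal attributes
```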
10. Simple Clustering: K-means
- Works with numeric data only
- Pick a number (K) of cluster centres
- Initialise the cluster centre positions (at random)
- Assign every instance to its nearest cluster centre (e.g. using Euclidean distance)
- Move each cluster centre to the mean of its assigned items
- Repeat the assignment and update steps until convergence (no change in cluster assignments); a sketch follows below
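A compact NumPy sketch of this loop, assuming numeric data in a 2-D array and a user-chosen K; it illustrates the steps above rather than any particular library implementation:

```python
import numpy as np

def kmeans(X, k, seed=0, max_iter=100):
    rng = np.random.default_rng(seed)
    # initialise centres at random (here: k distinct data points)
    centres = X[rng.choice(len(X), size=k, replace=False)].copy()
    assign = None
    for _ in range(max_iter):
        # assignment step: nearest centre for every instance (Euclidean distance)
        dists = np.linalg.norm(X[:, None, :] - centres[None, :, :], axis=2)
        new_assign = dists.argmin(axis=1)
        if assign is not None and np.array_equal(new_assign, assign):
            break  # converged: no change in cluster assignments
        assign = new_assign
        # update step: move each centre to the mean of its assigned items
        for j in range(k):
            if np.any(assign == j):
                centres[j] = X[assign == j].mean(axis=0)
    return centres, assign

X = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [5.2, 4.9]])
print(kmeans(X, k=2))
```

Empty clusters are handled crudely here (such a centre simply keeps its old position); real implementations usually re-seed them.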
11. K-means example, step 1
Pick 3 initial cluster centers (randomly)
12. K-means example, step 2
Assign each point to the closest cluster center
13. K-means example, step 3
Move each cluster center to the mean of its cluster
14. K-means example, step 4
Reassign points that are now closest to a different cluster center. Q: Which points are reassigned?
15. K-means example, step 4 (continued)
A: three points (highlighted in the animation)
16. K-means example, step 4b
Re-compute cluster means
17. K-means example, step 5
Move cluster centers to cluster means
18. Discussion
- Result can vary significantly depending on the initial choice of centres
- Can get trapped in a local minimum
- Example
- To increase the chance of finding the global optimum, restart with different random seeds (see the sketch below)
- Use the total distance from instances to their corresponding centres as the error measure
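A sketch of the restart idea, reusing the kmeans function from the sketch above (a hypothetical helper, not a library call) and keeping the run with the lowest total distance from instances to their centres:

```python
import numpy as np

def total_distance(X, centres, assign):
    # error measure: total distance from instances to their corresponding centres
    return float(np.linalg.norm(X - centres[assign], axis=1).sum())

def kmeans_restarts(X, k, n_restarts=10):
    best = None
    for seed in range(n_restarts):                  # different random seeds
        centres, assign = kmeans(X, k, seed=seed)   # kmeans as sketched on slide 10
        err = total_distance(X, centres, assign)
        if best is None or err < best[0]:
            best = (err, centres, assign)
    return best                                     # (error, centres, assignments) of the best run
```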
19. K-means clustering summary
- Advantages
- simple, understandable
- fast to converge
- instances automatically assigned to clusters.
- Disadvantages
- must pick number of clusters beforehand
- all items forced into a cluster
- too sensitive to outliers
- no interpretation of error measure.
20. K-means variations
- K-medoids: instead of the mean, use the median of each cluster
- Mean of 1, 3, 5, 7, 9 is 5
- Mean of 1, 3, 5, 7, 1009 is 205
- Median of 1, 3, 5, 7, 1009 is 5
- Median advantage: not affected by extreme values (checked in the snippet below)
- For large databases, use sampling
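The arithmetic above can be checked directly with Python's statistics module:

```python
import statistics

cluster = [1, 3, 5, 7, 1009]             # a cluster containing one extreme value
print(statistics.mean([1, 3, 5, 7, 9]))  # 5
print(statistics.mean(cluster))          # 205 -- dragged towards the outlier
print(statistics.median(cluster))        # 5   -- unaffected by the extreme value
```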
21. Probabilistic Clustering
- The goal of density estimation is to assign a probability density to instances
- Useful for clustering, novelty detection, classification and estimating missing values
- Allows for soft cluster membership in the range [0, 1]
- Focus is on mixture models; kernel density estimation is another popular approach
22. Mixture Models
- Write the probability density as a weighted sum of K probability density functions: p(x) = Σ_j P(j) p(x|j) = Σ_j α_j φ_j(x).
- Mixing coefficients satisfy Σ_j P(j) = 1 and 0 ≤ P(j) ≤ 1.
- For many choices of φ_j we can approximate any continuous density to arbitrary accuracy (for large enough K).
- Here we choose each φ_j to be Gaussian with spherical covariance (same variance in all directions); a sketch follows below.
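A short Python sketch of evaluating such a spherical-Gaussian mixture density, assuming the component means, per-component variances and mixing coefficients are given (the names are illustrative):

```python
import numpy as np

def spherical_gaussian(x, mean, var):
    """Gaussian density with covariance var * I (same variance in all directions)."""
    d = len(mean)
    diff = x - mean
    return np.exp(-diff @ diff / (2 * var)) / (2 * np.pi * var) ** (d / 2)

def mixture_density(x, means, variances, coeffs):
    # p(x) = sum_j P(j) p(x|j), with the mixing coefficients P(j) summing to 1
    return sum(a * spherical_gaussian(x, m, v)
               for a, m, v in zip(coeffs, means, variances))

means = [np.array([0.0, 0.0]), np.array([4.0, 4.0])]
variances = [1.0, 2.0]
coeffs = [0.3, 0.7]
print(mixture_density(np.array([1.0, 1.0]), means, variances, coeffs))
```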
23. Training Mixture Models
- Use the negative log likelihood −log p(x) as the error measure.
- Must be careful: p(x) becomes infinite if a component mean matches an instance and the variance → 0.
- If we knew which kernel generated each point, we could just estimate the mean and variance of each kernel from the points it generated.
- We do not know which kernel generated each point.
24. EM algorithm
- EM: Expectation-Maximization, named after its two alternating steps.
- The analogue of finding the nearest centre in K-means is to calculate the posterior probabilities (responsibilities) P(j|x) = p(x|j)P(j)/p(x)
- This represents the probability that component j was responsible for generating x. This is the E-step.
- Now re-estimate the kernel parameters using the responsibilities as weights, w_i = P(j|x_i): μ_j = (w_1 x_1 + … + w_n x_n)/(w_1 + … + w_n). This is the M-step (a sketch follows below).
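A sketch of one EM iteration for this spherical-Gaussian mixture in Python; the variance and mixing-coefficient updates shown here follow the same responsibility-weighted pattern as the mean update on the slide, and all names are illustrative:

```python
import numpy as np

def em_step(X, means, variances, coeffs):
    """One EM iteration for a spherical-Gaussian mixture."""
    n, d = X.shape
    k = len(means)
    # E-step: responsibilities P(j|x_i) = p(x_i|j) P(j) / p(x_i)
    resp = np.zeros((n, k))
    for j in range(k):
        sq = ((X - means[j]) ** 2).sum(axis=1)
        resp[:, j] = coeffs[j] * np.exp(-sq / (2 * variances[j])) \
                     / (2 * np.pi * variances[j]) ** (d / 2)
    resp /= resp.sum(axis=1, keepdims=True)          # normalise by p(x_i)
    # M-step: re-estimate parameters with responsibilities as weights
    for j in range(k):
        w = resp[:, j]
        means[j] = (w[:, None] * X).sum(axis=0) / w.sum()   # mu_j = sum_i w_i x_i / sum_i w_i
        variances[j] = (w * ((X - means[j]) ** 2).sum(axis=1)).sum() / (d * w.sum())
        coeffs[j] = w.sum() / n                              # P(j)
    return means, variances, coeffs

X = np.array([[0.1, 0.0], [0.0, 0.2], [4.1, 3.9], [3.8, 4.2]])
params = ([np.array([0.0, 0.0]), np.array([3.0, 3.0])], [1.0, 1.0], [0.5, 0.5])
for _ in range(20):
    params = em_step(X, *params)
print(params)
```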
25. EM algorithm discussion
- Iterate until convergence (the error measure changes only by a small amount)
- Can be extended to allow for missing values (cf. naïve Bayes); this is a general advantage of many probabilistic models
- Other covariance structures are possible; can also use different distributions (e.g. exponential) or model discrete data (e.g. Bernoulli).
- If σ → 0, the responsibilities become 1 or 0, which gives us K-means back again.
26. Hierarchical clustering
- Bottom up
- Start with single-instance clusters
- At each step, join the two closest clusters
- Design decision: the distance between clusters
- E.g. two closest instances in the clusters vs. distance between the cluster means
- Top down
- Start with one universal cluster
- Find two clusters
- Proceed recursively on each subset
- Can be very fast
- Both methods produce a dendrogram; clusters are formed by cutting the tree (see the sketch below).
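A bottom-up sketch, assuming numeric data and that SciPy is available; method='single' uses the two closest instances as the cluster distance (method='centroid' would use the distance between means), and cutting the dendrogram gives the flat clusters:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [5.2, 4.9], [9.0, 0.5]])

# bottom up: start with single-instance clusters, repeatedly join the two closest
Z = linkage(X, method='single')                    # 'single' = distance of the two closest instances
labels = fcluster(Z, t=3, criterion='maxclust')    # cut the dendrogram into 3 flat clusters
print(labels)
```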
27. Incremental clustering
- Heuristic approach (COBWEB/CLASSIT)
- Form a hierarchy of clusters incrementally
- Start
- tree consists of empty root node
- Then
- add instances one by one
- update tree appropriately at each stage
- to update, find the right leaf for an instance
- may involve restructuring the tree
- Base update decisions on category utility
28. Clustering weather data
Class (no/yes) for information only
29. Clustering weather data
Merge the best host and the runner-up
Consider splitting the best host if merging doesn't help
Restructuring prevents the order of examples from having too much impact
30. Final hierarchy
This is a common disadvantage of incremental algorithms: they find locally good solutions and are affected by the order in which examples are presented.
Oops! a and b are actually very similar.
31. Example: the iris data (subset)
32. Clustering with cutoff
33. Category utility
- Category utility: a quadratic loss function defined on conditional probabilities
- If every instance is in a different category, the numerator reaches its maximum (see the sketch below)
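For reference, a sketch of the category utility formula as it is commonly written for clusters C_1, ..., C_k over nominal attributes a_i with values v_ij (the exact notation of the original slide is not recoverable from this text):

```latex
\mathrm{CU}(C_1,\dots,C_k)
  \;=\; \frac{1}{k}\sum_{l=1}^{k}\Pr[C_l]
        \sum_{i}\sum_{j}\Bigl(\Pr[a_i = v_{ij}\mid C_l]^2 - \Pr[a_i = v_{ij}]^2\Bigr)
```

With every instance in its own cluster, each conditional probability is 0 or 1, so the numerator reaches its maximum, n − Σ_i Σ_j Pr[a_i = v_ij]², where n is the number of attributes; this is the value discussed on the next slide.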
34. Overfitting-avoidance heuristic
- If every instance gets put into a different category, the numerator becomes maximal: n − Σ_i Σ_j P(a_i = v_ij)², where n is the number of attributes.
- So without k in the denominator of the CU formula, every cluster would consist of one instance!
35. Other Clustering Approaches
- Bayesian clustering
- SOM: self-organizing maps
- GTM: Generative Topographic Mapping
- Sammon mapping: a distance-preserving projection
- Neuroscale: a Radial Basis Function distance-preserving map
- ...
36. Discussion
- Can interpret clusters by using supervised learning
- learn a classifier based on the clusters (see the sketch below)
- Decrease dependence between attributes?
- pre-processing step
- E.g. use principal component analysis
- Can be used to fill in missing values
- Key advantage of probabilistic clustering
- Can estimate the likelihood of the data
- Use it to compare different models objectively
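A sketch of that interpretation step, assuming scikit-learn is available: cluster first, then train a decision tree on the cluster labels and read the tree as a description of the clusters (the iris data from slide 31 is used for illustration):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(iris.data)

# a supervised learner trained on the cluster labels describes the clusters
tree = DecisionTreeClassifier(max_depth=2).fit(iris.data, labels)
print(export_text(tree, feature_names=iris.feature_names))
```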
37. Clustering Summary
- unsupervised
- many approaches
- K-means: simple, sometimes useful
- K-medoids is less sensitive to outliers
- Hierarchical clustering works for symbolic attributes
- Evaluation is a problem