Title: Clustering
1. Clustering
2. Outline
- Introduction
- K-means clustering
- Probabilistic clustering: Gaussian mixture models
- Hierarchical clustering: COBWEB
3. Classification vs. Clustering
Classification (supervised learning): learns a method for predicting the instance class from pre-labeled (classified) instances
4. Clustering
Clustering (unsupervised learning): finds a natural grouping of instances given unlabeled data
5. Examples of Clustering Applications
- Marketing: discover customer groups and use them for targeted marketing and re-organization
- Astronomy: find groups of similar stars and galaxies
- Earthquake studies: observed earthquake epicentres should be clustered along continental faults
- Genomics: find groups of genes with similar patterns of expression
6. Clustering Methods
- Many different methods and algorithms
- For numeric and/or symbolic data
- Deterministic vs. probabilistic (hard vs. soft)
- Exclusive vs. overlapping
- Hierarchical vs. flat
- Top-down vs. bottom-up
7. Clusters: exclusive vs. overlapping
Simple 2-D representation: non-overlapping clusters
Venn diagram: overlapping clusters
8. Clustering Evaluation
- Manual inspection
- Benchmarking on existing labels (though why should unsupervised learning discover groupings based on those particular labels?)
- Cluster quality measures
- distance measures
- high similarity within a cluster, low similarity across clusters
9. The distance function
- Simplest case: one numeric attribute A
- Distance(X,Y) = |A(X) − A(Y)|
- Several numeric attributes
- Distance(X,Y) = Euclidean distance between X and Y
- Nominal attributes: distance is set to 1 if values are different, 0 if they are equal
- Are all attributes equally important?
- Weighting the attributes might be necessary (see the sketch below)
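A minimal Python sketch of such a distance function, assuming instances are plain lists of attribute values and that the caller says which attributes are numeric and supplies optional weights (all names here are illustrative):

```python
import math

def distance(x, y, numeric, weights=None):
    """Mixed-attribute distance: absolute difference for numeric attributes,
    0/1 mismatch for nominal ones, combined Euclidean-style with optional weights."""
    weights = weights or [1.0] * len(x)
    total = 0.0
    for a, b, is_num, w in zip(x, y, numeric, weights):
        d = (a - b) if is_num else (0.0 if a == b else 1.0)
        total += w * d * d
    return math.sqrt(total)

print(distance([3.0], [7.5], numeric=[True]))                           # one numeric attribute: |A(X) - A(Y)| = 4.5
print(distance([1.0, "sunny"], [4.0, "rainy"], numeric=[True, False]))  # mixed numeric and nominal attributes
```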
10. Simple Clustering: K-means
- Works with numeric data only
- Pick a number (K) of cluster centres
- Initialise the cluster centre positions (at random)
- Assign every instance to its nearest cluster centre (e.g. using Euclidean distance)
- Move each cluster centre to the mean of its assigned items
- Repeat the assignment and update steps until convergence (no change in cluster assignments); a sketch follows below
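A compact NumPy sketch of this loop, assuming numeric data in a 2-D array and a user-chosen K; it illustrates the steps above rather than any particular library implementation:

```python
import numpy as np

def kmeans(X, k, seed=0, max_iter=100):
    rng = np.random.default_rng(seed)
    # initialise centres at random (here: k distinct data points)
    centres = X[rng.choice(len(X), size=k, replace=False)].copy()
    assign = None
    for _ in range(max_iter):
        # assignment step: nearest centre for every instance (Euclidean distance)
        dists = np.linalg.norm(X[:, None, :] - centres[None, :, :], axis=2)
        new_assign = dists.argmin(axis=1)
        if assign is not None and np.array_equal(new_assign, assign):
            break  # converged: no change in cluster assignments
        assign = new_assign
        # update step: move each centre to the mean of its assigned items
        for j in range(k):
            if np.any(assign == j):
                centres[j] = X[assign == j].mean(axis=0)
    return centres, assign

X = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [5.2, 4.9]])
print(kmeans(X, k=2))
```

Empty clusters are handled crudely here (such a centre simply keeps its old position); real implementations usually re-seed them.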
11. K-means example, step 1
Pick 3 initial cluster centers (randomly)
12. K-means example, step 2
Assign each point to the closest cluster center
13. K-means example, step 3
Move each cluster center to the mean of its cluster
14. K-means example, step 4
Reassign points that are now closest to a different cluster center. Q: Which points are reassigned?
15. K-means example, step 4 (continued)
A: three points (highlighted in the animation)
16. K-means example, step 4b
Re-compute cluster means
17. K-means example, step 5
Move cluster centers to cluster means
18. Discussion
- Result can vary significantly depending on the initial choice of centres
- Can get trapped in a local minimum
- Example
- To increase the chance of finding the global optimum, restart with different random seeds (see the sketch below)
- Use the total distance from instances to their corresponding centres as the error measure
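A sketch of the restart idea, reusing the kmeans function from the sketch above (a hypothetical helper, not a library call) and keeping the run with the lowest total distance from instances to their centres:

```python
import numpy as np

def total_distance(X, centres, assign):
    # error measure: total distance from instances to their corresponding centres
    return float(np.linalg.norm(X - centres[assign], axis=1).sum())

def kmeans_restarts(X, k, n_restarts=10):
    best = None
    for seed in range(n_restarts):                  # different random seeds
        centres, assign = kmeans(X, k, seed=seed)   # kmeans as sketched on slide 10
        err = total_distance(X, centres, assign)
        if best is None or err < best[0]:
            best = (err, centres, assign)
    return best                                     # (error, centres, assignments) of the best run
```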
19. K-means clustering summary
- Advantages
- simple, understandable
- fast to converge
- instances automatically assigned to clusters.
- Disadvantages
- must pick number of clusters beforehand
- all items forced into a cluster
- too sensitive to outliers
- no interpretation of error measure.
20. K-means variations
- K-medoids: instead of the mean, use the median of each cluster
- Mean of 1, 3, 5, 7, 9 is 5
- Mean of 1, 3, 5, 7, 1009 is 205
- Median of 1, 3, 5, 7, 1009 is 5
- Median advantage: not affected by extreme values (checked in the snippet below)
- For large databases, use sampling
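The arithmetic above can be checked directly with Python's statistics module:

```python
import statistics

cluster = [1, 3, 5, 7, 1009]             # a cluster containing one extreme value
print(statistics.mean([1, 3, 5, 7, 9]))  # 5
print(statistics.mean(cluster))          # 205 -- dragged towards the outlier
print(statistics.median(cluster))        # 5   -- unaffected by the extreme value
```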
21. Probabilistic Clustering
- The goal of density estimation is to assign a probability density to instances
- Useful for clustering, novelty detection, classification and estimating missing values
- Allows for soft cluster membership in the range [0, 1]
- Focus is on mixture models; kernel density estimation is another popular approach
22. Mixture Models
- Write the probability density as a weighted sum of K probability density functions: p(x) = Σ_j P(j) p(x|j) = Σ_j α_j φ_j(x).
- Mixing coefficients satisfy Σ_j P(j) = 1 and 0 ≤ P(j) ≤ 1.
- For many choices of φ_j we can approximate any continuous density to arbitrary accuracy (for large enough K).
- Here we choose each φ_j to be Gaussian with spherical covariance (same variance in all directions); a sketch follows below.
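A short Python sketch of evaluating such a spherical-Gaussian mixture density, assuming the component means, per-component variances and mixing coefficients are given (the names are illustrative):

```python
import numpy as np

def spherical_gaussian(x, mean, var):
    """Gaussian density with covariance var * I (same variance in all directions)."""
    d = len(mean)
    diff = x - mean
    return np.exp(-diff @ diff / (2 * var)) / (2 * np.pi * var) ** (d / 2)

def mixture_density(x, means, variances, coeffs):
    # p(x) = sum_j P(j) p(x|j), with the mixing coefficients P(j) summing to 1
    return sum(a * spherical_gaussian(x, m, v)
               for a, m, v in zip(coeffs, means, variances))

means = [np.array([0.0, 0.0]), np.array([4.0, 4.0])]
variances = [1.0, 2.0]
coeffs = [0.3, 0.7]
print(mixture_density(np.array([1.0, 1.0]), means, variances, coeffs))
```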
23. Training Mixture Models
- Use the negative log likelihood −log p(x) as the error measure.
- Must be careful: p(x) becomes infinite if a component mean matches an instance and the variance → 0.
- If we knew which kernel generated each point, we could just estimate the mean and variance of each kernel from the points it generated.
- We do not know which kernel generated each point.
24. EM algorithm
- EM: Expectation-Maximization, named after its two alternating steps.
- The analogue of finding the nearest centre in K-means is to calculate the posterior probabilities (responsibilities) P(j|x) = p(x|j)P(j)/p(x)
- This represents the probability that component j was responsible for generating x. This is the E-step.
- Now re-estimate the kernel parameters using the responsibilities as weights, w_i = P(j|x_i): μ_j = (w_1 x_1 + … + w_n x_n)/(w_1 + … + w_n). This is the M-step (a sketch follows below).
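A sketch of one EM iteration for this spherical-Gaussian mixture in Python; the variance and mixing-coefficient updates shown here follow the same responsibility-weighted pattern as the mean update on the slide, and all names are illustrative:

```python
import numpy as np

def em_step(X, means, variances, coeffs):
    """One EM iteration for a spherical-Gaussian mixture."""
    n, d = X.shape
    k = len(means)
    # E-step: responsibilities P(j|x_i) = p(x_i|j) P(j) / p(x_i)
    resp = np.zeros((n, k))
    for j in range(k):
        sq = ((X - means[j]) ** 2).sum(axis=1)
        resp[:, j] = coeffs[j] * np.exp(-sq / (2 * variances[j])) \
                     / (2 * np.pi * variances[j]) ** (d / 2)
    resp /= resp.sum(axis=1, keepdims=True)          # normalise by p(x_i)
    # M-step: re-estimate parameters with responsibilities as weights
    for j in range(k):
        w = resp[:, j]
        means[j] = (w[:, None] * X).sum(axis=0) / w.sum()   # mu_j = sum_i w_i x_i / sum_i w_i
        variances[j] = (w * ((X - means[j]) ** 2).sum(axis=1)).sum() / (d * w.sum())
        coeffs[j] = w.sum() / n                              # P(j)
    return means, variances, coeffs

X = np.array([[0.1, 0.0], [0.0, 0.2], [4.1, 3.9], [3.8, 4.2]])
params = ([np.array([0.0, 0.0]), np.array([3.0, 3.0])], [1.0, 1.0], [0.5, 0.5])
for _ in range(20):
    params = em_step(X, *params)
print(params)
```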
25. EM algorithm discussion
- Iterate until convergence (the error measure changes only by a small amount)
- Can be extended to allow for missing values (cf. naïve Bayes); this is a general advantage of many probabilistic models
- Other covariance structures are possible; can also use different distributions (e.g. exponential) or model discrete data (e.g. Bernoulli).
- If σ → 0, the responsibilities become 1 or 0, which gives us K-means back again.
26. Hierarchical clustering
- Bottom up
- Start with single-instance clusters
- At each step, join the two closest clusters
- Design decision: the distance between clusters
- E.g. two closest instances in the clusters vs. distance between the cluster means
- Top down
- Start with one universal cluster
- Find two clusters
- Proceed recursively on each subset
- Can be very fast
- Both methods produce a dendrogram; clusters are formed by cutting the tree (see the sketch below).
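A bottom-up sketch, assuming numeric data and that SciPy is available; method='single' uses the two closest instances as the cluster distance (method='centroid' would use the distance between means), and cutting the dendrogram gives the flat clusters:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [5.2, 4.9], [9.0, 0.5]])

# bottom up: start with single-instance clusters, repeatedly join the two closest
Z = linkage(X, method='single')                    # 'single' = distance of the two closest instances
labels = fcluster(Z, t=3, criterion='maxclust')    # cut the dendrogram into 3 flat clusters
print(labels)
```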
27. Incremental clustering
- Heuristic approach (COBWEB/CLASSIT)
- Form a hierarchy of clusters incrementally
- Start
- tree consists of empty root node
- Then
- add instances one by one
- update tree appropriately at each stage
- to update, find the right leaf for an instance
- may involve restructuring the tree
- Base update decisions on category utility
28. Clustering weather data
Class (no/yes) for information only
29. Clustering weather data
Merge the best host and the runner-up
Consider splitting the best host if merging doesn't help
Restructuring prevents the order of examples from having too much impact
30. Final hierarchy
This is a common disadvantage of incremental algorithms: they find locally good solutions and are affected by the order in which examples are presented.
Oops! a and b are actually very similar.
31. Example: the iris data (subset)
32. Clustering with cutoff
33. Category utility
- Category utility: a quadratic loss function defined on conditional probabilities
- If every instance is in a different category, the numerator reaches its maximum (see the sketch below)
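For reference, a sketch of the category utility formula as it is commonly written for clusters C_1, ..., C_k over nominal attributes a_i with values v_ij (the exact notation of the original slide is not recoverable from this text):

```latex
\mathrm{CU}(C_1,\dots,C_k)
  \;=\; \frac{1}{k}\sum_{l=1}^{k}\Pr[C_l]
        \sum_{i}\sum_{j}\Bigl(\Pr[a_i = v_{ij}\mid C_l]^2 - \Pr[a_i = v_{ij}]^2\Bigr)
```

With every instance in its own cluster, each conditional probability is 0 or 1, so the numerator reaches its maximum, n − Σ_i Σ_j Pr[a_i = v_ij]², where n is the number of attributes; this is the value discussed on the next slide.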
34. Overfitting-avoidance heuristic
- If every instance gets put into a different category, the numerator becomes maximal: n − Σ_i Σ_j P(a_i = v_ij)², where n is the number of attributes.
- So without k in the denominator of the CU formula, every cluster would consist of one instance!
35. Other Clustering Approaches
- Bayesian clustering
- SOM: self-organizing maps
- GTM: Generative Topographic Mapping
- Sammon mapping: a distance-preserving projection
- Neuroscale: a Radial Basis Function distance-preserving map
- ...
36. Discussion
- Can interpret clusters by using supervised learning
- learn a classifier based on the clusters (see the sketch below)
- Decrease dependence between attributes?
- pre-processing step
- E.g. use principal component analysis
- Can be used to fill in missing values
- Key advantage of probabilistic clustering
- Can estimate the likelihood of the data
- Use it to compare different models objectively
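A sketch of that interpretation step, assuming scikit-learn is available: cluster first, then train a decision tree on the cluster labels and read the tree as a description of the clusters (the iris data from slide 31 is used for illustration):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(iris.data)

# a supervised learner trained on the cluster labels describes the clusters
tree = DecisionTreeClassifier(max_depth=2).fit(iris.data, labels)
print(export_text(tree, feature_names=iris.feature_names))
```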
37. Clustering Summary
- unsupervised
- many approaches
- K-means: simple, sometimes useful
- K-medoids is less sensitive to outliers
- Hierarchical clustering works for symbolic attributes
- Evaluation is a problem