1
Clustering
2
Outline
  • Introduction
  • K-means clustering
  • Probabilistic clustering: Gaussian mixture models
  • Hierarchical clustering: COBWEB

3
Classification vs. Clustering
Classification (supervised learning): learns a
method for predicting the instance class from
pre-labeled (classified) instances
4
Clustering
Unsupervised learning: finds a natural grouping
of instances given unlabeled data
5
Examples of Clustering Applications
  • Marketing: discover customer groups and use them
    for targeted marketing and re-organization
  • Astronomy: find groups of similar stars and
    galaxies
  • Earthquake studies: observed earthquake
    epicentres should cluster along continental
    faults
  • Genomics: find groups of genes with similar
    patterns of expression

6
Clustering Methods
  • Many different methods and algorithms
  • For numeric and/or symbolic data
  • Deterministic vs. probabilistic (hard vs. soft)
  • Exclusive vs. overlapping
  • Hierarchical vs. flat
  • Top-down vs. bottom-up

7
Clusters: exclusive vs. overlapping
Simple 2-D representation: non-overlapping clusters
Venn diagram: overlapping clusters


8
Clustering Evaluation
  • Manual inspection
  • Benchmarking on existing labels (though why
    should unsupervised learning discover groupings
    based on those particular labels?)
  • Cluster quality measures
  • distance measures
  • high similarity within a cluster, low across
    clusters (see the sketch below)
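
A minimal sketch of the "high within-cluster similarity, low across-cluster similarity" idea, using the silhouette score from scikit-learn (assumed available); the two-blob data is made up for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
# Two well-separated blobs of 2-D points (illustrative data)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
# Silhouette near 1: tight clusters that are far apart; near 0 or negative: poor clustering
print("silhouette:", silhouette_score(X, labels))
```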

9
The distance function
  • Simplest case: one numeric attribute A
  • Distance(X,Y) = |A(X) − A(Y)|
  • Several numeric attributes:
  • Distance(X,Y) = Euclidean distance between X and Y
  • Nominal attributes: distance is set to 1 if
    values are different, 0 if they are equal
  • Are all attributes equally important?
  • Weighting the attributes might be necessary
    (see the sketch below)
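
A sketch of these distance ideas: Euclidean contributions for numeric attributes, 0/1 mismatch for nominal ones, with optional per-attribute weights. How the two kinds of contribution are combined here is an illustrative assumption, not a prescription from the slide.

```python
import math

def distance(x, y, numeric_idx, nominal_idx, weights=None):
    """Weighted mixed distance between two instances x and y (tuples of attribute values)."""
    weights = weights or {}
    d2 = 0.0
    for i in numeric_idx:                          # numeric: squared difference
        d2 += weights.get(i, 1.0) * (x[i] - y[i]) ** 2
    for i in nominal_idx:                          # nominal: 1 if different, 0 if equal
        d2 += weights.get(i, 1.0) * (0.0 if x[i] == y[i] else 1.0)
    return math.sqrt(d2)

# Example: two numeric attributes and one nominal attribute
print(distance((1.0, 2.0, "red"), (4.0, 6.0, "blue"),
               numeric_idx=[0, 1], nominal_idx=[2]))
```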

10
Simple Clustering: K-means
  • Works with numeric data only
  1. Pick a number (K) of cluster centres
  2. Initialise the cluster centre positions (at
     random)
  3. Assign every instance to its nearest cluster
     centre (e.g. using Euclidean distance)
  4. Move each cluster centre to the mean of its
     assigned items
  5. Repeat steps 3 and 4 until convergence (no change
     in cluster assignments); a NumPy sketch of this
     loop follows below
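
A minimal NumPy sketch of the loop described above (random initial centres, assign to nearest centre, move centres to cluster means, repeat until the assignments stop changing).

```python
import numpy as np

def kmeans(X, k, seed=0):
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    centres = X[rng.choice(len(X), size=k, replace=False)]   # steps 1-2: random initial centres
    assignments = None
    while True:
        # step 3: assign every instance to its nearest centre (Euclidean distance)
        dists = np.linalg.norm(X[:, None, :] - centres[None, :, :], axis=2)
        new_assignments = dists.argmin(axis=1)
        if assignments is not None and np.array_equal(new_assignments, assignments):
            return centres, assignments                       # step 5: converged, no change
        assignments = new_assignments
        for j in range(k):                                    # step 4: move centre to cluster mean
            if np.any(assignments == j):
                centres[j] = X[assignments == j].mean(axis=0)
```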

11
K-means example, step 1
Pick 3 initial cluster centers (randomly)
12
K-means example, step 2
Assign each point to the closest cluster center
13
K-means example, step 3
Move each cluster center to the mean of its
cluster
14
K-means example, step 4
Reassign the points that are now closest to a
different cluster center. Q: Which points are reassigned?
15
K-means example, step 4
A: three points are reassigned
16
K-means example, step 4b
Re-compute cluster means
17
K-means example, step 5
Move cluster centers to cluster means
18
Discussion
  • Results can vary significantly depending on the
    initial choice of centres
  • Can get trapped in a local minimum
  • Example
  • To increase the chance of finding the global optimum:
    restart with different random seeds
  • Use the total distance from instances to their
    corresponding centres as the error measure
    (see the sketch below)
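
A sketch of the restart idea, building on the kmeans() sketch above: run K-means from several random seeds and keep the run with the smallest total distance from instances to their centres (the error measure suggested on the slide).

```python
import numpy as np

def best_of_restarts(X, k, n_restarts=10):
    X = np.asarray(X, dtype=float)
    best = None
    for seed in range(n_restarts):
        centres, assign = kmeans(X, k, seed=seed)
        # total distance from each instance to its assigned centre
        error = np.linalg.norm(X - centres[assign], axis=1).sum()
        if best is None or error < best[0]:
            best = (error, centres, assign)
    return best  # (error, centres, assignments) of the best run
```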

19
K-means clustering summary
  • Advantages
  • simple, understandable
  • fast to converge
  • instances automatically assigned to clusters.
  • Disadvantages
  • must pick number of clusters beforehand
  • all items forced into a cluster
  • too sensitive to outliers
  • no interpretation of error measure.

20
K-means variations
  • K-medoids: instead of the mean, use the median of
    each cluster
  • Mean of 1, 3, 5, 7, 9 is 5
  • Mean of 1, 3, 5, 7, 1009 is 205
  • Median of 1, 3, 5, 7, 1009 is 5
  • Median advantage: not affected by extreme values
  • For large databases, use sampling

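A quick NumPy check of the numbers on this slide: the mean is pulled away by the outlier 1009, the median is not.

```python
import numpy as np

print(np.mean([1, 3, 5, 7, 9]))       # 5.0
print(np.mean([1, 3, 5, 7, 1009]))    # 205.0  (dragged by the outlier)
print(np.median([1, 3, 5, 7, 1009]))  # 5.0    (unaffected by the outlier)
```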
21
Probabilistic Clustering
  • Goal of density estimation is to assign a
    probability density to instances
  • Useful for clustering, novelty detection,
    classification and estimating missing values
  • Allows for soft cluster membership in the range
    [0, 1]
  • Focus on mixture models; kernel density
    estimation is another popular approach

22
Mixture Models
  • Write the probability density as a weighted sum
    of K probability density functions:
    p(x) = Σ_j P(j) p(x|j) = Σ_j a_j f_j(x)
  • Mixing coefficients satisfy Σ_j P(j) = 1 and
    0 ≤ P(j) ≤ 1
  • For many choices of f_j we can approximate any
    continuous density to arbitrary accuracy (for
    large enough K)
  • Here we choose each f_j to be Gaussian with
    spherical covariance (same variance in all
    directions)
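
A sketch of the mixture density p(x) = Σ_j P(j) f_j(x) with spherical Gaussian components; the particular weights, means and variances below are made up for illustration.

```python
import numpy as np

def spherical_gaussian(x, mean, var):
    """Density of a d-dimensional Gaussian with mean `mean` and covariance var * I."""
    diff = np.asarray(x, dtype=float) - np.asarray(mean, dtype=float)
    d = diff.shape[0]
    return np.exp(-diff @ diff / (2 * var)) / (2 * np.pi * var) ** (d / 2)

def mixture_density(x, weights, means, variances):
    assert np.isclose(sum(weights), 1.0)       # mixing coefficients must sum to 1
    return sum(w * spherical_gaussian(x, m, v)
               for w, m, v in zip(weights, means, variances))

print(mixture_density([0.5, 0.5],
                      weights=[0.6, 0.4],
                      means=[[0, 0], [3, 3]],
                      variances=[1.0, 2.0]))
```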

23
Training Mixture Models
  • Use the negative log likelihood, −log p(x), as the
    error measure
  • Must be careful: p(x) becomes infinite if a component
    mean coincides with an instance and its variance → 0
  • If we knew which kernel generated each point, we could
    just estimate the mean and variance of each
    kernel from the points it generated
  • We do not know which kernel generated each point

24
EM algorithm
  • Expectation Maximization (EM): the name refers to
    the two steps below
  • The step equivalent to finding the nearest centre
    in K-means is to calculate the posterior probabilities
    (responsibilities): P(j|x) = p(x|j) P(j) / p(x)
  • P(j|x) represents the probability that component j was
    responsible for generating x. This is the
    E-step.
  • Now re-estimate the kernel parameters using the
    responsibilities as weights, w_i = P(j|x_i):
    m_j = (w_1 x_1 + … + w_n x_n) / (w_1 + … + w_n). This is the
    M-step.
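
A sketch of one EM iteration for a spherical Gaussian mixture, reusing spherical_gaussian() from the previous sketch. The E-step computes the responsibilities P(j|x_i); the M-step re-estimates the means with the responsibilities as weights, and (going slightly beyond what the slide shows) also updates the variances and mixing coefficients in the same weighted way.

```python
import numpy as np

def em_step(X, weights, means, variances):
    X = np.asarray(X, dtype=float)
    n, k = len(X), len(weights)
    # E-step: responsibilities r[i, j] = P(j) p(x_i | j) / p(x_i)
    r = np.array([[weights[j] * spherical_gaussian(X[i], means[j], variances[j])
                   for j in range(k)] for i in range(n)])
    r /= r.sum(axis=1, keepdims=True)
    # M-step: weighted re-estimation of the parameters
    nj = r.sum(axis=0)                               # effective number of points per component
    new_means = (r.T @ X) / nj[:, None]              # m_j = sum_i w_i x_i / sum_i w_i
    new_vars = np.array([(r[:, j] * ((X - new_means[j]) ** 2).sum(axis=1)).sum()
                         / (nj[j] * X.shape[1]) for j in range(k)])
    new_weights = nj / n
    return new_weights, new_means, new_vars
```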

25
EM algorithm discussion
  • Iterate until convergence (the error measure changes
    by only a small amount)
  • Can be extended to allow for missing values (cf.
    naïve Bayes); this is a general advantage of many
    probabilistic models
  • Other covariance structures are possible; can also
    use different distributions (e.g. exponential) or
    model discrete data (e.g. Bernoulli)
  • If σ → 0, the responsibilities become 1 or 0, which
    gives us K-means back again

26
Hierarchical clustering
  • Bottom up
  • Start with single-instance clusters
  • At each step, join the two closest clusters
  • Design decision: distance between clusters
  • E.g. distance between the two closest instances vs.
    distance between the cluster means
  • Top down
  • Start with one universal cluster
  • Find two clusters
  • Proceed recursively on each subset
  • Can be very fast
  • Both methods produce a dendrogram; clusters are
    formed by cutting the tree (see the sketch below)
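
A sketch of bottom-up (agglomerative) clustering with SciPy (assumed available): build the dendrogram with single linkage (distance between clusters = two closest instances; method="centroid" would use the distance between means instead), then cut the tree into a fixed number of clusters.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (20, 2)), rng.normal(6, 1, (20, 2))])  # illustrative data

Z = linkage(X, method="single")                    # dendrogram via closest-instance distance
labels = fcluster(Z, t=2, criterion="maxclust")    # "cut the tree" into 2 clusters
print(labels)
```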

27
Incremental clustering
  • Heuristic approach (COBWEB/CLASSIT)
  • Form a hierarchy of clusters incrementally
  • Start:
  • the tree consists of an empty root node
  • Then:
  • add instances one by one
  • update the tree appropriately at each stage
  • to update, find the right leaf for an instance
  • this may involve restructuring the tree
  • Base update decisions on category utility

28
Clustering weather data
(Figures for steps 1, 2 and 3.)
Class (no/yes) for information only
29
Clustering weather data
(Figures for steps 4 and 5.)
Merge best host and runner-up
Consider splitting the best host if merging doesn't help
Restructuring prevents the order of examples from having
too much impact
30
Final hierarchy
This is a common disadvantage of incremental
algorithms: they find locally good solutions and are
affected by the order in which the examples are presented
Oops! a and b are actually very similar
31
Example: the iris data (subset)
32
Clustering with cutoff
33
Category utility
  • Category utility: a quadratic loss function defined
    on conditional probabilities
  • Every instance in a different category ⇒ the numerator
    reaches its maximum, n − Σ_i Σ_j Pr[a_i = v_ij]²,
    where n is the number of attributes
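
For reference, category utility over clusters C_1, …, C_k, attributes a_i and attribute values v_ij is commonly written (as in the COBWEB literature) as:

```latex
\mathrm{CU}(C_1,\dots,C_k) \;=\;
\frac{\sum_{l} \Pr[C_l] \sum_i \sum_j
      \left( \Pr[a_i = v_{ij} \mid C_l]^2 - \Pr[a_i = v_{ij}]^2 \right)}{k}
```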
34
Overfitting-avoidance heuristic
  • If every instance gets put into a different
    category, the numerator becomes maximal (the maximum
    value of CU): n − Σ_i Σ_j Pr[a_i = v_ij]²
  • where n is the number of attributes
  • So without k in the denominator of the
    CU formula, every cluster would consist of one
    instance!
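
A one-line check of the claim above: with every instance in its own cluster, each conditional probability Pr[a_i = v_ij | C_l] is either 0 or 1, so

```latex
\sum_i \sum_j \Pr[a_i = v_{ij} \mid C_l]^2 = n
\quad\Longrightarrow\quad
\text{numerator} = n - \sum_i \sum_j \Pr[a_i = v_{ij}]^2 ,
```

where n is the number of attributes.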
35
Other Clustering Approaches
  • Bayesian clustering
  • SOM: self-organizing maps
  • GTM: Generative Topographic Mapping
  • Sammon mapping: distance-preserving lookup table
  • Neuroscale: Radial Basis Function
    distance-preserving map
  • ...

36
Discussion
  • Can interpret clusters by using supervised
    learning
  • learn a classifier based on the clusters
    (see the sketch below)
  • Decrease dependence between attributes?
  • use a pre-processing step
  • e.g. principal component analysis
  • Can be used to fill in missing values
  • Key advantage of probabilistic clustering:
  • can estimate the likelihood of the data
  • use it to compare different models objectively
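
A sketch of two of the ideas above, assuming scikit-learn is available: PCA as a pre-processing step to reduce dependence between attributes, K-means on the transformed data, and then a decision tree trained on the cluster labels so the clusters can be read as simple rules. This particular pipeline is one illustrative combination, not a recipe from the slides.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 3)), rng.normal(4, 1, (50, 3))])  # illustrative data

X_pca = PCA(n_components=2).fit_transform(X)                # decorrelate / reduce attributes
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X_pca)

tree = DecisionTreeClassifier(max_depth=2).fit(X, labels)   # supervised model of the clusters
print(export_text(tree))                                    # human-readable cluster description
```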

37
Clustering Summary
  • unsupervised
  • many approaches
  • K-means: simple, sometimes useful
  • K-medoids is less sensitive to outliers
  • Hierarchical clustering works for symbolic
    attributes
  • Evaluation is a problem