Theoretical Foundations of Clustering MLSS Tutorial - PowerPoint PPT Presentation



1
Theoretical Foundations of Clustering: MLSS Tutorial
  • Shai Ben-David
  • University of Waterloo,
  • Waterloo,
  • Canada

2
The Theory-Practice Gap
Clustering is one of the most widely used tools
for exploratory data analysis. Social Sciences,
Biology, Astronomy, Computer Science, ... all
apply clustering to gain a first understanding
of the structure of large data sets.
Yet there exists distressingly little
theoretical understanding of clustering.
3
Overview of this tutorial
  • What is clustering? Can we formally define it?
  • Model selection issues: How would you choose
    the best clustering paradigm for your data? How
    should you choose the number of clusters?
  • Computational complexity issues: Can good
    clusterings be computed efficiently?

4
Questions that research on the fundamentals of
clustering should address
  • Can clustering be given a formal and general
    definition?
  • What is a good clustering?
  • Can we distinguish clusterable from
    structureless data?
  • Can we distinguish meaningful clustering from
    random structure?

5
Inherent Obstacles
Clustering is not well defined. There is a wide
variety of different clustering tasks, with
different (often implicit) measures of quality.
6
There are Many Clustering Tasks
  • Clustering is an ill-defined problem

There are many different clustering tasks,
leading to different clustering paradigms.
8
Some more examples
9
Some real examples of clustering ambiguity
  • Cluster paintings: by painter vs. topic
  • Cluster speech recordings: by speaker vs. content
  • Cluster text documents: by sentiment vs. topic

10
Other Inherent Obstacles
In most practical clustering tasks there is no
clear ground truth to evaluate your solution
by (in contrast with classification tasks, in
which you can use a held-out labeled set to
evaluate the classifier).

11
Examples of some popular clustering paradigms
Linkage Clustering
  • Given a set of points and distances between
    them, we extend the distance function to apply
    to any pair of domain subsets. Then the
    clustering algorithm proceeds in stages.
  • In each stage the two clusters that have the
    minimal distance between them are merged.
  • The user has to set the stopping criterion:
    when should the merging stop?
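The staged merging just described can be sketched in code. This is an illustrative Python sketch, not code from the tutorial: the subset distance is taken to be the minimum over cross-cluster point pairs (single linkage), and the stopping criterion is a target number of clusters k.

```python
from itertools import combinations

def single_linkage(points, dist, k):
    """Merge the two clusters at minimal single-linkage distance
    (minimum over cross-cluster point pairs) until k clusters remain."""
    clusters = [{p} for p in points]
    while len(clusters) > k:
        # find the pair of clusters with minimal single-linkage distance
        i, j = min(combinations(range(len(clusters)), 2),
                   key=lambda ij: min(dist(a, b)
                                      for a in clusters[ij[0]]
                                      for b in clusters[ij[1]]))
        clusters[i] |= clusters.pop(j)
    return clusters

# Toy 1-D example: two well-separated groups
pts = [0.0, 0.1, 0.2, 5.0, 5.1]
out = single_linkage(pts, lambda a, b: abs(a - b), k=2)
print(sorted(sorted(c) for c in out))
# → [[0.0, 0.1, 0.2], [5.0, 5.1]]
```

With k left as a user parameter, this is exactly the "when should the merging stop" decision the slide points out.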

12
Single Linkage Clustering - early stopping
13
Single Linkage Clustering correct stopping
14
Single Linkage Clustering late stopping
15
Examples of some popular clustering paradigms
Center-Based Clustering
  • The algorithm picks k center points,
  • and the clusters are defined by assigning
    each domain point to the center closest to it.
  • The algorithm aims to minimize some cost
    function that reflects how compact the
    resulting clusters are.
  • Center-based algorithms differ in their choice
    of the cost function (k-means, sum of distances,
    k-median, and more).
  • The number of clusters, k, is picked by the
    user.
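The best-known center-based method is Lloyd's algorithm for the k-means cost. A minimal illustrative sketch (not code from the slides; initialization is naively taken as the first k points, whereas real implementations use smarter seeding such as k-means++):

```python
def lloyd_kmeans(points, k, iters=20):
    """Lloyd's algorithm for the k-means objective: alternate between
    assigning each point to its nearest center and recomputing each
    center as the mean of its cluster.  Points are tuples of floats."""
    centers = [tuple(p) for p in points[:k]]   # naive deterministic init
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k),
                          key=lambda c: sum((a - b) ** 2
                                            for a, b in zip(p, centers[c])))
            clusters[nearest].append(p)
        # empty clusters keep their old center
        centers = [tuple(sum(xs) / len(cl) for xs in zip(*cl)) if cl
                   else centers[j]
                   for j, cl in enumerate(clusters)]
    return centers, clusters

pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centers, clusters = lloyd_kmeans(pts, k=2)
print(sorted(sorted(c) for c in clusters))
# → [[(0, 0), (0, 1), (1, 0)], [(10, 10), (10, 11), (11, 10)]]
```

Note that k is an input, matching the slide's point that the number of clusters is picked by the user.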

16
4-Means clustering example
17
Examples of some popular clustering paradigms
  • Single Linkage
  • The K-means algorithm
  • The K-means objective optimization.
  • Spectral clustering
  • The actual algorithms
  • The objective function justification.
  • EM algorithms over parameterized families of
    (mixtures of simple) distributions.

18
Common Solutions
Objective utility functions: Sum of In-Cluster
Distances, Average Distances to Center Points,
Cut Weight, etc. (Shmoys, Charikar, Meyerson,
...)
Consider a restricted set of distributions
(generative models), e.g., mixtures of
Gaussians (Dasgupta 99, Vempala et al. 03,
Kannan et al. 04, Achlioptas, McSherry 05).
Add structure: "Relevant Information", the
Information Bottleneck approach (Tishby,
Pereira, Bialek 99).
19
Common Solutions (2)
Focus on specific algorithmic paradigms:
Projections-based clustering (random/spectral)
(all the above papers);
Spectral-based representations (Meila and Shi,
Belkin, ...); the k-means algorithm.
Axiomatic approach: Postulate clustering axioms
that, ideally, every clustering approach should
satisfy; usually concludes with negative results
(e.g., Hartigan 1975, Puzicha, Hofmann, Buhmann
00, Kleinberg 03).
Many more...
20
Quest for a general Clustering theory
  • What can we say independently of any particular
    algorithm,
  • particular objective function,
  • or specific generative data model?

21
Many different clustering setups
  • Different inputs
  • Points in Euclidean space.
  • Arbitrary domain with a point-similarity measure.
  • A graph (e.g., social networks, web pages/links)
  • ...
  • Different outputs
  • Hierarchical (dendrograms)
  • Partitioning of the domain
  • Soft/probabilistic clusterings
  • ...

22
Our Basic Setting for Formal Discussion
  • For a finite domain set S, a dissimilarity
    function (DF) is a mapping d: S×S → R such
    that
  • d is symmetric,
  • and
  • d(x,y) = 0 iff x = y.
  • Our input: a dissimilarity function on S (or a
    matrix of pairwise distances between domain
    points).
  • Our output: a partition of S.
  • We wish to define the properties that
    distinguish clustering functions from other
    functions that output domain partitions.
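The two DF conditions can be checked mechanically. An illustrative Python sketch (not from the slides):

```python
def is_dissimilarity(S, d):
    """Check the DF conditions on a finite domain S:
    d is symmetric, and d(x, y) = 0 iff x = y.
    (Note: no triangle inequality is required of a DF.)"""
    return all(d(x, y) == d(y, x) and ((d(x, y) == 0) == (x == y))
               for x in S for y in S)

good = lambda x, y: abs(x - y)   # a valid DF on numbers
bad = lambda x, y: 0.0           # violates d(x, y) = 0 iff x = y
print(is_dissimilarity([1, 2, 3], good), is_dissimilarity([1, 2, 3], bad))
# → True False
```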

23
  • Input: a set of points with pairwise distances.

Output: a partition of the domain, e.g.,
{{x1,x7}, {x2,x5,x9}, ...}
24
Kleinberg's Axioms
  • Scale Invariance
  • F(λd) = F(d) for all d and all strictly
    positive λ.
  • Richness
  • For any finite domain S,
  • {F(d) : d is a DF over S} = {P : P a partition
    of S}
  • Consistency
  • If d' equals d, except for shrinking distances
    within clusters of F(d) or stretching
    between-cluster distances, then F(d') = F(d).
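As a concrete illustration of Scale Invariance, here is a Python sketch (illustrative, under the assumption that F is single linkage stopped at k connected components, one of the stopping rules on the next slide). Scaling every distance by a positive constant never changes which pair of clusters is currently closest, so the output is unchanged:

```python
from itertools import combinations

def k_components(S, d, k):
    """Single linkage stopped at k clusters: repeatedly merge the two
    clusters at minimal single-linkage distance.  Satisfies Scale
    Invariance: F(lam * d) = F(d) for every lam > 0."""
    clusters = [frozenset([x]) for x in S]
    while len(clusters) > k:
        i, j = min(combinations(range(len(clusters)), 2),
                   key=lambda ij: min(d(a, b)
                                      for a in clusters[ij[0]]
                                      for b in clusters[ij[1]]))
        clusters[i] = clusters[i] | clusters.pop(j)
    return set(clusters)

S = [0.0, 1.0, 1.5, 6.0, 6.2]
d = lambda a, b: abs(a - b)
for lam in (0.5, 3.0):
    scaled = lambda a, b, lam=lam: lam * d(a, b)
    # F(lam * d) equals F(d) for any positive scaling lam
    assert k_components(S, scaled, 2) == k_components(S, d, 2)
```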

25
Note that any pair of the axioms is realizable
  • Consider Single-Linkage with different
    stopping criteria:
  • k connected components.
  • Distance-r stopping.
  • Scale-α stopping:
  • add edges as long as their length is
  • at most α·(max distance).

26
The Surprising Result
  • Theorem: There exists no clustering function
  • (that satisfies all three of Kleinberg's
    axioms simultaneously).

27
Kleinberg's Impossibility Result
  • There exists no clustering function.
  • Proof (figure omitted).

28
What is the Take-Home Message?
  • A popular interpretation of Kleinberg's result
    is (roughly):
  • "It's impossible to axiomatize clustering."
  • But what that paper shows is (only):
  • "These specific three axioms cannot work
    together."

29
Ideal Theory
  • We would like the axioms to be such that
  • 1. Any clustering method satisfies all the
    axioms,
  • and
  • 2. Any function that is clearly not a
    clustering fails to satisfy at least one of
    the axioms.
  • (This is probably too much to hope for.)
  • We would like to have a list of simple
    properties so that major clustering methods are
    distinguishable from each other using these
    properties.

30
Axioms to guide a taxonomy of clustering paradigms
  • The goal is to generate a variety of axioms (or
    properties) over a fixed framework, so that
    different clustering approaches could be
    classified by the different subsets of axioms
    they satisfy.

(Figure: a spectrum from Axioms to Properties.)
31
Types of Axioms/Properties
  • Richness requirements
  • E.g., relaxations of Kleinberg's richness,
    such as:
  • {F(d) : d is a DF over S} = {P : P a partition
    of S into k sets}
  • Invariance/Robustness/Stability requirements.
  • E.g., Scale Invariance, Consistency, robustness
    to perturbations of d (smoothness of F), or
    stability w.r.t. sampling of S.

32
Relaxations of Consistency
  • Local Consistency
  • Let C1, ..., Ck be the clusters of F(d).
  • For every λ0 ≥ 1 and positive λ1, ..., λk ≤ 1,
    if d' is defined by
  • d'(a,b) = λi·d(a,b) if a and b are in Ci,
  • d'(a,b) = λ0·d(a,b) if a, b are not in the same
    F(d)-cluster,
  • then F(d') = F(d).
  • Is there any known clustering method for which
    it fails?

33
Some more structure
  • For partitions P1, P2 of {1, ..., m}, say that
    P1 refines P2 if every cluster of P1 is
    contained in some cluster of P2.
  • A collection C = {Pi} is a chain if, for any
    P, Q in C, one of them refines the other.
  • A collection of partitions is an antichain if
    no partition in it refines another.
  • Kleinberg's impossibility result can be
    rephrased as:
  • If F is Scale Invariant and Consistent, then
    its range is an antichain.
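The refinement relation and the antichain condition are straightforward to express in code. An illustrative sketch (partitions represented as collections of frozensets):

```python
def refines(P1, P2):
    """P1 refines P2: every cluster of P1 is contained in some
    cluster of P2."""
    return all(any(c1 <= c2 for c2 in P2) for c1 in P1)

def is_antichain(partitions):
    """No partition in the collection (properly) refines another."""
    return not any(p != q and refines(p, q)
                   for p in partitions for q in partitions)

fine   = {frozenset({1}), frozenset({2}), frozenset({3, 4})}
coarse = {frozenset({1, 2}), frozenset({3, 4})}
print(refines(fine, coarse), refines(coarse, fine))  # → True False
print(is_antichain([fine, coarse]))                  # → False
```

In these terms, a range containing both `fine` and `coarse` is not an antichain, so no Scale-Invariant, Consistent F could output both.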

34
Relaxations of Consistency
  • Refinement Consistency
  • Same as Consistency (shrink in-cluster
    distances, stretch between-cluster distances),
    but we relax the requirement F(d') = F(d) to:
  • one of F(d), F(d') is a refinement of the
    other.
  • Note: A natural version of Single Linkage (join
    x, y iff d(x,y) < λ·max{d(s,t) : s,t in X})
    satisfies this + Scale Invariance + Richness.
  • So Kleinberg's impossibility result breaks
    down.
  • Should this be an axiom?
  • Is there any common clustering function that
    fails it?

35
More on Refinement Consistency
  • "Minimize Sum of In-Cluster Distances"
    satisfies it
  • (as well as Richness and Scale Invariance).
  • Center-based clustering fails to satisfy
    Refinement Consistency.
  • This is quite surprising, since the two look
    very much alike.

(Where d is Euclidean distance, and ci is the
center of mass of Ci.)
36
Hierarchical Clustering
  • Hierarchical clustering takes, on top of d, a
    coarseness parameter t.
  • For any fixed t, F(t,d) is a clustering
    function.
  • We require, for every d:
  • Cd = {F(t,d) : 0 ≤ t ≤ Max} is a chain.
  • F(0,d) = {{x} : x ∈ S} and F(Max,d) = {S}.
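One concrete family F(t,d) with these properties is threshold-based single linkage: cluster x and y together when they are connected by links of length at most t. An illustrative Python sketch (an assumed instantiation, not the tutorial's code):

```python
def threshold_partition(S, d, t):
    """F(t, d): connected components of the graph that links x and y
    whenever d(x, y) <= t.  Coarseness grows with t: all singletons
    at t = 0 (since d(x, y) = 0 iff x = y), the single cluster S at
    t = Max, and a chain of partitions in between."""
    parent = {x: x for x in S}          # union-find over the domain
    def find(x):
        while parent[x] != x:
            x = parent[x]
        return x
    for x in S:
        for y in S:
            if x != y and d(x, y) <= t:
                parent[find(x)] = find(y)
    comps = {}
    for x in S:
        comps.setdefault(find(x), set()).add(x)
    return {frozenset(c) for c in comps.values()}

S = [0.0, 1.0, 3.0, 7.0]
d = lambda a, b: abs(a - b)
print([len(threshold_partition(S, d, t)) for t in (0, 1, 2, 4)])
# → [4, 3, 2, 1]
```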

37
Hierarchical versions of the axioms
  • Scale Invariance: For any d and λ > 0,
  • {F(t,d) : t} = {F(t,λd) : t} (as sets of
    partitions).
  • Richness: For any finite domain S,
  • {{F(t,d)}t : d is a DF over S} = {C : C a chain
    of partitions of S (with the needed Min and Max
    partitions)}.
  • Consistency: If, for some t, d' is an
    F(t,d)-consistent transformation of d, then,
    for some t', F(t',d') = F(t,d).

38
Characterizing Single Linkage
  • Ordinal Clustering axiom:
  • If, for all w, x, y, z,
  • d(w,x) < d(y,z) iff d'(w,x) < d'(y,z),
  • then {F(t,d) : t} = {F(t,d') : t} (as sets of
    partitions).
  • (Note that this implies Scale Invariance.)
  • Hierarchical Richness + Consistency + Ordinal
    Clustering characterize Single Linkage
    clustering.

39
Other types of clustering
  • Edge detection (advantage to smooth contours)
  • Texture clustering
  • The professor's example.

40
A different setup for axiomatization: Measuring
the Quality of a Clustering
  • You get a data set.
  • You run, say, your 5-means clustering
    algorithm, and get a clustering C.
  • You compute its 5-means cost -- it's 0.7.
  • Can you conclude that C is a good clustering?
  • How can we verify that the structure described
    by C is not just noise?

41
Clustering Quality Measures
  • A clustering-quality measure is a function
  • m(dataset, clustering)
  • whose values reflect how good or cogent that
    clustering is.

42
Axiomatizing Quality Measures
  • Consistency:
  • Whenever d' is a C-consistent variant of d,
  • then m(C, X, d') ≥ m(C, X, d).
  • Scale Invariance:
  • For every positive λ, m(C, X, d) = m(C, X, λd).
  • Richness:
  • For each non-trivial clustering C of X,
  • there exists a distance function d over X
    such that C = Argmax_C' m(C', X, d).

43
An Additional Axiom: Isomorphism Invariance
  • Clusterings C and C' over (X,d) are isomorphic
    if there exists a distance-preserving
    automorphism φ: X → X
  • such that x, y share the same C-cluster iff
    φ(x) and φ(y) share the same C'-cluster.
  • Isomorphism Invariance:
  • If C and C' are isomorphic, then
  • m(C, X, d) = m(C', X, d).

44
Major gain (over Kleinberg's framework)
  • Every reasonable clustering-quality measure
    satisfies our axioms.
  • Clustering functions can be defined as
  • functions that optimize clustering quality.

45
Some examples of quality measures
  • Normalized clustering cost functions
  • (e.g., k-means, ratio-cut, k-median, etc.)
  • Variance ratio VR(C, X, d)
  • Relative margins:
  • the average ratio between the distance from a
    point to its cluster center and its distance to
    its second-closest cluster center.
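The relative-margins measure just described can be sketched directly from its definition. An illustrative Python reading (the tutorial's exact normalization may differ):

```python
def relative_margin(points, centers, dist):
    """Average, over all points, of the ratio between the distance to
    the nearest center and the distance to the second-nearest center.
    Values near 0 mean each point sits much closer to its own center
    than to any other, i.e. a well-separated clustering."""
    ratios = []
    for p in points:
        d0, d1 = sorted(dist(p, c) for c in centers)[:2]
        ratios.append(d0 / d1)
    return sum(ratios) / len(ratios)

# Two tight, well-separated groups give a small relative margin
pts = [0.0, 0.2, 10.0, 10.3]
print(relative_margin(pts, [0.1, 10.15], lambda a, b: abs(a - b)))
```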

47
Basic Open Questions
  • What do we want from a set of clustering
    axioms? (Meta-axiomatization)
  • How can the completeness of a set of axioms be
    defined/argued?
  • Is there a general property distinguishing,
  • say, linkage-based from center-based
    clusterings?
  • Any candidate general clustering properties
    that the axioms should prove?

48
Single Linkage Clustering