A Discriminative Framework for Clustering via Similarity Functions

1
A Discriminative Framework for Clustering via
Similarity Functions
Maria-Florina Balcan
Carnegie Mellon University
Joint with Avrim Blum and Santosh Vempala
2
Brief Overview of the Talk
Supervised Learning: learning from labeled data.
  • Good theoretical models: PAC, SLT, kernels and similarity functions.
Clustering: learning from unlabeled data.
  • Lack of good unified models; vague and difficult to reason about at a general technical level.
Our work fixes this: a PAC-style framework for clustering.
3
Clustering: Learning from Unlabeled Data
[Figure: documents clustered by topic, e.g., sports vs. fashion.]
S: a set of n objects (e.g., documents).
∃ a ground-truth clustering: each x has a label l(x) ∈ {1, ..., t} (e.g., its topic).
Goal: a hypothesis h of low error, where err(h) = min_σ Pr_{x ∈ S}[σ(h(x)) ≠ l(x)] (σ ranges over permutations of the cluster labels).
Problem: unlabeled data only!
But we have a similarity function!
4
Clustering: Learning from Unlabeled Data
Protocol
∃ a ground-truth clustering for S; i.e., each x in S has a label l(x) ∈ {1, ..., t}.
The similarity function K has to be related to the ground truth.
Input: S and a similarity function K.
Output: a clustering of small error.
5
Clustering: Learning from Unlabeled Data
Fundamental Question
What natural properties on a similarity function
would be sufficient to allow one to cluster well?
6
Contrast with Standard Approaches
Clustering: theoretical frameworks.
Mixture models:
  • Input: embedding into R^d.
  • Score algorithms based on error rate.
  • Strong probabilistic assumptions.
Approximation algorithms:
  • Input: graph or embedding into R^d.
  • Analyze algorithms that optimize various criteria over the edges.
  • Score algorithms based on approximation ratios.
Our Approach:
  • Input: graph or similarity information.
  • Score algorithms based on error rate.
  • No strong probabilistic assumptions; discriminative, not generative.
  • Much better when the input graph/similarity is based on heuristics, e.g., clustering documents by topic or web search results by category.
7
Condition that trivially works
What natural properties on a similarity function would be sufficient to allow one to cluster well?
K(x,y) > 0 for all x, y with l(x) = l(y); K(x,y) < 0 for all x, y with l(x) ≠ l(y).
8
What natural properties on a similarity function would be sufficient to allow one to cluster well?
Strict separation: all x are more similar to all y in their own cluster than to any z in any other cluster.
Problem: the same K can satisfy this for two very different, equally natural clusterings of the same data!
[Figure: example with K(x,y) = 1 within sub-clusters, K(x,y) = 0.5 within top-level clusters, and K(x,y) = 0 across clusters; both the coarse and the fine clustering satisfy the property.]
9
Relax Our Goals
1. Produce a hierarchical clustering such that the correct answer is approximately some pruning of it.
10
Relax Our Goals
1. Produce a hierarchical clustering such that the correct answer is approximately some pruning of it.
[Example tree: "All topics" splits into sports (tennis, soccer) and fashion (Lacoste, Gucci).]
2. Produce a list of clusterings such that at least one has low error.
Trade off the strength of the assumption against the size of the list.
Obtain a rich, general model.
11
Strict Separation Property
All x are more similar to all y in their own cluster than to any z in any other cluster.
Sufficient for hierarchical clustering (if K is symmetric).
Algorithm: Single-Linkage
  • Repeatedly merge the parts whose maximum pairwise similarity is highest (see the sketch below).
[Example tree: "All topics" splits into sports (tennis, soccer) and fashion (Lacoste, Gucci).]
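To make the merge rule above concrete, here is a minimal single-linkage sketch in Python. Representing K as a plain two-argument function and all names below are illustrative assumptions, not code from the original slides.

    def single_linkage_tree(points, K):
        """Bottom-up single linkage (sketch): start from singletons and
        repeatedly merge the two current parts whose MAXIMUM pairwise
        similarity is highest.  Under strict separation (and symmetric K),
        the ground-truth clustering is a pruning of the resulting tree."""
        clusters = [frozenset([x]) for x in points]
        merges = []  # each entry records (left part, right part, merged part)
        while len(clusters) > 1:
            # pick the pair of current clusters with the highest max similarity
            A, B = max(
                ((X, Y) for i, X in enumerate(clusters) for Y in clusters[i + 1:]),
                key=lambda pair: max(K(x, y) for x in pair[0] for y in pair[1]),
            )
            merged = A | B
            merges.append((A, B, merged))
            clusters = [C for C in clusters if C is not A and C is not B] + [merged]
        return merges

Calling single_linkage_tree(S, K) returns the merge sequence, which implicitly defines the laminar family of candidate clusters (the tree).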
12
Strict Separation Property
All x are more similar to all y in their own cluster than to any z in any other cluster.
Theorem: using Single-Linkage, we can construct a tree such that the ground-truth clustering is a pruning of the tree.
Incorporate Approximation Assumptions in Our Model
If we use a c-approximation algorithm for an objective f (e.g., k-median or k-means) in order to minimize error rate, the implicit assumption is:
  • Clusterings within a factor c of optimal are ε-close to the target.
This implies that most points (a 1 - O(ε) fraction) satisfy Strict Separation, so we can still cluster well in the tree model.
13
Stability Property
For all clusters C, C' and all A ⊂ C, A' ⊆ C':
K(A, C \ A) > K(A, A')
(K(A, A') denotes the average attraction, i.e., the average similarity, between A and A'.)
Neither A nor A' is more attracted to the other than to the rest of its own cluster.
Sufficient for hierarchical clustering.
Single linkage fails, but average linkage works: merge the parts whose average similarity is highest.
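For comparison with the single-linkage sketch above, here is a minimal average-linkage sketch in Python; again the representation of K and all names are illustrative assumptions, not code from the paper.

    def average_linkage_tree(points, K):
        """Bottom-up average linkage (sketch): repeatedly merge the two
        current parts whose AVERAGE pairwise similarity K(A, B) is highest.
        Under the stability property, the ground-truth clustering is a
        pruning of the resulting tree."""
        def avg(A, B):
            return sum(K(x, y) for x in A for y in B) / (len(A) * len(B))

        clusters = [frozenset([x]) for x in points]
        merges = []  # record of (left part, right part, merged part)
        while len(clusters) > 1:
            A, B = max(
                ((X, Y) for i, X in enumerate(clusters) for Y in clusters[i + 1:]),
                key=lambda pair: avg(pair[0], pair[1]),
            )
            merged = A | B
            merges.append((A, B, merged))
            clusters = [C for C in clusters if C is not A and C is not B] + [merged]
        return merges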
14
Stability Property
For all clusters C, C' and all A ⊂ C, A' ⊆ C':
K(A, C \ A) > K(A, A')
(K(A, A') denotes the average attraction between A and A'.)
Theorem: using Average Linkage, we can construct a tree such that the ground-truth clustering is a pruning of the tree.
Analysis: all parts stay laminar with respect to the target clustering.
  • Failure occurs iff we merge some P1, P2 such that P1 ⊂ C and P2 ∩ C = ∅.
  • But there must exist P3 ⊂ C with K(P1, P3) ≥ K(P1, C \ P1), and by the stability property K(P1, C \ P1) > K(P1, P2).
  • So Average Linkage would have merged P1 with P3 rather than P2. Contradiction.
15
Stability Property
For all clusters C, C' and all A ⊂ C, A' ⊆ C':
K(A, C \ A) > K(A, A')
(K(A, A') denotes the average attraction between A and A'.)
Average Linkage breaks down if K is not symmetric.
Instead, run a Boruvka-inspired algorithm:
  • Each current cluster Ci points to argmax_{Cj} K(Ci, Cj).
  • Merge directed cycles.
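The following is a minimal sketch of one round of this Boruvka-inspired step for an asymmetric K. The pointer/cycle bookkeeping is an illustrative assumption about how "merge directed cycles" can be implemented, not code taken from the paper.

    def boruvka_style_round(clusters, K):
        """One round (sketch): each current cluster points to the cluster it
        is most attracted to on average; the clusters on each directed cycle
        of this pointer graph are merged into one."""
        def avg(A, B):
            return sum(K(x, y) for x in A for y in B) / (len(A) * len(B))

        n = len(clusters)
        if n < 2:
            return list(clusters)
        # pointer graph: every cluster has out-degree exactly one
        target = [max((j for j in range(n) if j != i),
                      key=lambda j: avg(clusters[i], clusters[j]))
                  for i in range(n)]

        # mark the clusters that lie on a directed cycle
        on_cycle = [False] * n
        for start in range(n):
            walk, cur = [], start
            while cur not in walk:
                walk.append(cur)
                cur = target[cur]
            for j in walk[walk.index(cur):]:
                on_cycle[j] = True

        # merge each cycle into a single cluster; keep the rest unchanged
        done, new_clusters = set(), []
        for i in range(n):
            if i in done:
                continue
            if on_cycle[i]:
                cycle, cur = [], i
                while cur not in cycle:
                    cycle.append(cur)
                    cur = target[cur]
                new_clusters.append(frozenset().union(*(clusters[j] for j in cycle)))
                done.update(cycle)
            else:
                new_clusters.append(frozenset(clusters[i]))
        return new_clusters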

16
Unified Model for Clustering
Question 1: Given a property of the similarity function w.r.t. the ground-truth clustering, what is a good algorithm?
[Diagram: properties P1, ..., Pi, ..., Pn of the similarity function w.r.t. the ground-truth clustering, each mapped to algorithms A1, A2, ..., Am.]
17
Unified Model for Clustering
Question 2: Given the algorithm, what property of the similarity function w.r.t. the ground-truth clustering should the expert aim for?
[Diagram: properties P1, ..., Pi, ..., Pn of the similarity function w.r.t. the ground-truth clustering, each mapped to algorithms A1, A2, ..., Am.]
18
Other Examples of Properties and Algorithms
Average Attraction Property
E_{x' ∈ C(x)}[K(x, x')] > E_{x' ∈ C'}[K(x, x')] + γ   (for all C' ≠ C(x))
Not sufficient for hierarchical clustering.
Can produce a small list of clusterings (sampling-based algorithm; see the sketch after this slide).
Upper bound on the list size: t^{O((t/γ²) log(t/γ))}.  Lower bound: t^{O(1/γ)}.
Stability of Large Subsets Property
For all clusters C, C', for all A ⊆ C, A' ⊆ C' with |A| + |A'| ≥ sn: neither A nor A' is more attracted to the other than to the rest of its own cluster.
Sufficient for hierarchical clustering.
Find the hierarchy using a multi-stage learning-based algorithm.
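As a rough illustration of the sampling-based idea mentioned above, here is a sketch that enumerates labelings of a small random sample and assigns every point to the label it is most attracted to. The sample size, the exact guarantees, and all names below are assumptions made for illustration; they are not taken from the paper.

    from itertools import product
    import random

    def list_clusterings(points, K, t, sample_size):
        """Sampling-based list clustering (sketch).

        For every way of labelling a small random sample with t cluster
        names, assign each point (points must be hashable) to the label
        whose sampled representatives it is most similar to on average,
        and add the resulting clustering to the list.  Under average
        attraction, at least one clustering in such a list should have
        low error (parameters here are illustrative only)."""
        sample = random.sample(points, min(sample_size, len(points)))
        clusterings = []
        for labels in product(range(t), repeat=len(sample)):
            assignment = {}
            for x in points:
                def avg_sim(c):
                    reps = [s for s, l in zip(sample, labels) if l == c]
                    if not reps:
                        return float('-inf')
                    return sum(K(x, s) for s in reps) / len(reps)
                assignment[x] = max(range(t), key=avg_sim)
            clusterings.append(assignment)
        return clusterings

The list size grows as t raised to the sample size, which matches the exponential-in-parameters flavor of the bounds quoted above.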
19
Stability of Large Subsets, Clustering
For all clusters C, C' and all A ⊂ C, A' ⊆ C' with |A| + |A'| ≥ sn:
K(A, C \ A) > K(A, A')
Algorithm
1) Generate a list L of candidate clusters (average attraction algorithm). Ensure that every ground-truth cluster is f-close to some cluster in L.
2) For every pair (C, C') in L such that all three parts (C ∩ C', C \ C', C' \ C) are large:
   if K(C ∩ C', C \ C') ≥ K(C ∩ C', C' \ C), then throw out C'; else throw out C.
3) Clean and hook up the surviving clusters into a tree.
(A sketch of step 2 follows this slide.)
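Here is a minimal Python sketch of step 2, the pairwise pruning of candidate clusters. The size threshold parameter and the restart-after-removal loop are implementation assumptions, not details given on the slides.

    def avg_K(A, B, K):
        """Average pairwise similarity between two disjoint sets of points."""
        return sum(K(a, b) for a in A for b in B) / (len(A) * len(B))

    def prune_candidates(L, K, min_part_size):
        """For every pair of surviving candidate clusters whose intersection
        and both set differences are all large, keep whichever of the two
        the intersection is more attracted to, and throw the other out
        (sketch)."""
        survivors = [set(C) for C in L]
        changed = True
        while changed:
            changed = False
            for i in range(len(survivors)):
                for j in range(i + 1, len(survivors)):
                    C1, C2 = survivors[i], survivors[j]
                    inter, d1, d2 = C1 & C2, C1 - C2, C2 - C1
                    if min(len(inter), len(d1), len(d2)) < min_part_size:
                        continue
                    # the intersection "votes" for the side it prefers
                    drop = j if avg_K(inter, d1, K) >= avg_K(inter, d2, K) else i
                    survivors.pop(drop)
                    changed = True
                    break
                if changed:
                    break
        return survivors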
20
Stability of Large Subsets, Clustering
For all clusters C, C' and all A ⊂ C, A' ⊆ C' with |A| + |A'| ≥ sn:
K(A, C \ A) > K(A, A') + γ
Theorem: if s = O(ε²/k²) and f = O(ε²γ/k²), then the algorithm produces a tree such that the ground truth is ε-close to a pruning of it.
21
The Inductive Setting
Draw a sample S from the instance space X and cluster S (in the list or tree model).
Insert new points as they arrive.
Many of our algorithms extend naturally to this setting.
To get polynomial time for stability of all subsets, we need to argue that sampling preserves stability [AFKK].
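One natural insertion rule for the inductive setting (an illustrative assumption, not necessarily the exact procedure analyzed in the paper) is to attach each arriving point to the cluster it is most attracted to on average:

    def insert_point(x, clusters, K):
        """Assign a newly arriving point x to the existing cluster with the
        highest average similarity to x (sketch); clusters is a list of lists."""
        best = max(clusters, key=lambda C: sum(K(x, y) for y in C) / len(C))
        best.append(x)
        return best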
22
Similarity Functions for Clustering: Summary
Main Conceptual Contributions
  • Natural conditions on K that make it useful for clustering.
  • For a robust theory, relax the objective: hierarchy, list.
  • A general model that parallels PAC, SLT, and learning with kernels and similarity functions in supervised classification.
Technically Most Difficult Aspects
  • Algorithms for stability of large subsets and ν-strict separation.
  • Algorithms and analysis for the inductive setting.

23
(No Transcript)
24
Properties Summary
25
Thm: If ∃ a bad set B of νn points such that S' = S \ B satisfies strict ordering, and if all clusters have size ≥ 5νn, then we can find a tree of error ν.
  • Algorithm sketch:
  • For each x and m, generate the cluster of the m points most similar to x. Delete any cluster of size < 4νn.
  • If two clusters C, C' partially overlap as in the figure:
  • have each y in the intersection choose C or C' based on its median similarity to C \ C' and C' \ C.
26
Thm: If ∃ a bad set B of νn points such that S' = S \ B satisfies strict ordering, and if all clusters have size ≥ 5νn, then we can find a tree of error ν.
  • Algorithm sketch:
  • For each x and m, generate the cluster of the m points most similar to x. Delete any cluster of size < 4νn.
  • If two clusters C, C' overlap with both C \ C' and C' \ C of size < 2νn:
  • have each y in C \ C' choose in or out based on its (νn+1)st most similar point in C ∩ C' versus in S \ (C ∪ C').
27
Thm: If ∃ a bad set B of νn points such that S' = S \ B satisfies strict ordering, and if all clusters have size ≥ 5νn, then we can find a tree of error ν.
  • Algorithm sketch:
  • For each x and m, generate the cluster of the m points most similar to x. Delete any cluster of size < 4νn.
  • If two clusters C, C' overlap with one of C \ C', C' \ C of size < 2νn and the other of size > 2νn:
  • have each y in C \ C' choose in or out based on its (νn+1)st most similar point in C ∩ C' versus in S \ (C ∪ C').
28
Thm: If ∃ a bad set B of νn points such that S' = S \ B satisfies strict ordering, and if all clusters have size ≥ 5νn, then we can find a tree of error ν.
  • Algorithm sketch:
  • For each x and m, generate the cluster of the m points most similar to x. Delete any cluster of size < 4νn.
  • Then argue that this never hurts correct clusters (w.r.t. S') and that each step makes progress.
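A minimal Python sketch of the candidate-generation step from the algorithm above, assuming K is a two-argument similarity function and nu plays the role of ν; the names and data layout are illustrative assumptions.

    def candidate_clusters(points, K, nu):
        """For every point x and every size m, take the m points most similar
        to x as a candidate cluster, and delete any candidate of size < 4*nu*n."""
        n = len(points)
        min_size = 4 * nu * n
        candidates = []
        for x in points:
            ranked = sorted(points, key=lambda y: K(x, y), reverse=True)
            for m in range(1, n + 1):
                if m >= min_size:
                    candidates.append(set(ranked[:m]))
        return candidates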