Correlation Clustering

Slides: 41

Provided by: nikhil5

Learn more at: http://www.cs.cmu.edu

Transcript and Presenter's Notes

Title: Correlation Clustering

1
Correlation Clustering

Nikhil Bansal
Joint Work with Avrim Blum and
Shuchi Chawla

2
Clustering

Say we want to cluster n objects of some kind
(documents, images, text strings),
but we don't have a meaningful way to project them
into Euclidean space.
Idea of Cohen, McCallum, Richman: use past data
to train up f(x,y) = same/different.
Then run f on all pairs and try to find the most
consistent clustering.

3
The problem
Harry B.
Harry Bovik
H. Bovik
Tom X.
4
The problem
Harry B.
Harry Bovik
H. Bovik
Tom X.
Edge labels: + = same, - = different
Train up f(x) = same/different
Run f on all pairs
5
The problem
Harry B.
Harry Bovik
H. Bovik
Tom X.
Edge labels: + = same, - = different

Totally consistent:
+ edges inside clusters
- edges outside clusters

7
The problem
Harry B.
Harry Bovik
H. Bovik
Tom X.
Edge labels: + = same, - = different
Disagreement
Train up f(x) = same/different
Run f on all pairs
Find most consistent clustering
9
The problem
Harry B.
Harry Bovik
H. Bovik
Tom X.
Edge labels: + = same, - = different
Disagreement

Problem: Given a complete graph on n vertices.
Each edge labeled + or -.
Goal is to find a partition of the vertices as
consistent as possible with the edge labels.
Max #(agreements) or Min #(disagreements)
There is no k: the # of clusters could be
anything
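
The objective above can be sketched in a few lines. This is only an illustration: the function name count_disagreements and the edge labels assigned to the four names are assumptions made for the example, not data from the talk.

```python
from itertools import combinations


def count_disagreements(labels, cluster_of):
    """Count edges whose label contradicts the clustering.

    labels: dict mapping frozenset({u, v}) -> '+' or '-' for every pair.
    cluster_of: dict mapping each vertex to its cluster id.
    """
    mistakes = 0
    for u, v in combinations(sorted(cluster_of), 2):
        together = cluster_of[u] == cluster_of[v]
        positive = labels[frozenset((u, v))] == "+"
        if together != positive:  # '+' edge cut, or '-' edge kept inside
            mistakes += 1
    return mistakes


# Hypothetical labels for the four names on the slides (not from the talk).
names = ["Harry B.", "Harry Bovik", "H. Bovik", "Tom X."]
plus_pairs = {("Harry B.", "Harry Bovik"), ("Harry Bovik", "H. Bovik")}
labels = {
    frozenset(p): "+" if p in plus_pairs else "-"
    for p in combinations(names, 2)
}

harrys_together = {"Harry B.": 0, "Harry Bovik": 0, "H. Bovik": 0, "Tom X.": 1}
all_apart = {name: i for i, name in enumerate(names)}
print(count_disagreements(labels, harrys_together))
print(count_disagreements(labels, all_apart))
```

Note that the number of clusters is never passed in: a clustering is just any assignment of cluster ids, which matches "there is no k".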

10
The Problem

Noise Removal:
There is some true clustering. However, some edges
are incorrect. Still want to do well.
Agnostic Learning:
No inherent clustering.
Try to find the best representation using a
hypothesis with limited representation power.
E.g., research communities via collaboration graph

11
Our results

Constant-factor approx for minimizing
disagreements.
PTAS for maximizing agreements.
Results for random noise case.

12
PTAS for maximizing agreements

Easy to get ½ of the edges.
Goal: additive approximation of εn².
Standard approach:
Draw small sample,
Guess partition of sample,
Compute partition of remainder.
Can do directly, or plug into the General Property
Tester of GGR.
Running time doubly exponential in 1/ε, or singly with
bad exponent.
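
The "easy ½" baseline can be sketched as follows: if at least half of the edges are +, put everything in one cluster; otherwise make every vertex a singleton. The function names and the toy labeling are assumptions for illustration, and this is only the baseline, not the sampling-based PTAS.

```python
from itertools import combinations


def agreements(labels, cluster_of):
    """Number of edges whose label matches the clustering."""
    return sum(
        1
        for pair, sign in labels.items()
        if (len({cluster_of[v] for v in pair}) == 1) == (sign == "+")
    )


def trivial_clustering(vertices, labels):
    """One big cluster if '+' edges are the majority, else all singletons."""
    plus = sum(1 for sign in labels.values() if sign == "+")
    if 2 * plus >= len(labels):
        return {v: 0 for v in vertices}  # agrees with every '+' edge
    return {v: i for i, v in enumerate(vertices)}  # agrees with every '-' edge


# Toy labeling (an assumption for the example): '+' when u + v is divisible by 3.
vertices = list(range(6))
labels = {
    frozenset((u, v)): "+" if (u + v) % 3 == 0 else "-"
    for u, v in combinations(vertices, 2)
}
clustering = trivial_clustering(vertices, labels)
print(agreements(labels, clustering), "of", len(labels), "edges agree")
```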

13
Minimizing Disagreements

Goal: Get a constant-factor approximation.
Problem: Even if we can find a cluster that's as
good as the best cluster of OPT, we're headed toward
an O(log n) Set-Cover-like analysis.

Need a way of lower-bounding D_opt.
14
Lower bounding idea: bad triangles

Consider a bad triangle:
[Figure: a triangle with two + edges and one - edge]

We know any clustering has to disagree with at
least one of these edges.
15
Lower bounding idea: bad triangles

If several such triangles are disjoint, then there is a
mistake on each one.

[Figure: graph on vertices 1, 2, 3, 4, 5]
2 edge-disjoint bad triangles: (1,2,3), (3,4,5)

D_opt ≥ #(edge-disjoint bad triangles)
(Not tight)
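
A minimal sketch of this lower bound: greedily pack edge-disjoint bad triangles (two + edges, one - edge) and count them. The greedy order and the edge labels on the five vertices are assumptions for the example. Any edge-disjoint packing gives a valid lower bound on D_opt, though greedy need not find the largest one.

```python
from itertools import combinations


def is_bad_triangle(labels, a, b, c):
    """A bad triangle has exactly two '+' edges and one '-' edge."""
    signs = [labels[frozenset(p)] for p in ((a, b), (b, c), (a, c))]
    return signs.count("+") == 2


def greedy_edge_disjoint_bad_triangles(vertices, labels):
    """Greedily collect bad triangles that share no edge."""
    used_edges = set()
    packed = []
    for a, b, c in combinations(vertices, 3):
        edges = [frozenset(p) for p in ((a, b), (b, c), (a, c))]
        if any(e in used_edges for e in edges):
            continue
        if is_bad_triangle(labels, a, b, c):
            packed.append((a, b, c))
            used_edges.update(edges)
    return packed


# Hypothetical labels that make (1,2,3) and (3,4,5) bad triangles.
vertices = [1, 2, 3, 4, 5]
plus_edges = {(1, 2), (2, 3), (3, 4), (4, 5)}
labels = {
    frozenset(p): "+" if p in plus_edges else "-"
    for p in combinations(vertices, 2)
}
triangles = greedy_edge_disjoint_bad_triangles(vertices, labels)
print(triangles, "so D_opt >=", len(triangles))
```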
16
Lower bounding idea: bad triangles

If several such, then there is a mistake on each one.

[Figure: graph on vertices 1, 2, 3, 4, 5]
Edge-disjoint bad triangles: (1,2,3), (3,4,5)

D_opt ≥ #(edge-disjoint bad triangles)
How can we use this?
17
δ-clean Clusters

Given a clustering, vertex v is δ-good if it has few
disagreements:

|N-(v) within C| < δ|C| and |N+(v) outside C| < δ|C|,
where C is the cluster containing v
(N+(v) = similar neighbors, N-(v) = dissimilar neighbors)
Essentially, N+(v) ≈ C(v)
Cluster C is δ-clean if all v ∈ C are δ-good
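
The definition can be checked directly. The sketch below assumes the garbled condition on the slide reads "fewer than δ|C| dissimilar neighbors inside C and fewer than δ|C| similar neighbors outside C". The function names and the six-vertex example are hypothetical.

```python
from fractions import Fraction
from itertools import combinations


def is_delta_good(v, cluster, vertices, labels, delta):
    """v is delta-good w.r.t. cluster under the assumed reading of the slide."""
    neg_inside = sum(
        1 for u in cluster if u != v and labels[frozenset((u, v))] == "-"
    )
    pos_outside = sum(
        1 for u in vertices if u not in cluster and labels[frozenset((u, v))] == "+"
    )
    bound = delta * len(cluster)
    return neg_inside < bound and pos_outside < bound


def is_delta_clean(cluster, vertices, labels, delta):
    """A cluster is delta-clean if every vertex in it is delta-good."""
    return all(is_delta_good(v, cluster, vertices, labels, delta) for v in cluster)


# Hypothetical example: cluster {0..4} with one '-' edge inside; vertex 5 is
# outside and dissimilar to everyone.
vertices = list(range(6))
cluster = {0, 1, 2, 3, 4}
labels = {
    frozenset(p): "-" if (p == (0, 1) or 5 in p) else "+"
    for p in combinations(vertices, 2)
}
print(is_delta_clean(cluster, vertices, labels, Fraction(1, 4)))
print(is_delta_clean(cluster, vertices, labels, Fraction(1, 5)))
```

Exact fractions are used for δ so that the strict inequality is not affected by floating-point rounding.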
18
Observation

Any δ-clean clustering is an 8-approximation for δ < 1/4

Idea: Charging mistakes to bad triangles
Intuitively, there are enough choices of w for each wrong
edge (u,v)
[Figure: a wrong - edge (u,v) inside a cluster, with a third vertex w]
19
Observation

Any δ-clean clustering is an 8-approximation for δ < 1/4

Idea: Charging mistakes to bad triangles
Intuitively, there are enough choices of w for each wrong
edge (u,v). Can find edge-disjoint bad triangles.
[Figure: wrong - edges (u,v) inside a cluster, with third vertices w]
Similar argument for +ve edges between 2 clusters
20
General Structure of Argument

Any δ-clean clustering is an 8-approximation for δ < 1/4
Consequence: Just need to produce a
δ-clean clustering for δ < 1/4

21
General Structure of Argument

Any δ-clean clustering is an 8-approximation for δ < 1/4
Consequence: Just need to produce a
δ-clean clustering for δ < 1/4

Bad News: May not be possible!
22
General Structure of Argument

Any δ-clean clustering is an 8-approximation for δ < 1/4
Consequence: Just need to produce a
δ-clean clustering for δ < 1/4
Bad News: May not be possible!
Approach:
1) Clustering of a special form, Opt(δ):
Clusters are either δ-clean or singletons
Not many more mistakes than Opt
2) Produce something close to Opt(δ)
23
Existence of OPT(δ)
Opt
24
Existence of OPT(δ)

Identify δ/3-bad vertices

[Figure: clusters C1 and C2 of Opt, with the δ/3-bad vertices marked]
25
Existence of OPT(δ)

Move δ/3-bad vertices out
If many (more than a δ/3 fraction) are δ/3-bad, split

[Figure: clusters C1 and C2 of Opt turned into OPT(δ); one cluster is split]
OPT(δ): singletons and δ-clean clusters
D_OPT(δ) = O(1) · D_Opt
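
The clean-up step on this slide can be sketched as below. Several things here are assumptions rather than statements from the talk: "many" is read as "more than a δ/3 fraction of the cluster", "split" is read as dissolving the cluster into singletons, goodness uses fewer than δ|C| dissimilar neighbors inside and fewer than δ|C| similar neighbors outside, and δ = 1/2 is chosen only to keep the example small.

```python
from fractions import Fraction
from itertools import combinations


def is_good(v, cluster, vertices, labels, delta):
    """Assumed reading of delta-good: few '-' inside, few '+' outside."""
    neg_in = sum(1 for u in cluster if u != v and labels[frozenset((u, v))] == "-")
    pos_out = sum(
        1 for u in vertices if u not in cluster and labels[frozenset((u, v))] == "+"
    )
    return neg_in < delta * len(cluster) and pos_out < delta * len(cluster)


def clean_up(clusters, vertices, labels, delta):
    """Move delta/3-bad vertices out; dissolve clusters with too many of them."""
    third = delta / 3
    result = []
    for cluster in clusters:
        bad = {v for v in cluster if not is_good(v, cluster, vertices, labels, third)}
        if len(bad) > third * len(cluster):
            result.extend({v} for v in cluster)  # split into singletons
        else:
            result.extend({v} for v in bad)  # bad vertices become singletons
            if cluster - bad:
                result.append(cluster - bad)
    return result


# Hypothetical instance: two blocks of 12; vertex 11 is similar to both blocks.
vertices = list(range(24))
labels = {}
for u, v in combinations(vertices, 2):
    same_block = (u < 12) == (v < 12)
    stray = u == 11 and v >= 12
    labels[frozenset((u, v))] = "+" if same_block or stray else "-"

start = [set(range(12)), set(range(12, 24))]
cleaned = clean_up(start, vertices, labels, Fraction(1, 2))
print(sorted(sorted(c) for c in cleaned))
```

In this instance only vertex 11 is bad, so it is moved out as a singleton and both blocks otherwise survive.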
26
Main Result
Opt(δ)
δ-clean
27
Main Result

Guarantee:
Non-singletons: 11δ-clean
Singletons: subset of Opt(δ)

[Figure: Opt(δ), δ-clean; Algorithm, 11δ-clean]
28
Main Result

Guarantee:
Non-singletons: 11δ-clean
Singletons: subset of Opt(δ)

Choose δ = 1/44: non-singletons are ¼-clean
Bound mistakes among non-singletons by bad triangles
Mistakes involving singletons: by those of Opt(δ)

[Figure: Opt(δ), δ-clean; Algorithm, 11δ-clean]
29
Main Result

Guarantee:
Non-singletons: 11δ-clean
Singletons: subset of Opt(δ)

Choose δ = 1/44: non-singletons are ¼-clean
Bound mistakes among non-singletons by bad triangles
Mistakes involving singletons: by those of Opt(δ)
Approx. ratio = 9/δ² + 8

[Figure: Opt(δ), δ-clean; Algorithm, 11δ-clean]
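
The arithmetic behind the choice of δ can be checked with exact fractions. This assumes the ratio on the slide reads 9/δ² + 8, which is a reconstruction of garbled text, so the resulting constant is illustrative only.

```python
from fractions import Fraction

delta = Fraction(1, 44)

# Non-singleton clusters are 11*delta-clean; with delta = 1/44 that is 1/4-clean.
cleanliness = 11 * delta

# Assumed reading of the garbled ratio: 9 / delta^2 + 8.
ratio = 9 / delta**2 + 8

print(cleanliness, ratio)
```

Under this reading the constant is very large, which is why the first open problem on the next slide asks for a small one.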
30
Open Problems

How about a small constant?
Is D_opt ≤ 2 · #(edge-disjoint bad triangles)?
Is D_opt ≤ 2 · #(fractional edge-disjoint bad
triangles)?
Extend to {-1, 0, +1} weights: approximate to a constant or
even a log factor?
Extend to the weighted case?
Clique Partitioning Problem
Cutting plane algorithms [Grötschel et al.]

33
The problem
Harry B.
Harry Bovik
H. Bovik
Tom X.
Edge labels: + = same, - = different
Disagreement
Train up f(x) = same/different
Run f on all pairs
Find most consistent clustering
34
Nice features of formulation