Title: Correlation Clustering
1Correlation Clustering
- Nikhil Bansal
- Joint work with Avrim Blum and Shuchi Chawla
2Clustering
- Say we want to cluster n objects of some kind (documents, images, text strings).
- But we don't have a meaningful way to project them into Euclidean space.
- Idea of Cohen, McCallum, Richman: use past data to train up f(x,y) = same/different.
- Then run f on all pairs and try to find the most consistent clustering.
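The "run f on all pairs" step can be sketched as follows. This is only an illustration: `toy_f` is a hypothetical stand-in for the trained classifier (it is not from the talk), and `label_all_pairs` is an assumed helper name.

```python
from itertools import combinations

def toy_f(x, y):
    """Hypothetical stand-in for the trained classifier f(x, y).

    Says "same" when the last tokens of both strings start with the
    same letter. A real f would be learned from past data.
    """
    return x.split()[-1][0] == y.split()[-1][0]

def label_all_pairs(items, f):
    """Run f on every unordered pair and record a '+' or '-' label."""
    return {
        frozenset((x, y)): "+" if f(x, y) else "-"
        for x, y in combinations(items, 2)
    }

names = ["Harry B.", "Harry Bovik", "H. Bovik", "Tom X."]
labels = label_all_pairs(names, toy_f)
```

The result is a complete graph on the objects with one label per edge, which is the input to the clustering problem.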
3The problem
Harry B.
Harry Bovik
H. Bovik
Tom X.
4The problem
Harry B.
Harry Bovik
+: Same, -: Different
H. Bovik
Tom X.
Train up f(x,y) = same/different
Run f on all pairs
5The problem
Harry B.
Harry Bovik
+: Same, -: Different
H. Bovik
Tom X.
- Totally consistent: + edges inside clusters, - edges outside clusters
7The problem
Harry B.
Harry Bovik
+: Same, -: Different
Disagreement
H. Bovik
Tom X.
Train up f(x,y) = same/different
Run f on all pairs
Find most consistent clustering
9The problem
Harry B.
Harry Bovik
+: Same, -: Different
Disagreement
H. Bovik
Tom X.
- Problem: Given a complete graph on n vertices.
- Each edge is labeled + or -.
- Goal is to find a partition of the vertices as consistent as possible with the edge labels.
- Max #(agreements) or Min #(disagreements).
- There is no k: the # of clusters could be anything.
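A minimal sketch of the objective, assuming the + edges are given as a set of unordered pairs and every other pair is a - edge. The edge labels in the example are hypothetical, chosen only to produce one disagreement.

```python
from itertools import combinations

def count_disagreements(clusters, plus_edges):
    """Count edges whose label conflicts with the clustering.

    clusters   : list of disjoint sets covering all vertices
    plus_edges : set of frozenset({u, v}) pairs labeled +; all other pairs are -
    """
    cluster_of = {v: i for i, c in enumerate(clusters) for v in c}
    mistakes = 0
    for u, v in combinations(list(cluster_of), 2):
        together = cluster_of[u] == cluster_of[v]
        is_plus = frozenset((u, v)) in plus_edges
        if together != is_plus:  # + edge cut, or - edge kept inside a cluster
            mistakes += 1
    return mistakes

# Hypothetical labels for the running example.
plus_edges = {
    frozenset(("Harry B.", "Harry Bovik")),
    frozenset(("Harry Bovik", "H. Bovik")),
}
one_harry = [{"Harry B.", "Harry Bovik", "H. Bovik"}, {"Tom X."}]
all_apart = [{"Harry B."}, {"Harry Bovik"}, {"H. Bovik"}, {"Tom X."}]
```

Since the graph is complete, #(agreements) = n(n-1)/2 - #(disagreements), so the two objectives have the same optimal clusterings even though they differ for approximation.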
10The Problem
- Noise Removal:
- There is some true clustering. However, some edges are incorrect. Still want to do well.
- Agnostic Learning:
- No inherent clustering.
- Try to find the best representation using a hypothesis with limited representation power.
- E.g., research communities via the collaboration graph.
11Our results
- Constant-factor approx. for minimizing disagreements.
- PTAS for maximizing agreements.
- Results for random noise case.
12PTAS for maximizing agreements
- Easy to get ½ of the edges.
- Goal: additive approx. of εn².
- Standard approach:
- Draw a small sample,
- Guess a partition of the sample,
- Compute a partition of the remainder.
- Can do directly, or plug into the General Property Tester of GGR.
- Running time doubly exponential in 1/ε, or singly with a bad exponent.
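The "easy to get ½ of the edges" remark can be illustrated with the standard observation that either one big cluster or all singletons agrees with at least half of the edges. The sketch below assumes the same edge representation as a set of + pairs; the function names are illustrative, not from the talk.

```python
from itertools import combinations

def trivial_half_clustering(vertices, plus_edges):
    """Return a clustering that agrees with at least half of all edges.

    If at least half of the edges are +, one big cluster agrees with all
    of them; otherwise all singletons agree with every - edge.
    """
    total_edges = len(vertices) * (len(vertices) - 1) // 2
    if 2 * len(plus_edges) >= total_edges:
        return [set(vertices)]
    return [{v} for v in vertices]

def count_agreements(clusters, vertices, plus_edges):
    """Count edges whose label matches the clustering."""
    cluster_of = {v: i for i, c in enumerate(clusters) for v in c}
    return sum(
        (cluster_of[u] == cluster_of[v]) == (frozenset((u, v)) in plus_edges)
        for u, v in combinations(vertices, 2)
    )
```

This baseline is why the PTAS targets an additive εn² error: OPT for agreements is always Ω(n²).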
13Minimizing Disagreements
- Goal: Get a constant-factor approx.
- Problem: Even if we can find a cluster that's as good as the best cluster of OPT, we're headed toward an O(log n) Set-Cover-like analysis.
- Need a way of lower-bounding D_OPT.
14Lower bounding idea: bad triangles
(Figure: a bad triangle, with two + edges and one - edge.)
We know any clustering has to disagree with at least one of these edges.
15Lower bounding idea: bad triangles
- If several such triangles are edge-disjoint, then there is a mistake on each one.
(Figure: vertices 1 to 5 with 2 edge-disjoint bad triangles, (1,2,3) and (3,4,5).)
D_OPT ≥ #(edge-disjoint bad triangles)
(Not tight)
16Lower bounding idea: bad triangles
- If several such, then there is a mistake on each one.
(Figure: vertices 1 to 5 with edge-disjoint bad triangles (1,2,3) and (3,4,5).)
D_OPT ≥ #(edge-disjoint bad triangles)
How can we use this?
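As a rough illustration of the lower bound, the sketch below greedily packs edge-disjoint bad triangles (two + edges, one - edge). The greedy packing is an assumption of this sketch: it is maximal but not necessarily maximum, so its size is still a valid lower bound on the number of disagreements. The edge labels in the example are hypothetical, picked so that (1,2,3) and (3,4,5) are bad.

```python
from itertools import combinations

def edge_disjoint_bad_triangles(vertices, plus_edges):
    """Greedily collect bad triangles that share no edge.

    plus_edges : set of frozenset({u, v}) pairs labeled +; all other pairs are -
    """
    used_edges = set()
    packing = []
    for tri in combinations(vertices, 3):
        edges = [frozenset(pair) for pair in combinations(tri, 2)]
        num_plus = sum(e in plus_edges for e in edges)
        # Bad triangle: exactly two + edges and one - edge.
        if num_plus == 2 and not any(e in used_edges for e in edges):
            packing.append(tri)
            used_edges.update(edges)
    return packing

# Hypothetical labeling on vertices 1..5.
plus_edges = {frozenset(p) for p in [(1, 2), (2, 3), (3, 4), (4, 5)]}
packing = edge_disjoint_bad_triangles([1, 2, 3, 4, 5], plus_edges)
```

Here the packing has two triangles, so any clustering of this labeled graph makes at least 2 mistakes.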
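17δ-clean Clusters
- Given a clustering, a vertex v in cluster C is δ-good if it has few disagreements:
- |N⁻(v) ∩ C| < δ|C| (- neighbors within C)
- |N⁺(v) \ C| < δ|C| (+ neighbors outside C)
(Figure: a cluster C containing a δ-good vertex v. +: Similar, -: Dissimilar)
Essentially, N⁺(v) ≈ C(v).
- Cluster C is δ-clean if all v ∈ C are δ-good.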
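The two definitions translate directly into code. This sketch assumes strict inequalities as written on the slide, and assumes `plus` maps each vertex to its set of + neighbors with the vertex itself excluded; both are conventions of this example.

```python
def is_delta_good(v, cluster, plus, delta):
    """Check the two 'few disagreements' conditions for v with respect to cluster.

    plus : dict mapping each vertex to the set of its + neighbours
           (v itself excluded); every other pair is a - edge.
    """
    minus_inside = len(cluster - {v} - plus[v])  # - neighbours within C
    plus_outside = len(plus[v] - cluster)        # + neighbours outside C
    bound = delta * len(cluster)
    return minus_inside < bound and plus_outside < bound

def is_delta_clean(cluster, plus, delta):
    """A cluster is delta-clean if every one of its vertices is delta-good."""
    return all(is_delta_good(v, cluster, plus, delta) for v in cluster)
```

Note that the bound scales with |C|, so in small clusters a single wrong edge can already make a vertex bad for small δ.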
19Observation
- Any δ-clean clustering is an 8-approx. for δ < 1/4.
Idea: charging mistakes to bad triangles.
Intuitively, there are enough choices of w for each wrong edge (u,v), so we can find edge-disjoint bad triangles.
(Figure: a wrong - edge (u,v) inside a cluster and a third vertex w.)
Similar argument for +ve edges between 2 clusters.
22General Structure of Argument
- Any δ-clean clustering is an 8-approx. for δ < 1/4.
- Consequence: just need to produce a δ-clean clustering for δ < 1/4.
- Bad news: may not be possible!
- Approach:
- 1) A clustering of a special form, Opt(δ):
- Clusters are either δ-clean or singletons.
- Not many more mistakes than Opt.
- 2) Produce something close to Opt(δ).
23Existence of OPT(δ)
Opt
24Existence of OPT(δ)
- Identify δ/3-bad vertices.
(Figure: clusters C1 and C2 of Opt, with the δ/3-bad vertices marked.)
25Existence of OPT(δ)
- Move δ/3-bad vertices out.
- If many (at least a δ/3 fraction) are δ/3-bad, split.
(Figure: clusters C1 and C2 of Opt; labels "OPT(δ)" and "Split".)
Opt(δ): singletons and δ-clean clusters.
D_OPT(δ) = O(1) · D_OPT
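The construction above can be sketched as code, under two loudly stated assumptions that the slide does not spell out: "split" is taken to mean dissolving the whole cluster into singletons, and "many" is taken to mean strictly more than a δ/3 fraction of the cluster. The function names are illustrative. Opt itself is not computable efficiently, so this only mirrors the existence argument.

```python
def is_good(v, cluster, plus, delta):
    """delta-good test: few - neighbours inside, few + neighbours outside."""
    minus_inside = len(cluster - {v} - plus[v])
    plus_outside = len(plus[v] - cluster)
    bound = delta * len(cluster)
    return minus_inside < bound and plus_outside < bound

def build_opt_delta(opt_clusters, plus, delta):
    """Turn a clustering into singletons plus cleaned-up clusters.

    plus : dict mapping each vertex to the set of its + neighbours.
    """
    result = []
    for cluster in opt_clusters:
        bad = {v for v in cluster if not is_good(v, cluster, plus, delta / 3)}
        if len(bad) > (delta / 3) * len(cluster):
            # Assumption: "split" dissolves the cluster into singletons.
            result.extend({v} for v in cluster)
        else:
            # Move the few bad vertices out as singletons.
            kept = cluster - bad
            if kept:
                result.append(kept)
            result.extend({v} for v in bad)
    return result
```

Every vertex removed or split off is one that already carried many disagreements in the original clustering, which is the intuition behind the O(1) blow-up in mistakes.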
26Main Result
Opt(δ)
δ-clean
29Main Result
- Guarantee:
- Non-singletons are 11δ-clean.
- Singletons are a subset of the singletons of Opt(δ).
(Figure: Opt(δ), δ-clean → Algorithm → 11δ-clean.)
Choose δ = 1/44, so non-singletons are ¼-clean.
Bound mistakes among non-singletons by bad triangles.
Bound mistakes involving singletons by those of Opt(δ).
Approx. ratio: 9/δ² + 8
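To see what constant this gives, one can plug δ = 1/44 into the stated ratio 9/δ² + 8. The snippet is plain arithmetic with exact fractions; the variable names are illustrative.

```python
from fractions import Fraction

delta = Fraction(1, 44)

# Non-singleton clusters are 11*delta-clean, which is exactly 1/4-clean.
cleanliness = 11 * delta

# Approximation ratio from the slide: 9 / delta^2 + 8.
ratio = 9 / delta**2 + 8

print(cleanliness)  # 1/4
print(ratio)        # 17432
```

The constant is large, which motivates the first open problem below.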
30Open Problems
- How about a small constant?
- Is D_OPT ≤ 2 · #(edge-disjoint bad triangles)?
- Is D_OPT ≤ 2 · #(fractional edge-disjoint bad triangles)?
- Extend to {-1, 0, +1} weights: approx. to a constant or even a log factor?
- Extend to the weighted case?
- Clique Partitioning Problem
- Cutting plane algorithms [Grötschel et al.]
34Nice features of formulation
- There's no k. (OPT can have anywhere from 1 to n clusters.)
- If a perfect solution exists, then it's easy to find: C(v) = N⁺(v).
- Easy to get agreement on ½ of the edges.
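The perfect-solution case can be sketched as follows: take each unassigned vertex together with its + neighbors as a cluster, then verify that every vertex's + neighborhood matches its cluster. The sketch assumes `plus` maps each vertex to its + neighbors with the vertex itself excluded, so the cluster is {v} ∪ N⁺(v); the function name is illustrative.

```python
def perfect_clustering(vertices, plus):
    """Return a zero-disagreement clustering, or None if none exists.

    plus : dict mapping each vertex to the set of its + neighbours
           (v itself excluded); every other pair is a - edge.
    """
    clusters = []
    assigned = set()
    for v in vertices:
        if v in assigned:
            continue
        cluster = {v} | plus[v]  # C(v) = v together with its + neighbours
        clusters.append(cluster)
        assigned |= cluster
    # A clustering is perfect exactly when each vertex's + neighbourhood
    # (plus itself) equals its own cluster.
    for cluster in clusters:
        for u in cluster:
            if ({u} | plus[u]) != cluster:
                return None
    return clusters
```

If the + edges are not transitive (say a+b, b+c, but a-c), the check fails and the function returns None, which is exactly the bad-triangle situation.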
35Algorithm
- Pick a vertex v. Let C(v) = N⁺(v).
- Modify C(v):
- Remove 3δ-bad vertices from C(v).
- Add 7δ-good vertices into C(v).
- Delete C(v). Repeat until done, or until the above always makes empty clusters.
- Output the nodes left as singletons.
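The steps above can be sketched in Python. Several details are assumptions of this sketch rather than statements from the slides: bad vertices are removed one at a time with goodness re-checked after each removal, neighborhoods are restricted to vertices not yet clustered, C(v) includes v itself, and vertices are tried in input order. The function names are illustrative.

```python
def is_good(v, cluster, plus, delta, alive):
    """delta-good test, restricted to vertices that are still unclustered."""
    nbrs = plus[v] & alive
    minus_inside = len(cluster - {v} - nbrs)  # - neighbours inside the cluster
    plus_outside = len(nbrs - cluster)        # + neighbours outside the cluster
    bound = delta * len(cluster)
    return minus_inside < bound and plus_outside < bound

def cluster_by_neighborhoods(vertices, plus, delta):
    """plus: dict mapping each vertex to the set of its + neighbours."""
    alive = set(vertices)
    clusters = []
    made_progress = True
    while alive and made_progress:
        made_progress = False
        for v in [u for u in vertices if u in alive]:
            cluster = ({v} | plus[v]) & alive
            # Removal phase: drop 3*delta-bad vertices one at a time.
            while True:
                bad = [x for x in cluster
                       if not is_good(x, cluster, plus, 3 * delta, alive)]
                if not bad:
                    break
                cluster.remove(bad[0])
            # Addition phase: add every 7*delta-good vertex.
            if cluster:
                cluster |= {y for y in alive
                            if is_good(y, cluster, plus, 7 * delta, alive)}
                clusters.append(cluster)
                alive -= cluster
                made_progress = True
                break
    # Whatever is left becomes singletons.
    clusters.extend({v} for v in vertices if v in alive)
    return clusters
```

With δ = 1/44 the thresholds 3δ|C| and 7δ|C| are below 1 for small clusters, so tolerance to noisy edges only appears once clusters have a few dozen vertices.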
36Step 1
Choose v, C = + neighbors of v
(Figure: clusters C1 and C2.)
37Step 2
Vertex Removal Phase: if x is 3δ-bad, C = C - {x}
(Figure: clusters C1, C2; vertex v; cluster C.)
- No vertex in C1 is removed.
- All vertices in C2 are removed.
38Step 3
Vertex Addition Phase: add 7δ-good vertices to C
(Figure: clusters C1, C2; vertex v; cluster C.)
- All remaining vertices in C1 will be added.
- None in C2 are added.
- Cluster C is 11δ-clean.
39Case 2: v Singleton in OPT(δ)
Choose v, C = + neighbors of v
(Figure: vertex v and cluster C.)
Same idea works.
40The problem