Correlation Clustering - PowerPoint PPT Presentation

Title: Correlation Clustering
Slides: 41
Provided by: nikhil5
Learn more at: http://www.cs.cmu.edu

Transcript and Presenter's Notes
1
Correlation Clustering
  • Nikhil Bansal
  • Joint work with Avrim Blum and Shuchi Chawla

2
Clustering
  • Say we want to cluster n objects of some kind
    (documents, images, text strings).
  • But we don't have a meaningful way to project
    them into Euclidean space.
  • Idea of Cohen, McCallum, and Richman: use past data
    to train up f(x,y) = same/different.
  • Then run f on all pairs and try to find the most
    consistent clustering.
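The pipeline on this slide can be sketched in a few lines. Here `f` is a hypothetical stand-in for the trained same/different classifier (a first-letter match on the example names), not the actual learned function:

```python
from itertools import combinations

def build_label_graph(items, f):
    """Run f on all pairs; label each edge '+' (same) or '-' (different)."""
    labels = {}
    for x, y in combinations(items, 2):
        labels[(x, y)] = '+' if f(x, y) else '-'
    return labels

# Toy stand-in for a trained classifier: "same" if first letters match.
names = ["Harry B.", "Harry Bovik", "H. Bovik", "Tom X."]
labels = build_label_graph(names, lambda x, y: x[0] == y[0])
```

The resulting dict of signed pairs is the complete labeled graph that the rest of the talk clusters.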

3
The problem
Harry B.
Harry Bovik
H. Bovik
Tom X.
4
The problem
Harry B.

Harry Bovik
+ Same   - Different


H. Bovik
Tom X.
Train up f(x,y) = same/different
Run f on all pairs
5
The problem
Harry B.

Harry Bovik
+ Same   - Different


H. Bovik
Tom X.
  • Totally consistent:
  • + edges inside clusters
  • - edges outside clusters

6
The problem
Harry B.

Harry Bovik
+ Same   - Different

H. Bovik
Tom X.
Train up f(x,y) = same/different
Run f on all pairs
7
The problem
Harry B.

Harry Bovik
+ Same   - Different

Disagreement
H. Bovik
Tom X.
Train up f(x,y) = same/different
Run f on all pairs
Find most consistent clustering
8
The problem
Harry B.

Harry Bovik
+ Same   - Different

Disagreement
H. Bovik
Tom X.
Train up f(x,y) = same/different
Run f on all pairs
Find most consistent clustering
9
The problem
Harry B.

Harry Bovik
+ Same   - Different

Disagreement
H. Bovik
Tom X.
  • Problem: Given a complete graph on n vertices,
    each edge labeled + or -.
  • Goal: find a partition of the vertices as
    consistent as possible with the edge labels.
  • Max(agreements) or Min(disagreements).
  • There is no k: the number of clusters could be
    anything.
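The objective can be made concrete with a small counter; the dict-of-signed-pairs encoding is an assumption for illustration:

```python
def disagreements(labels, cluster_of):
    """Count edges inconsistent with the clustering:
    '-' edges inside a cluster plus '+' edges across clusters."""
    bad = 0
    for (x, y), lab in labels.items():
        same = cluster_of[x] == cluster_of[y]
        if (lab == '+') != same:
            bad += 1
    return bad

# Toy instance: one bad triangle (1,2)+ (1,3)+ (2,3)-.
labels = {(1, 2): '+', (1, 3): '+', (2, 3): '-'}
one_cluster = {1: 'A', 2: 'A', 3: 'A'}  # disagrees only on the '-' edge
```

Minimizing this count over all partitions (with no fixed k) is exactly the problem on the slide.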

10
The Problem
  • Noise Removal:
  • There is some true clustering, but some edges are
    incorrect. Still want to do well.
  • Agnostic Learning:
  • No inherent clustering.
  • Try to find the best representation using a
    hypothesis with limited representational power.
  • E.g., research communities via a collaboration graph.

11
Our results
  • Constant-factor approx for minimizing
    disagreements.
  • PTAS for maximizing agreements.
  • Results for random noise case.

12
PTAS for maximizing agreements
  • Easy to get ½ of the edges.
  • Goal: an additive approximation of εn².
  • Standard approach:
  • Draw a small sample,
  • Guess a partition of the sample,
  • Compute the partition of the remainder.
  • Can do this directly, or plug into the General Property
    Tester of GGR.
  • Running time doubly exponential in ε, or singly with a
    bad exponent.
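The "draw sample / guess partition / place the rest" idea can be rendered as a toy brute force. This is only an illustration of the shape of the approach, not the actual PTAS with its sample-size and error analysis; all names are illustrative:

```python
from itertools import combinations, product

def sample_based_cluster(vertices, labels, sample, k):
    """Try every partition of the sample into <= k groups; place each
    remaining vertex in the group it agrees with most; keep the best."""
    def agree(x, y, same):
        lab = labels.get((x, y)) or labels.get((y, x))
        return (lab == '+') == same
    rest = [v for v in vertices if v not in sample]
    best, best_score = None, -1
    for assign in product(range(k), repeat=len(sample)):
        cl = dict(zip(sample, assign))
        for v in rest:
            # greedily place v where it agrees most with the sample
            cl[v] = max(range(k), key=lambda g: sum(
                agree(v, s, cl[s] == g) for s in sample))
        score = sum(agree(x, y, cl[x] == cl[y])
                    for x, y in combinations(vertices, 2))
        if score > best_score:
            best, best_score = dict(cl), score
    return best, best_score

vertices = ['a', 'b', 'c', 'd']
labels = {('a', 'b'): '+', ('c', 'd'): '+', ('a', 'c'): '-',
          ('a', 'd'): '-', ('b', 'c'): '-', ('b', 'd'): '-'}
best, best_score = sample_based_cluster(vertices, labels, ['a', 'c'], 2)
```

Trying all k^|sample| partitions of a small sample is what makes the running time exponential in the sample size, as the slide notes.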

13
Minimizing Disagreements
  • Goal: Get a constant-factor approximation.
  • Problem: Even if we can find a cluster that's as
    good as the best cluster of OPT, we're headed toward an
    O(log n) Set-Cover-like analysis.

Need a way of lower-bounding D_OPT.
14
Lower bounding idea bad triangles
  • Consider a "bad triangle": two + edges and one - edge.

Any clustering has to disagree with at least one
of these edges.
15
Lower bounding idea bad triangles
  • If there are several such edge-disjoint triangles,
    then a mistake is made on each one.

[Figure: vertices 1-5 with 2 edge-disjoint bad
triangles (1,2,3) and (3,4,5)]

D_OPT ≥ #(edge-disjoint bad triangles)
(Not Tight)
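A minimal sketch of this lower bound: collect bad triangles greedily, skipping any that reuse an edge. The greedy choice is only a heuristic witness, not a maximum packing:

```python
from itertools import combinations

def greedy_bad_triangles(vertices, labels):
    """Greedily collect edge-disjoint bad triangles (two '+' edges and
    one '-' edge); their number lower-bounds D_OPT."""
    def lab(x, y):
        return labels.get((x, y)) or labels.get((y, x))
    used, tris = set(), []
    for t in combinations(vertices, 3):
        edges = [frozenset(e) for e in combinations(t, 2)]
        if any(e in used for e in edges):
            continue  # keep the collection edge-disjoint
        signs = sorted(lab(*tuple(e)) for e in edges)
        if signs == ['+', '+', '-']:  # a bad triangle
            used.update(edges)
            tris.append(t)
    return tris

# The slide's example: bad triangles (1,2,3) and (3,4,5).
labels = {(1, 2): '+', (2, 3): '+', (1, 3): '-',
          (3, 4): '+', (4, 5): '+', (3, 5): '-',
          (1, 4): '-', (1, 5): '-', (2, 4): '-', (2, 5): '-'}
tris = greedy_bad_triangles([1, 2, 3, 4, 5], labels)
```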
16
Lower bounding idea bad triangles
  • If several such, then a mistake on each one.

[Figure: vertices 1-5 with edge-disjoint bad
triangles (1,2,3) and (3,4,5)]

D_OPT ≥ #(edge-disjoint bad triangles)
How can we use this?
17
δ-clean Clusters
  • Given a clustering, a vertex v in its cluster C is
    δ-good if it has few disagreements:

|N-(v) ∩ C| < δ|C|   and   |N+(v) outside C| < δ|C|

Essentially, N+(v) ≈ C(v).
A cluster C is δ-clean if all v ∈ C are δ-good.
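The two conditions above translate directly into a checker; the pair-dict encoding of the ± labels is an assumption carried over from the earlier sketches:

```python
def is_delta_good(v, cluster, vertices, labels, delta):
    """v is delta-good in its cluster: few '-' neighbors inside it and
    few '+' neighbors outside it (each fewer than delta * |cluster|)."""
    def lab(x, y):
        return labels.get((x, y)) or labels.get((y, x))
    minus_inside = sum(1 for u in cluster if u != v and lab(u, v) == '-')
    plus_outside = sum(1 for u in vertices
                       if u not in cluster and lab(u, v) == '+')
    bound = delta * len(cluster)
    return minus_inside < bound and plus_outside < bound

def is_delta_clean(cluster, vertices, labels, delta):
    """A cluster is delta-clean if every vertex in it is delta-good."""
    return all(is_delta_good(v, cluster, vertices, labels, delta)
               for v in cluster)

# {1,2,3} is all-'+' inside and all-'-' toward vertex 4.
labels = {(1, 2): '+', (1, 3): '+', (2, 3): '+',
          (1, 4): '-', (2, 4): '-', (3, 4): '-'}
```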
18
Observation
  • Any δ-clean clustering is an 8-approximation for δ < 1/4.

Idea: Charging mistakes to bad triangles.

[Figure: a wrong - edge (u,v) inside a cluster, with a
third vertex w forming + edges (u,w) and (v,w)]

Intuitively, there are enough choices of w for each wrong
edge (u,v).
19
Observation
  • Any δ-clean clustering is an 8-approximation for δ < 1/4.

Idea: Charging mistakes to bad triangles.

[Figure: wrong edges (u,v) charged to bad triangles
through a third vertex w]

Intuitively, there are enough choices of w for each wrong
edge (u,v): we can find edge-disjoint bad triangles.
A similar argument handles +ve edges between 2 clusters.
20
General Structure of Argument
  • Any δ-clean clustering is an 8-approximation for δ < 1/4.
  • Consequence: Just need to produce a
    δ-clean clustering for δ < 1/4.

21
General Structure of Argument
  • Any δ-clean clustering is an 8-approximation for δ < 1/4.
  • Consequence: Just need to produce a
    δ-clean clustering for δ < 1/4.

Bad news: may not be possible!
22
General Structure of Argument
  • Any δ-clean clustering is an 8-approximation for δ < 1/4.
  • Consequence: Just need to produce a
    δ-clean clustering for δ < 1/4.
  • Approach:
  • 1) Show a clustering of a special form, OPT(δ):
  •    clusters either δ-clean or singletons,
  •    with not many more mistakes than OPT.
  • 2) Produce something close to OPT(δ).

Bad news: may not be possible!
23
Existence of OPT(δ)
OPT
24
Existence of OPT(δ)
  • Identify δ/3-bad vertices

[Figure: OPT clusters C1, C2 with their δ/3-bad
vertices marked]
25
Existence of OPT(δ)
  • Move δ/3-bad vertices out.
  • If a cluster has many (≥ δ/3 fraction) δ/3-bad
    vertices, split it into singletons.

OPT(δ): singletons and δ-clean clusters, with
D_OPT(δ) ≤ O(1) · D_OPT

[Figure: C1 cleaned, C2 split, yielding OPT(δ)]
26
Main Result
OPT(δ)
δ-clean
27
Main Result
  • Guarantee:
  • Non-singletons are 11δ-clean.
  • Singletons are a subset of OPT(δ)'s singletons.

OPT(δ)
δ-clean
Algorithm
11δ-clean
28
Main Result
  • Guarantee:
  • Non-singletons are 11δ-clean.
  • Singletons are a subset of OPT(δ)'s singletons.

OPT(δ)
δ-clean
Choose δ = 1/44, so non-singletons are ¼-clean.
Bound mistakes among non-singletons by bad triangles,
and those involving singletons by those of OPT(δ).
Algorithm
11δ-clean
29
Main Result
  • Guarantee:
  • Non-singletons are 11δ-clean.
  • Singletons are a subset of OPT(δ)'s singletons.

OPT(δ)
δ-clean
Choose δ = 1/44, so non-singletons are ¼-clean.
Bound mistakes among non-singletons by bad triangles,
and those involving singletons by those of OPT(δ).
Approx. ratio: 9/δ² + 8
Algorithm
11δ-clean
30
Open Problems
  • How about a small constant?
  • Is D_OPT ≤ 2 × (max # edge-disjoint bad triangles)?
  • Is D_OPT ≤ 2 × (max # fractional edge-disjoint bad
    triangles)?
  • Extend to {-1, 0, +1} weights: approximation to a
    constant or even a log factor?
  • Extend to the weighted case?
  • Clique Partitioning Problem:
  • Cutting plane algorithms, Grötschel et al.

31
(No Transcript)
32
(No Transcript)
33
The problem
Harry B.
Harry Bovik
+ Same   - Different
Disagreement
H. Bovik
Tom X.
Train up f(x,y) = same/different
Run f on all pairs
Find most consistent clustering
34
Nice features of formulation
  • There's no k. (OPT can have anywhere from 1 to n
    clusters.)
  • If a perfect solution exists, then it's easy to
    find: C(v) = N+(v).
  • Easy to get agreement on ½ of the edges.
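The perfect case on this slide is a one-liner per cluster: each cluster is a vertex together with its + neighbors. A small sketch, assuming the pair-dict label encoding used earlier:

```python
def recover_perfect_clustering(vertices, labels):
    """When a fully consistent clustering exists, each cluster is just a
    vertex plus its '+' neighbors: C(v) = {v} | N+(v)."""
    def plus_neighbors(v):
        return {u for u in vertices if u != v and
                (labels.get((u, v)) or labels.get((v, u))) == '+'}
    seen, clusters = set(), []
    for v in vertices:
        if v not in seen:
            c = {v} | plus_neighbors(v)  # the whole cluster of v
            clusters.append(c)
            seen |= c
    return clusters

labels = {('a', 'b'): '+', ('c', 'd'): '+', ('a', 'c'): '-',
          ('a', 'd'): '-', ('b', 'c'): '-', ('b', 'd'): '-'}
clusters = recover_perfect_clustering(['a', 'b', 'c', 'd'], labels)
```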

35
Algorithm
  • Pick a vertex v. Let C(v) = N+(v).
  • Modify C(v):
  • Remove 3δ-bad vertices from C(v).
  • Add 7δ-good vertices into C(v).
  • Delete C(v). Repeat until done, or until the above
    always makes empty clusters.
  • Output the nodes left as singletons.
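A runnable sketch of the steps above. It is simplified for illustration: the bad/good tests are informal counterparts of the 3δ-bad / 7δ-good definitions, and the pair-dict label encoding is an assumption:

```python
def cluster_by_cleaning(vertices, labels, delta):
    """Grow C around a vertex, prune 3delta-bad vertices, absorb
    7delta-good ones, then repeat on the rest; leftovers stay singletons."""
    def lab(x, y):
        return labels.get((x, y)) or labels.get((y, x))
    def is_bad(x, C, t):
        # t-bad w.r.t. C: many '-' inside C or many '+' outside it
        minus_in = sum(1 for u in C if u != x and lab(u, x) == '-')
        plus_out = sum(1 for u in remaining
                       if u not in C and lab(u, x) == '+')
        return minus_in >= t * len(C) or plus_out >= t * len(C)
    remaining, clusters = set(vertices), []
    while remaining:
        v = next(iter(remaining))
        C = {v} | {u for u in remaining if u != v and lab(u, v) == '+'}
        C -= {x for x in C if is_bad(x, C, 3 * delta)}          # removal phase
        C |= {x for x in remaining - C
              if not is_bad(x, C, 7 * delta)}                   # addition phase
        if not C:
            break  # the steps only make empty clusters; stop
        clusters.append(C)
        remaining -= C
    clusters += [{x} for x in remaining]  # leftover singletons
    return clusters

labels = {('a', 'b'): '+', ('c', 'd'): '+', ('a', 'c'): '-',
          ('a', 'd'): '-', ('b', 'c'): '-', ('b', 'd'): '-'}
clusters = cluster_by_cleaning(['a', 'b', 'c', 'd'], labels, 0.1)
```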


36
Step 1
Choose v; let C = the + neighbors of v
C1
C2
37
Step 2
Vertex Removal Phase: If x is 3δ-bad, C ← C − {x}
C1
C2
v
C
  • No vertex in C1 is removed.
  • All vertices in C2 are removed.

38
Step 3
Vertex Addition Phase: Add 7δ-good vertices to C
C1
C2
v
C
  • All remaining vertices in C1 will be added.
  • None in C2 are added.
  • Cluster C is 11δ-clean.

39
Case 2: v is a singleton in OPT(δ)
Choose v; let C = the + neighbors of v
C
v
Same idea works
40
The problem