Title: Correlation Clustering
1. Correlation Clustering
- Shuchi Chawla
- Carnegie Mellon University
- Joint work with Nikhil Bansal and Avrim Blum
2. Document Clustering
- Given a bunch of documents, classify them into salient topics
- Typical characteristics:
- No well-defined similarity metric
- Number of clusters is unknown
- No predefined topics; desirable to figure them out as part of the algorithm
3. Research Communities
- Given data on research papers, divide researchers into communities by co-authorship
- Typical characteristics:
- How to divide really depends on the given set of researchers
- Fuzzy boundaries
4. Traditional Approaches to Clustering
- Approximation algorithms
- k-means, k-median, k-min sum
- Matrix methods
- Spectral Clustering
- AI techniques
- EM, classification algorithms
5. Problems with traditional approaches
- Dependence on underlying metric
- Objective functions are meaningless without a metric, e.g. k-means
- Algorithm works only on specific metrics (such as Euclidean), e.g. spectral methods
6. Problems with traditional approaches
- Fixed number of clusters
- Meaningless without a prespecified number of clusters
- e.g. for k-means or k-median, if k is unspecified, it is best to put every item in its own cluster
7. Problems with traditional approaches
- No clean notion of quality of clustering
- Objective functions do not directly translate to how many items have been grouped wrongly
- Heuristic approaches
- Objective functions derived from generative models
8. Cohen, McCallum & Richman's idea
- Learn a similarity measure on documents
- may not be a metric!
- f(x,y) = amount of similarity between x and y
- Use labeled data to train up this function
- Classify all pairs with the learned function
- Find the most consistent clustering (a toy sketch follows below)
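The pipeline can be sketched in a few lines of Python. Everything here (the names label_edges and toy_f, the 0.5 threshold) is hypothetical scaffolding, not from the paper; the point is only that a learned pairwise function f gets thresholded into +/- labels on all pairs, which the clustering step then tries to satisfy.

```python
from itertools import combinations

def label_edges(items, f, threshold=0.5):
    """Classify every pair with the learned function f(x, y):
    '+' if f deems the pair similar enough, '-' otherwise."""
    return {(x, y): '+' if f(x, y) >= threshold else '-'
            for x, y in combinations(items, 2)}

# Toy similarity for the running example: names sharing a word token
# are considered the "same" entity (purely illustrative).
names = ["Harry B.", "Harry Bovik", "H. Bovik", "Tom X."]
toy_f = lambda x, y: len(set(x.split()) & set(y.split())) / 2.0
labels = label_edges(names, toy_f)  # e.g. ('Harry B.', 'Harry Bovik'): '+'
```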
9. An example
[Figure: a graph on the names Harry B., Harry Bovik, H. Bovik, and Tom X.]
- Consistent clustering:
- + edges inside clusters
- - edges between clusters
10. An example
[Figure: the same graph with each edge labeled "Same" (+) or "Different" (-)]
11. An example
[Figure: the same labeled graph on Harry B., Harry Bovik, H. Bovik, and Tom X.]
- Task: find the most consistent clustering
- or, fewest possible disagreements
- equivalently, maximum possible agreements
12. Correlation clustering
- Given a complete graph
- Each edge labeled + or -
- Our measure of a clustering: how many labels does it agree with? (counted in the sketch below)
- Number of clusters depends on the edge labels
- NP-complete, so we consider approximations
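To make the objective concrete, here is a minimal sketch (names like disagreements and the label dictionary are illustrative, not from the talk) that counts how many edge labels a given clustering violates:

```python
def disagreements(labels, cluster_of):
    """labels: {(x, y): '+' or '-'} over all pairs of a complete graph.
    cluster_of: {node: cluster id}.
    A '+' edge disagrees if its endpoints are split across clusters;
    a '-' edge disagrees if its endpoints share a cluster."""
    cost = 0
    for (x, y), sign in labels.items():
        together = cluster_of[x] == cluster_of[y]
        if (sign == '+') != together:
            cost += 1
    return cost
```

Minimizing this count and maximizing its complement (agreements) are equivalent exactly, but they behave very differently for approximation, which is why the talk treats them separately.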
13. Compared to traditional approaches
- Do not have to specify k
- No condition on weights; they can be arbitrary
- Clean notion of quality of clustering: the number of examples where the clustering differs from f
- If a good (perfect) clustering exists, it is easy to find
14. Some machine learning justification
- Noise Removal
- There is some true classification function f
- But there are a few errors in the data
- We want to find the true function
- Agnostic Learning
- There is no inherent clustering
- Try to find the best representation using a
hypothesis with limited expressivity
15. Our results
- Constant factor approximation for minimizing disagreements
- PTAS for maximizing agreements
- Results for the random noise case
16. Minimizing Disagreements
- Goal: a constant approximation
- Problem: even if we find a cluster as good as one in OPT, we are headed towards a log n approximation (a set-cover-like bound)
- Idea: lower bound D_OPT, the number of disagreements of the optimal clustering
17. Lower Bounding Idea: Bad Triangles
[Figure: a triangle with two + edges and one - edge, a "bad triangle"]
- We know any clustering has to disagree with at least one of these edges.
18. Lower Bounding Idea: Bad Triangles
- If there are several edge-disjoint bad triangles, then any clustering makes a mistake on each one
[Figure: five nodes with two edge-disjoint bad triangles, (1,2,3) and (1,4,5)]
- D_OPT ≥ number of edge-disjoint bad triangles (see the sketch below)
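One way to realize this bound in code, sketched below with hypothetical names: any edge-disjoint collection of bad triangles is a valid lower bound on D_OPT, so even a simple greedy pass suffices for illustration (the analysis in the talk does not require finding the largest such collection).

```python
from itertools import combinations

def bad_triangle_lower_bound(nodes, labels):
    """Greedily collect edge-disjoint bad triangles: triangles with two
    + edges and one - edge. Their count lower-bounds D_OPT, since any
    clustering must err on at least one edge of each."""
    def sign(a, b):
        return labels[(a, b)] if (a, b) in labels else labels[(b, a)]
    used, count = set(), 0
    for a, b, c in combinations(nodes, 3):
        edges = [frozenset(e) for e in ((a, b), (b, c), (a, c))]
        if any(e in used for e in edges):
            continue                      # keep triangles edge-disjoint
        signs = sorted(sign(*tuple(e)) for e in edges)
        if signs == ['+', '+', '-']:      # exactly one - edge: bad triangle
            used.update(edges)
            count += 1
    return count                          # count <= D_OPT
```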
19. Using the lower bound
- δ-clean cluster: a cluster C in which each node has fewer than δ|C| bad edges
- δ-clean clusters have few bad triangles ⇒ few mistakes
- Possible solution: find a δ-clean clustering
- Caveat: it may not exist
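As a sanity check on the definition above, here is a minimal sketch (the helper is_delta_clean is hypothetical, and this follows the slide's simplified wording; the paper's definition is somewhat more refined) that tests whether one cluster is δ-clean:

```python
def is_delta_clean(C, rest, labels, delta):
    """C: nodes of one cluster; rest: all nodes outside C.
    Every node of C must have fewer than delta * |C| bad edges,
    where a bad edge is a - edge inside C or a + edge leaving C."""
    def sign(a, b):
        return labels[(a, b)] if (a, b) in labels else labels[(b, a)]
    for v in C:
        bad = sum(1 for u in C if u != v and sign(u, v) == '-')
        bad += sum(1 for u in rest if sign(u, v) == '+')
        if bad >= delta * len(C):
            return False
    return True
```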
20. Using the lower bound
- Caveat: a δ-clean clustering may not exist
- We show ∃ a clustering whose clusters are δ-clean or singletons
- Further, it has few mistakes
- Its nice structure helps us find it easily.
21. Maximizing Agreements
- Easy to obtain a 2-approximation (see the sketch below)
- If #(+ edges) > #(- edges): everything in one cluster
- Otherwise: n singleton clusters
- Get at least half the edges correct
- Max possible score = total number of edges
- 2-approximation!
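The trivial 2-approximation on this slide fits in a few lines (a sketch; the name two_approx is illustrative):

```python
def two_approx(nodes, labels):
    """Pick the better of the two trivial clusterings."""
    plus = sum(1 for s in labels.values() if s == '+')
    minus = len(labels) - plus
    if plus > minus:
        # One big cluster agrees with every + edge.
        return {v: 0 for v in nodes}
    # All singletons agree with every - edge.
    return {v: i for i, v in enumerate(nodes)}
```

Whichever option wins agrees with at least half of all edges, while the optimum can agree with at most all of them, hence the factor 2.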
22. Maximizing Agreements
- Max possible score ≈ ½n²
- Goal: obtain an additive approximation of εn²
- Standard approach (sketched below):
- Draw a small sample
- Guess the partition of the sample
- Compute the partition of the remainder
- Running time doubly exponential in 1/ε, or singly exponential with a bad exponent.
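A much-simplified sketch of this sampling approach follows. It is illustrative only: the actual PTAS needs carefully chosen sample sizes and a different analysis to obtain the εn² guarantee, and all names here are hypothetical.

```python
import random

def partitions(seq):
    """Yield all set partitions of seq (feasible only for tiny samples)."""
    if not seq:
        yield []
        return
    head, rest = seq[0], seq[1:]
    for p in partitions(rest):
        for i in range(len(p)):
            yield p[:i] + [p[i] + [head]] + p[i + 1:]
        yield p + [[head]]

def sample_cluster(nodes, labels, sample_size=6, seed=0):
    """Enumerate partitions of a small random sample, extend each
    greedily to the remaining nodes, and keep the best clustering."""
    def sign(a, b):
        return labels[(a, b)] if (a, b) in labels else labels[(b, a)]
    def agreements(clusters):
        cid = {v: i for i, c in enumerate(clusters) for v in c}
        return sum(1 for (x, y), s in labels.items()
                   if (s == '+') == (cid[x] == cid[y]))
    random.seed(seed)
    S = random.sample(list(nodes), min(sample_size, len(nodes)))
    rest = [v for v in nodes if v not in S]
    best, best_score = None, -1
    for p in partitions(S):                  # "guess" the sample's partition
        clusters = [list(part) for part in p]
        for v in rest:                       # extend greedily to the rest
            gains = [sum(1 if sign(v, u) == '+' else -1 for u in c)
                     for c in clusters]
            i = max(range(len(clusters)), key=gains.__getitem__)
            if gains[i] > 0:
                clusters[i].append(v)
            else:
                clusters.append([v])         # no good home: new singleton
        score = agreements(clusters)
        if score > best_score:
            best, best_score = clusters, score
    return best
```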
23. Extensions and Open Problems
- Weighted edges or incomplete graph
- Recent work by Bartal et al.
- log-approximation based on multiway cut
- Better constant for the unweighted case
- Can we use bad triangles (or polygons) more directly for a tighter bound?
- Experimental performance
24. Other problems I have worked on
- Game Theory and Mechanism Design
- Approximation algorithms for Orienteering and related problems
- Online search algorithms based on Machine Learning approaches
- Theoretical properties of Power Law graphs
- Currently working on Privacy with Cynthia
26. Using the lower bound: δ-clean clusters
- give proof that δ-clean is ≤ 4·OPT
27. Proof outline
- δ-clean ≤ 4·OPT
- but there may not be a δ-clean clustering!!
- show that there is a clustering that is close to δ-clean: clusters are either δ-clean or singletons
- there exists such a clustering close to OPT
- we will try to find this clustering
- (copy Nikhil's slide)
28. Existence of OPT(δ)
29. Algorithm
- pictorially: use Nikhil's slides
- brief outline of how it does that
30. Bounding the cost
- clusters containing OPT(δ)'s clusters are ¼-clean
- the rest have at most as many mistakes as OPT(δ)
31. Random noise?