Title: Correlation Clustering
1. Correlation Clustering
- Shuchi Chawla
- Carnegie Mellon University
- Joint work with Nikhil Bansal and Avrim Blum
2. Document Clustering
- Given a bunch of documents, classify them into salient topics
- Typical characteristics:
- No well-defined similarity metric
- Number of clusters is unknown
- No predefined topics; desirable to figure them out as part of the algorithm
3. Research Communities
- Given data on research papers, divide researchers into communities by co-authorship
- Typical characteristics:
- How to divide really depends on the given set of researchers
- Fuzzy boundaries
4. Traditional Approaches to Clustering
- Approximation algorithms
- k-means, k-median, k-min sum
- Matrix methods
- Spectral Clustering
- AI techniques
- EM, classification algorithms
5. Problems with traditional approaches
- Dependence on underlying metric
- Objective functions are meaningless without a metric, e.g. k-means
- Algorithm works only on specific metrics (such as Euclidean), e.g. spectral methods
6. Problems with traditional approaches
- Fixed number of clusters
- Meaningless without a prespecified number of clusters
- e.g. for k-means or k-median, if k is unspecified, it is best to put every item in its own cluster
7. Problems with traditional approaches
- No clean notion of quality of clustering
- Objective functions do not directly translate to how many items have been grouped wrongly
- Heuristic approaches
- Objective functions derived from generative models
8. Cohen, McCallum & Richman's idea
- Learn a similarity measure on documents
- may not be a metric!
- f(x,y) = amount of similarity between x and y
- Use labeled data to train up this function
- Classify all pairs with the learned function
- Find the most consistent clustering (a toy sketch follows below)
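The pipeline can be sketched in a few lines of Python. Everything here (the names label_edges and toy_f, the 0.5 threshold) is hypothetical scaffolding, not from the paper; the point is only that a learned pairwise function f gets thresholded into +/- labels on all pairs, which the clustering step then tries to satisfy.

```python
from itertools import combinations

def label_edges(items, f, threshold=0.5):
    """Classify every pair with the learned function f(x, y):
    '+' if f deems the pair similar enough, '-' otherwise."""
    return {(x, y): '+' if f(x, y) >= threshold else '-'
            for x, y in combinations(items, 2)}

# Toy similarity for the running example: names sharing a word token
# are considered the "same" entity (purely illustrative).
names = ["Harry B.", "Harry Bovik", "H. Bovik", "Tom X."]
toy_f = lambda x, y: len(set(x.split()) & set(y.split())) / 2.0
labels = label_edges(names, toy_f)  # e.g. ('Harry B.', 'Harry Bovik'): '+'
```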
9. An example
[Figure: a graph on the names Harry B., Harry Bovik, H. Bovik, and Tom X.]
- Consistent clustering:
- + edges inside clusters
- - edges between clusters
10. An example
[Figure: the same graph with each edge labeled "Same" (+) or "Different" (-)]
11. An example
[Figure: the same labeled graph on Harry B., Harry Bovik, H. Bovik, and Tom X.]
- Task: find the most consistent clustering
- or, fewest possible disagreements
- equivalently, maximum possible agreements
12. Correlation clustering
- Given a complete graph
- Each edge labeled + or -
- Our measure of a clustering: how many labels does it agree with? (counted in the sketch below)
- Number of clusters depends on the edge labels
- NP-complete, so we consider approximations
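To make the objective concrete, here is a minimal sketch (names like disagreements and the label dictionary are illustrative, not from the talk) that counts how many edge labels a given clustering violates:

```python
def disagreements(labels, cluster_of):
    """labels: {(x, y): '+' or '-'} over all pairs of a complete graph.
    cluster_of: {node: cluster id}.
    A '+' edge disagrees if its endpoints are split across clusters;
    a '-' edge disagrees if its endpoints share a cluster."""
    cost = 0
    for (x, y), sign in labels.items():
        together = cluster_of[x] == cluster_of[y]
        if (sign == '+') != together:
            cost += 1
    return cost
```

Minimizing this count and maximizing its complement (agreements) are equivalent exactly, but they behave very differently for approximation, which is why the talk treats them separately.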
13. Compared to traditional approaches
- Do not have to specify k
- No condition on weights; they can be arbitrary
- Clean notion of quality of clustering: the number of examples where the clustering differs from f
- If a good (perfect) clustering exists, it is easy to find
14. Some machine learning justification
- Noise Removal
- There is some true classification function f
- But there are a few errors in the data
- We want to find the true function
- Agnostic Learning
- There is no inherent clustering
- Try to find the best representation using a
hypothesis with limited expressivity
15. Our results
- Constant factor approximation for minimizing disagreements
- PTAS for maximizing agreements
- Results for the random noise case
16. Minimizing Disagreements
- Goal: a constant approximation
- Problem: even if we find a cluster as good as one in OPT, we are headed towards a log n approximation (a set-cover-like bound)
- Idea: lower bound D_OPT, the number of disagreements of the optimal clustering
17. Lower Bounding Idea: Bad Triangles
[Figure: a triangle with two + edges and one - edge, a "bad triangle"]
- We know any clustering has to disagree with at least one of these edges.
18. Lower Bounding Idea: Bad Triangles
- If there are several edge-disjoint bad triangles, then any clustering makes a mistake on each one
[Figure: five nodes with two edge-disjoint bad triangles, (1,2,3) and (1,4,5)]
- D_OPT ≥ number of edge-disjoint bad triangles (see the sketch below)
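One way to realize this bound in code, sketched below with hypothetical names: any edge-disjoint collection of bad triangles is a valid lower bound on D_OPT, so even a simple greedy pass suffices for illustration (the analysis in the talk does not require finding the largest such collection).

```python
from itertools import combinations

def bad_triangle_lower_bound(nodes, labels):
    """Greedily collect edge-disjoint bad triangles: triangles with two
    + edges and one - edge. Their count lower-bounds D_OPT, since any
    clustering must err on at least one edge of each."""
    def sign(a, b):
        return labels[(a, b)] if (a, b) in labels else labels[(b, a)]
    used, count = set(), 0
    for a, b, c in combinations(nodes, 3):
        edges = [frozenset(e) for e in ((a, b), (b, c), (a, c))]
        if any(e in used for e in edges):
            continue                      # keep triangles edge-disjoint
        signs = sorted(sign(*tuple(e)) for e in edges)
        if signs == ['+', '+', '-']:      # exactly one - edge: bad triangle
            used.update(edges)
            count += 1
    return count                          # count <= D_OPT
```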
19. Using the lower bound
- δ-clean cluster: a cluster C in which each node has fewer than δ|C| bad edges
- δ-clean clusters have few bad triangles ⇒ few mistakes
- Possible solution: find a δ-clean clustering
- Caveat: it may not exist
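As a sanity check on the definition above, here is a minimal sketch (the helper is_delta_clean is hypothetical, and this follows the slide's simplified wording; the paper's definition is somewhat more refined) that tests whether one cluster is δ-clean:

```python
def is_delta_clean(C, rest, labels, delta):
    """C: nodes of one cluster; rest: all nodes outside C.
    Every node of C must have fewer than delta * |C| bad edges,
    where a bad edge is a - edge inside C or a + edge leaving C."""
    def sign(a, b):
        return labels[(a, b)] if (a, b) in labels else labels[(b, a)]
    for v in C:
        bad = sum(1 for u in C if u != v and sign(u, v) == '-')
        bad += sum(1 for u in rest if sign(u, v) == '+')
        if bad >= delta * len(C):
            return False
    return True
```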
20. Using the lower bound
- Caveat: a δ-clean clustering may not exist
- We show ∃ a clustering whose clusters are δ-clean or singletons
- Further, it has few mistakes
- Its nice structure helps us find it easily.
21. Maximizing Agreements
- Easy to obtain a 2-approximation (see the sketch below)
- If #(+ edges) > #(- edges): everything in one cluster
- Otherwise: n singleton clusters
- Get at least half the edges correct
- Max possible score = total number of edges
- 2-approximation!
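The trivial 2-approximation on this slide fits in a few lines (a sketch; the name two_approx is illustrative):

```python
def two_approx(nodes, labels):
    """Pick the better of the two trivial clusterings."""
    plus = sum(1 for s in labels.values() if s == '+')
    minus = len(labels) - plus
    if plus > minus:
        # One big cluster agrees with every + edge.
        return {v: 0 for v in nodes}
    # All singletons agree with every - edge.
    return {v: i for i, v in enumerate(nodes)}
```

Whichever option wins agrees with at least half of all edges, while the optimum can agree with at most all of them, hence the factor 2.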
22. Maximizing Agreements
- Max possible score ≈ ½n²
- Goal: obtain an additive approximation of εn²
- Standard approach (sketched below):
- Draw a small sample
- Guess the partition of the sample
- Compute the partition of the remainder
- Running time doubly exponential in 1/ε, or singly exponential with a bad exponent.
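A much-simplified sketch of this sampling approach follows. It is illustrative only: the actual PTAS needs carefully chosen sample sizes and a different analysis to obtain the εn² guarantee, and all names here are hypothetical.

```python
import random

def partitions(seq):
    """Yield all set partitions of seq (feasible only for tiny samples)."""
    if not seq:
        yield []
        return
    head, rest = seq[0], seq[1:]
    for p in partitions(rest):
        for i in range(len(p)):
            yield p[:i] + [p[i] + [head]] + p[i + 1:]
        yield p + [[head]]

def sample_cluster(nodes, labels, sample_size=6, seed=0):
    """Enumerate partitions of a small random sample, extend each
    greedily to the remaining nodes, and keep the best clustering."""
    def sign(a, b):
        return labels[(a, b)] if (a, b) in labels else labels[(b, a)]
    def agreements(clusters):
        cid = {v: i for i, c in enumerate(clusters) for v in c}
        return sum(1 for (x, y), s in labels.items()
                   if (s == '+') == (cid[x] == cid[y]))
    random.seed(seed)
    S = random.sample(list(nodes), min(sample_size, len(nodes)))
    rest = [v for v in nodes if v not in S]
    best, best_score = None, -1
    for p in partitions(S):                  # "guess" the sample's partition
        clusters = [list(part) for part in p]
        for v in rest:                       # extend greedily to the rest
            gains = [sum(1 if sign(v, u) == '+' else -1 for u in c)
                     for c in clusters]
            i = max(range(len(clusters)), key=gains.__getitem__)
            if gains[i] > 0:
                clusters[i].append(v)
            else:
                clusters.append([v])         # no good home: new singleton
        score = agreements(clusters)
        if score > best_score:
            best, best_score = clusters, score
    return best
```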
23. Extensions and Open Problems
- Weighted edges or incomplete graph
- Recent work by Bartal et al.
- log-approximation based on multiway cut
- Better constant for the unweighted case
- Can we use bad triangles (or polygons) more directly for a tighter bound?
- Experimental performance
24. Other problems I have worked on
- Game Theory and Mechanism Design
- Approximation algorithms for Orienteering and related problems
- Online search algorithms based on Machine Learning approaches
- Theoretical properties of Power Law graphs
- Currently working on Privacy with Cynthia
26. Using the lower bound: δ-clean clusters
- give proof that δ-clean is ≤ 4·OPT
27. Proof outline
- δ-clean ≤ 4·OPT
- but there may not be a δ-clean clustering!!
- show that there is a clustering that is close to δ-clean: clusters are either δ-clean or singletons
- there exists such a clustering close to OPT
- we will try to find this clustering
- (copy Nikhil's slide)
28. Existence of OPT(δ)
29. Algorithm
- pictorially: use Nikhil's slides
- brief outline of how it does that
30. Bounding the cost
- clusters containing OPT(δ)'s clusters are ¼-clean
- the rest have at most as many mistakes as OPT(δ)
31. Random noise?