Title: Correlation Clustering
1. Correlation Clustering
- Shuchi Chawla
- Carnegie Mellon University
- Joint work with Nikhil Bansal and Avrim Blum
2. Natural Language Processing: Co-reference Analysis
- In order to understand an article automatically, we need to figure out which entities are one and the same
- Is "his" in the second line the same person as "The secretary" in the first line?
3. Other real-world clustering problems
- Web Document Clustering: given a bunch of documents, classify them into salient topics
- Computer Vision: distinguish boundaries between different objects and the background in a picture
- Research Communities: given data on research papers, divide researchers into communities by co-authorship
- Authorship (Citeseer/DBLP): given the authors of documents, figure out which authors are really the same person
4. Traditional Approaches to Clustering
- Approximation algorithms
- k-means, k-median, k-min sum
- Matrix methods
- Spectral Clustering
- AI techniques
- EM, single-linkage, classification algorithms
5. Issues with traditional approaches
- Dependence on an underlying metric
- Objective functions are meaningful only on a metric, e.g., k-means
- Some algorithms work only for specific metrics (such as Euclidean)
- Problem:
- No well-defined similarity metric
- Inconsistencies in beliefs
6. Issues with traditional approaches
- Fixed number of clusters / known topics
- Meaningless without a prespecified number of clusters; e.g., for k-means or k-median, if k is unspecified, it is best to put every item in its own cluster
- Problem:
- Number of clusters is usually unknown
- No predefined topics; desirable to figure them out as part of the algorithm
7. Issues with traditional approaches
- No clean notion of quality of clustering
- Approximations do not directly translate to how many items have been grouped wrongly
- Reliance on a generative model
- e.g., data arising from a mixture of Gaussians
- Typically doesn't work well in the case of fuzzy boundaries
- Problem:
- Fuzzy boundaries: how to cluster may depend on the given set of objects
8. Cohen, McCallum and Richman's idea
- Learn a similarity function based on context
- f(x, y) = amount of similarity between x and y
- Not necessarily a metric!
- Use labeled data to train up this function
- Classify all pairs with the learned function
- Find the clustering that agrees most with the function
- The problem is divided into two separate phases; we deal with the second phase
9. Cohen, McCallum and Richman's idea
Learn a similarity measure based on context
[Figure: a graph of mentions (Mr. Rumsfield, his, The secretary, he, Saddam Hussein) with strong-similarity and strong-dissimilarity edges]
10. A good clustering
[Figure: the same mention graph, partitioned into clusters]
- Consistent clustering:
- positive edges inside clusters
- negative edges between clusters
11. A good clustering
[Figure: the mention graph with a clustering whose violated edges are inconsistencies, or mistakes]
- Consistent clustering:
- positive edges inside clusters
- negative edges between clusters
12. A good clustering
[Figure: a mention graph for which no consistent clustering exists; every clustering makes mistakes]
- Goal: find the most consistent clustering
13. Compared to traditional approaches
- Do not have to specify k: the number of clusters can range from 1 to n
- No condition on weights; they can be arbitrary
- Clean notion of quality of clustering: the number of examples on which the clustering differs from f
- If a good (perfect) clustering exists, it is easy to find
14. From a Machine Learning perspective
- Noise Removal
- There is some true classification function f
- But there are a few errors in the data
- We want to find the true function
- Agnostic Learning
- There is no inherent clustering
- Try to find the best representation using a
hypothesis with limited expressivity
15. Correlation Clustering
- Given a graph with positive (similar) and negative (dissimilar) edges, find the most consistent clustering
- NP-hard [Bansal, Blum, Chawla, FOCS '02]
- Two natural objectives:
- Maximize agreements: (# of + edges inside clusters) + (# of − edges between clusters)
- Minimize disagreements: (# of + edges between clusters) + (# of − edges inside clusters)
- Equivalent at optimality, but different in terms of approximation
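The two objectives are easy to state in code. A minimal sketch (the dict-of-pairs edge representation and the function name are illustrative assumptions, not from the talk):

```python
# Score a clustering against +/- edge labels.
# edges: dict mapping a pair (u, v) to +1 (similar) or -1 (dissimilar).
# cluster_of: dict mapping each vertex to its cluster id.

def agreements_and_disagreements(edges, cluster_of):
    agree = disagree = 0
    for (u, v), sign in edges.items():
        same = cluster_of[u] == cluster_of[v]
        # A + edge agrees when inside a cluster; a - edge agrees when between clusters.
        if (sign == +1) == same:
            agree += 1
        else:
            disagree += 1
    return agree, disagree

# Toy instance: a triangle with two + edges and one - edge.
edges = {(1, 2): +1, (2, 3): +1, (1, 3): -1}
print(agreements_and_disagreements(edges, {1: "A", 2: "A", 3: "A"}))  # (2, 1)
```

Maximizing the first count and minimizing the second pick out the same optimal clustering, but approximating them behaves very differently, as the next slide shows.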
16. Overview of results
- Minimizing disagreements
- Unweighted complete graph: O(1) [Bansal Blum Chawla 02]; 4 [Charikar et al 03]
- Weighted general graph: O(log n) [Charikar et al 03; Demaine et al 03; Emanuel et al 03]
- APX-hardness for the weighted case [Bansal Blum Chawla 02]
- Constant lower bounds for both cases [Charikar et al 03]
- Maximizing agreements
- Unweighted complete graph: PTAS [Bansal Blum Chawla 02]
- Weighted general graphs: 0.7664 [Charikar et al 03]; 0.7666 [Swamy 04]
- Constant lower bound for the weighted case [Charikar et al 03]
This talk: minimizing disagreements on the unweighted complete graph
17. Minimizing Disagreements [Bansal, Blum, Chawla, FOCS '02]
- Goal: approximately minimize the number of mistakes
- Assumption: the graph is unweighted and complete
- A lower bound on OPT: erroneous triangles
- Consider an erroneous triangle: two + edges and one − edge
- Any clustering disagrees with at least one of these edges
- If there are several edge-disjoint erroneous triangles, then any clustering makes a mistake on each one
- D_OPT ≥ maximum fractional packing of erroneous triangles
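The packing lower bound can be illustrated with a greedy sketch. Assumptions not from the talk: the complete graph is given as a dict of ±1 signs, and greedy selection finds *a* family of edge-disjoint erroneous triangles rather than the maximum fractional packing, so it yields a valid but possibly weaker lower bound on D_OPT.

```python
from itertools import combinations

def erroneous_triangle_lower_bound(vertices, sign):
    """sign maps each pair (u, v) to +1 or -1 (complete graph assumed).
    Returns the size of a greedy edge-disjoint packing of erroneous
    triangles, a lower bound on the disagreements of any clustering."""
    def s(u, v):
        return sign[(u, v)] if (u, v) in sign else sign[(v, u)]

    used = set()   # edges already consumed by a chosen triangle
    count = 0
    for u, v, w in combinations(vertices, 3):
        tri = [tuple(sorted(e)) for e in [(u, v), (v, w), (u, w)]]
        if any(e in used for e in tri):
            continue
        # Only a triangle with exactly two + edges and one - edge is erroneous:
        # every other sign pattern admits a consistent clustering.
        if sorted(s(*e) for e in tri) == [-1, +1, +1]:
            used.update(tri)
            count += 1
    return count

# The instance from the later slide: triangles (1,2,3) and (1,4,5) are
# erroneous and edge-disjoint; all unlisted pairs are negative.
sign = {(1, 2): +1, (2, 3): +1, (1, 3): -1,
        (1, 4): +1, (4, 5): +1, (1, 5): -1,
        (2, 4): -1, (2, 5): -1, (3, 4): -1, (3, 5): -1}
print(erroneous_triangle_lower_bound([1, 2, 3, 4, 5], sign))  # 2
```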
18. Using the lower bound: δ-clean clusters
- Relating erroneous triangles to mistakes
- In special cases, we can charge off disagreements to erroneous triangles
- Clean clusters:
- Each vertex has few disagreements incident on it
- "Few" is relative to the size of the cluster
- # of disagreements ≈ # of erroneous triangles
[Figure: a cluster containing a good vertex and a bad vertex; a cluster is clean if all its vertices are good]
19. Using the lower bound: δ-clean clusters
- Relating erroneous triangles to mistakes
- In special cases, we can charge off disagreements to erroneous triangles
- δ-clean clusters:
- Each vertex in cluster C has fewer than δ|C| positive and δ|C| negative mistakes
- δ ≤ ¼ ⇒ # of disagreements ≈ # of erroneous triangles
- A high density of positive edges, so we can easily spot them in the graph
- Possible solution: find a δ-clean clustering and charge disagreements to erroneous triangles
- Caveat: it may not exist
20. Using the lower bound: δ-clean clusters
- Caveat: a δ-clean clustering may not exist
- An almost-δ-clean clustering:
- All clusters are either δ-clean or contain a single node
- An almost-δ-clean clustering always exists, trivially
- We show:
- There exists an almost-δ-clean clustering, OPT(δ), that is almost as good as OPT
- Its nice structure helps us find it easily
21. OPT(δ): clean or singleton
[Figure: the optimal clustering, with bad vertices marked]
- Imaginary procedure:
- Few (≤ δ fraction) bad nodes ⇒ remove them from the cluster
- The new cluster is O(δ)-clean; few new mistakes (mistakes increase by at most a 1/δ factor)
22. OPT(δ): clean or singleton
[Figure: the optimal clustering, with bad vertices marked]
- Imaginary procedure:
- Many (≥ δ fraction) bad nodes ⇒ break up the cluster
- New singleton clusters: few new mistakes (mistakes increase by at most a 1/δ² factor)
- OPT(δ): all clusters are δ-clean or singletons
23. Our algorithm
- Goal: find nearly clean clusters
- Pick an arbitrary vertex v; let C be the positive (+ve) neighbors of v
- Remove any bad vertices from C
- Add vertices that are good w.r.t. C
- Output C and recurse on the remaining graph
- If C is empty for all choices of v, output the remaining vertices as singletons
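The steps above can be sketched as follows. This is a simplified illustration, not the exact BBC '02 procedure: the paper uses distinct constants (on the order of 3δ and 7δ) in the removal and addition steps, whereas this sketch uses a single δ threshold for both.

```python
def cautious_clustering(vertices, pos_edges, delta=0.25):
    """pos_edges: set of frozenset({u, v}) positive pairs; every other pair
    is negative (unweighted complete graph, as assumed on the slides)."""
    nbr = {v: set() for v in vertices}
    for e in pos_edges:
        u, v = tuple(e)
        nbr[u].add(v)
        nbr[v].add(u)

    def good(x, C, rest):
        # x is good w.r.t. C: almost all of C are +neighbors of x,
        # and x has few +neighbors outside C (among remaining vertices).
        others = C - {x}
        if not others:
            return False
        in_deg = len(nbr[x] & others)
        out_deg = len((nbr[x] & rest) - C)
        return in_deg >= (1 - delta) * len(others) and out_deg <= delta * len(others)

    remaining, clusters = set(vertices), []
    while remaining:
        chosen = None
        for v in remaining:
            C = (nbr[v] & remaining) | {v}                # +neighborhood of v
            C = {x for x in C if good(x, C, remaining)}   # remove bad vertices
            C |= {x for x in remaining - C if good(x, C, remaining)}  # add good ones
            if len(C) > 1:
                chosen = C
                break
        if chosen is None:  # no non-trivial cluster found: output singletons
            clusters.extend({v} for v in remaining)
            break
        clusters.append(chosen)
        remaining -= chosen
    return clusters

# Two clean clusters: {1,2,3} fully + inside, {4,5} joined by a + edge.
pos = {frozenset(p) for p in [(1, 2), (1, 3), (2, 3), (4, 5)]}
print(cautious_clustering([1, 2, 3, 4, 5], pos))
```

On this clean instance the sketch recovers the two planted clusters regardless of which vertex it picks first.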
24. Finding clean clusters
- ALG produces O(δ)-clean clusters; compare against OPT(δ)
- Charging off mistakes:
- 1. Mistakes among clean clusters: charge to erroneous triangles
- 2. Mistakes among singletons: no more than the corresponding mistakes in OPT(δ)
- ⇒ constant factor approximation
25. Maximizing Agreements
- Easy to obtain a 2-approximation:
- If (# of pos. edges) > (# of neg. edges), put everything in one cluster
- Otherwise, use n singleton clusters
- Either way, we get at least half the edges correct
- The maximum possible score is the total number of edges
- ⇒ 2-approximation
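This trivial 2-approximation is short enough to write out (the dict-based edge representation is an illustrative assumption):

```python
# Whichever of "one big cluster" and "all singletons" scores higher agrees
# with at least half of all edges, while OPT can never exceed the total
# number of edges -- hence a 2-approximation.

def two_approx_max_agree(vertices, edges):
    """edges: dict mapping frozenset({u, v}) to +1 (similar) or -1 (dissimilar)."""
    pos = sum(1 for s in edges.values() if s == +1)
    neg = len(edges) - pos
    if pos > neg:
        return [set(vertices)]       # every + edge agrees: score = pos >= len(edges)/2
    return [{v} for v in vertices]   # every - edge agrees: score = neg >= len(edges)/2

edges = {frozenset((1, 2)): +1, frozenset((2, 3)): +1, frozenset((1, 3)): -1}
print(two_approx_max_agree([1, 2, 3], edges))  # [{1, 2, 3}]
```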
26. Maximizing Agreements
- Maximum possible score: ½n²
- Goal: obtain an additive approximation of εn²
- Standard approach:
- Draw a small sample
- Guess a partition of the sample
- Compute a partition of the remainder
- Running time: doubly exponential in 1/ε, or singly exponential with a bad exponent
27. Experimental Results [Wellner & McCallum '03]

| Algorithm                            | Dataset 1 | Dataset 2 | Dataset 3 |
| Best-previous-match                  | 90.98     | 88.83     | 70.41     |
| Single-link-threshold                | 91.65     | 88.90     | 60.83     |
| Correlation clustering               | 93.96     | 91.59     | 73.42     |
| % error reduction over previous best | 28        | 24        | 10        |

(Numbers are % accuracy of classification)
28. Future Directions
- Better combinatorial approximation
- A good iterative approximation: on few changes to the graph, quickly recompute a good clustering
- Minimizing correlation: (number of agreements) − (number of disagreements)
- A log-approximation is known; can we get a constant factor approximation?
29. Questions?
30. Future Directions
- Clustering with small clusters:
- Given that all clusters in OPT have size at most k, find a good approximation; is this NP-hard?
- Different from finding the best clustering with small clusters, without a guarantee on OPT
- Clustering with few clusters:
- Given that OPT has at most k clusters, find an approximation
- Maximizing correlation:
- (number of agreements) − (number of disagreements)
- Can we get a constant factor approximation?
31. Lower Bounding Idea: Erroneous Triangles
- If there are several edge-disjoint erroneous triangles, then any clustering makes a mistake on each one
[Figure: vertices 1-5 with two edge-disjoint erroneous triangles, (1,2,3) and (1,4,5); the clustering shown makes 3 mistakes]
- D_OPT ≥ maximum fractional packing of erroneous triangles
32. Open Problems
- Clustering with small clusters:
- In most applications, clusters are very small
- Given that all clusters in OPT have size at most k, find a good approximation
- Different from finding the best clustering with small clusters, without a guarantee on OPT
- Optimal solution for unweighted graphs? A possible approach:
- Any two vertices in the same cluster in OPT are neighbors or share a common neighbor
- We can find a list of O(n2k) clusters such that all of OPT's clusters are in this list
- When k is small, there are only polynomially many choices to pick from
33. Open Problems
- Clustering with few clusters:
- Given that OPT has at most k clusters, find an approximation
- Consensus clustering:
- Given a sum of k clusterings, find the best consensus clustering
- Easy 2-approximation; can we get a PTAS?
- Maximizing correlation:
- (number of agreements) − (number of disagreements)
- Bad case: the disagreements are a constant fraction of the total weight
- Charikar & Wirth obtained a constant factor approximation
- Can we get a PTAS in unweighted graphs?
34. Overview of results

|                              | Min Disagree | Max Agree |
| Unweighted (complete) graphs | O(1) [Bansal Blum C 02]; 4 [Charikar Guruswami Wirth 03]; APX-hard [CGW 03] | PTAS [Bansal Blum C 02] |
| Weighted graphs              | O(log n) [Charikar Guruswami Wirth 03; Emanuel Fiat 03; Immorlica Demaine 03]; 29/28 [CGW 03] | 1.3048 [CGW 03]; 1.3044 [Swamy 04]; 1.0087 [CGW 03] |
35. Typical characteristics
- No well-defined similarity metric
- Inconsistencies in beliefs
- Number of clusters is unknown
- No predefined topics
- Desirable to figure them out as part of the algorithm
- Fuzzy boundaries: how to cluster may depend on the given set of objects