Title: Correlation Clustering
1. Correlation Clustering
- Shuchi Chawla
- Carnegie Mellon University
- Joint work with Nikhil Bansal and Avrim Blum
2. Natural Language Processing: Co-reference Analysis
- In order to understand an article automatically, we need to figure out which entities are one and the same
- Is "his" in the second line the same person as "The secretary" in the first line?
3. Other real-world clustering problems
- Web Document Clustering: given a bunch of documents, classify them into salient topics
- Computer Vision: distinguish boundaries between different objects and the background in a picture
- Research Communities: given data on research papers, divide researchers into communities by co-authorship
- Authorship (Citeseer/DBLP): given the authors of documents, figure out which authors are really the same person
4. Traditional Approaches to Clustering
- Approximation algorithms
- k-means, k-median, k-min sum
- Matrix methods
- Spectral Clustering
- AI techniques
- EM, single-linkage, classification algorithms
5. Issues with traditional approaches
- Dependence on an underlying metric
- Objective functions are meaningful only on a metric, e.g., k-means
- Some algorithms work only for specific metrics (such as Euclidean)
- Problem:
- No well-defined similarity metric
- Inconsistencies in beliefs
6. Issues with traditional approaches
- Fixed number of clusters / known topics
- Meaningless without a prespecified number of clusters; e.g., for k-means or k-median, if k is unspecified, it is best to put every item in its own cluster
- Problem:
- Number of clusters is usually unknown
- No predefined topics; desirable to figure them out as part of the algorithm
7. Issues with traditional approaches
- No clean notion of quality of clustering
- Approximations do not directly translate to how many items have been grouped wrongly
- Reliance on a generative model
- e.g., data arising from a mixture of Gaussians
- Typically doesn't work well in the case of fuzzy boundaries
- Problem:
- Fuzzy boundaries: how to cluster may depend on the given set of objects
8. Cohen, McCallum and Richman's idea
- Learn a similarity function based on context
- f(x, y) = amount of similarity between x and y
- Not necessarily a metric!
- Use labeled data to train up this function
- Classify all pairs with the learned function
- Find the clustering that agrees most with the function
- The problem is divided into two separate phases; we deal with the second phase
9. Cohen, McCallum and Richman's idea
Learn a similarity measure based on context
[Figure: a graph of mentions (Mr. Rumsfield, his, The secretary, he, Saddam Hussein) with strong-similarity and strong-dissimilarity edges]
10. A good clustering
[Figure: the same mention graph, partitioned into clusters]
- Consistent clustering:
- positive edges inside clusters
- negative edges between clusters
11. A good clustering
[Figure: the mention graph with a clustering whose violated edges are inconsistencies, or mistakes]
- Consistent clustering:
- positive edges inside clusters
- negative edges between clusters
12. A good clustering
[Figure: a mention graph for which no consistent clustering exists; every clustering makes mistakes]
- Goal: find the most consistent clustering
13. Compared to traditional approaches
- Do not have to specify k: the number of clusters can range from 1 to n
- No condition on weights; they can be arbitrary
- Clean notion of quality of clustering: the number of examples on which the clustering differs from f
- If a good (perfect) clustering exists, it is easy to find
14. From a Machine Learning perspective
- Noise Removal
- There is some true classification function f
- But there are a few errors in the data
- We want to find the true function
- Agnostic Learning
- There is no inherent clustering
- Try to find the best representation using a
hypothesis with limited expressivity
15. Correlation Clustering
- Given a graph with positive (similar) and negative (dissimilar) edges, find the most consistent clustering
- NP-hard [Bansal, Blum, Chawla, FOCS '02]
- Two natural objectives:
- Maximize agreements: (# of + edges inside clusters) + (# of − edges between clusters)
- Minimize disagreements: (# of + edges between clusters) + (# of − edges inside clusters)
- Equivalent at optimality, but different in terms of approximation
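The two objectives are easy to state in code. A minimal sketch (the dict-of-pairs edge representation and the function name are illustrative assumptions, not from the talk):

```python
# Score a clustering against +/- edge labels.
# edges: dict mapping a pair (u, v) to +1 (similar) or -1 (dissimilar).
# cluster_of: dict mapping each vertex to its cluster id.

def agreements_and_disagreements(edges, cluster_of):
    agree = disagree = 0
    for (u, v), sign in edges.items():
        same = cluster_of[u] == cluster_of[v]
        # A + edge agrees when inside a cluster; a - edge agrees when between clusters.
        if (sign == +1) == same:
            agree += 1
        else:
            disagree += 1
    return agree, disagree

# Toy instance: a triangle with two + edges and one - edge.
edges = {(1, 2): +1, (2, 3): +1, (1, 3): -1}
print(agreements_and_disagreements(edges, {1: "A", 2: "A", 3: "A"}))  # (2, 1)
```

Maximizing the first count and minimizing the second pick out the same optimal clustering, but approximating them behaves very differently, as the next slide shows.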
16. Overview of results
- Minimizing disagreements
- Unweighted complete graph: O(1) [Bansal Blum Chawla 02]; 4 [Charikar et al 03]
- Weighted general graph: O(log n) [Charikar et al 03; Demaine et al 03; Emanuel et al 03]
- APX-hardness for the weighted case [Bansal Blum Chawla 02]
- Constant lower bounds for both cases [Charikar et al 03]
- Maximizing agreements
- Unweighted complete graph: PTAS [Bansal Blum Chawla 02]
- Weighted general graphs: 0.7664 [Charikar et al 03]; 0.7666 [Swamy 04]
- Constant lower bound for the weighted case [Charikar et al 03]
This talk: minimizing disagreements on the unweighted complete graph
17. Minimizing Disagreements [Bansal, Blum, Chawla, FOCS '02]
- Goal: approximately minimize the number of mistakes
- Assumption: the graph is unweighted and complete
- A lower bound on OPT: erroneous triangles
- Consider an erroneous triangle: two + edges and one − edge
- Any clustering disagrees with at least one of these edges
- If there are several edge-disjoint erroneous triangles, then any clustering makes a mistake on each one
- D_OPT ≥ maximum fractional packing of erroneous triangles
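The packing lower bound can be illustrated with a greedy sketch. Assumptions not from the talk: the complete graph is given as a dict of ±1 signs, and greedy selection finds *a* family of edge-disjoint erroneous triangles rather than the maximum fractional packing, so it yields a valid but possibly weaker lower bound on D_OPT.

```python
from itertools import combinations

def erroneous_triangle_lower_bound(vertices, sign):
    """sign maps each pair (u, v) to +1 or -1 (complete graph assumed).
    Returns the size of a greedy edge-disjoint packing of erroneous
    triangles, a lower bound on the disagreements of any clustering."""
    def s(u, v):
        return sign[(u, v)] if (u, v) in sign else sign[(v, u)]

    used = set()   # edges already consumed by a chosen triangle
    count = 0
    for u, v, w in combinations(vertices, 3):
        tri = [tuple(sorted(e)) for e in [(u, v), (v, w), (u, w)]]
        if any(e in used for e in tri):
            continue
        # Only a triangle with exactly two + edges and one - edge is erroneous:
        # every other sign pattern admits a consistent clustering.
        if sorted(s(*e) for e in tri) == [-1, +1, +1]:
            used.update(tri)
            count += 1
    return count

# The instance from the later slide: triangles (1,2,3) and (1,4,5) are
# erroneous and edge-disjoint; all unlisted pairs are negative.
sign = {(1, 2): +1, (2, 3): +1, (1, 3): -1,
        (1, 4): +1, (4, 5): +1, (1, 5): -1,
        (2, 4): -1, (2, 5): -1, (3, 4): -1, (3, 5): -1}
print(erroneous_triangle_lower_bound([1, 2, 3, 4, 5], sign))  # 2
```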
18. Using the lower bound: δ-clean clusters
- Relating erroneous triangles to mistakes
- In special cases, we can charge off disagreements to erroneous triangles
- Clean clusters:
- Each vertex has few disagreements incident on it
- "Few" is relative to the size of the cluster
- # of disagreements ≈ # of erroneous triangles
[Figure: a cluster containing a good vertex and a bad vertex; a cluster is clean if all its vertices are good]
19. Using the lower bound: δ-clean clusters
- Relating erroneous triangles to mistakes
- In special cases, we can charge off disagreements to erroneous triangles
- δ-clean clusters:
- Each vertex in cluster C has fewer than δ|C| positive and δ|C| negative mistakes
- δ ≤ ¼ ⇒ # of disagreements ≈ # of erroneous triangles
- A high density of positive edges, so we can easily spot them in the graph
- Possible solution: find a δ-clean clustering and charge disagreements to erroneous triangles
- Caveat: it may not exist
20. Using the lower bound: δ-clean clusters
- Caveat: a δ-clean clustering may not exist
- An almost-δ-clean clustering:
- All clusters are either δ-clean or contain a single node
- An almost-δ-clean clustering always exists, trivially
- We show:
- There exists an almost-δ-clean clustering, OPT(δ), that is almost as good as OPT
- Its nice structure helps us find it easily
21. OPT(δ): clean or singleton
[Figure: the optimal clustering, with bad vertices marked]
- Imaginary procedure:
- Few (≤ δ fraction) bad nodes ⇒ remove them from the cluster
- The new cluster is O(δ)-clean; few new mistakes (mistakes increase by at most a 1/δ factor)
22. OPT(δ): clean or singleton
[Figure: the optimal clustering, with bad vertices marked]
- Imaginary procedure:
- Many (≥ δ fraction) bad nodes ⇒ break up the cluster
- New singleton clusters: few new mistakes (mistakes increase by at most a 1/δ² factor)
- OPT(δ): all clusters are δ-clean or singletons
23. Our algorithm
- Goal: find nearly clean clusters
- Pick an arbitrary vertex v; let C be the positive (+ve) neighbors of v
- Remove any bad vertices from C
- Add vertices that are good w.r.t. C
- Output C and recurse on the remaining graph
- If C is empty for all choices of v, output the remaining vertices as singletons
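The steps above can be sketched as follows. This is a simplified illustration, not the exact BBC '02 procedure: the paper uses distinct constants (on the order of 3δ and 7δ) in the removal and addition steps, whereas this sketch uses a single δ threshold for both.

```python
def cautious_clustering(vertices, pos_edges, delta=0.25):
    """pos_edges: set of frozenset({u, v}) positive pairs; every other pair
    is negative (unweighted complete graph, as assumed on the slides)."""
    nbr = {v: set() for v in vertices}
    for e in pos_edges:
        u, v = tuple(e)
        nbr[u].add(v)
        nbr[v].add(u)

    def good(x, C, rest):
        # x is good w.r.t. C: almost all of C are +neighbors of x,
        # and x has few +neighbors outside C (among remaining vertices).
        others = C - {x}
        if not others:
            return False
        in_deg = len(nbr[x] & others)
        out_deg = len((nbr[x] & rest) - C)
        return in_deg >= (1 - delta) * len(others) and out_deg <= delta * len(others)

    remaining, clusters = set(vertices), []
    while remaining:
        chosen = None
        for v in remaining:
            C = (nbr[v] & remaining) | {v}                # +neighborhood of v
            C = {x for x in C if good(x, C, remaining)}   # remove bad vertices
            C |= {x for x in remaining - C if good(x, C, remaining)}  # add good ones
            if len(C) > 1:
                chosen = C
                break
        if chosen is None:  # no non-trivial cluster found: output singletons
            clusters.extend({v} for v in remaining)
            break
        clusters.append(chosen)
        remaining -= chosen
    return clusters

# Two clean clusters: {1,2,3} fully + inside, {4,5} joined by a + edge.
pos = {frozenset(p) for p in [(1, 2), (1, 3), (2, 3), (4, 5)]}
print(cautious_clustering([1, 2, 3, 4, 5], pos))
```

On this clean instance the sketch recovers the two planted clusters regardless of which vertex it picks first.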
24. Finding clean clusters
- ALG produces O(δ)-clean clusters; compare against OPT(δ)
- Charging off mistakes:
- 1. Mistakes among clean clusters: charge to erroneous triangles
- 2. Mistakes among singletons: no more than the corresponding mistakes in OPT(δ)
- ⇒ constant factor approximation
25. Maximizing Agreements
- Easy to obtain a 2-approximation:
- If (# of pos. edges) > (# of neg. edges), put everything in one cluster
- Otherwise, use n singleton clusters
- Either way, we get at least half the edges correct
- The maximum possible score is the total number of edges
- ⇒ 2-approximation
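This trivial 2-approximation is short enough to write out (the dict-based edge representation is an illustrative assumption):

```python
# Whichever of "one big cluster" and "all singletons" scores higher agrees
# with at least half of all edges, while OPT can never exceed the total
# number of edges -- hence a 2-approximation.

def two_approx_max_agree(vertices, edges):
    """edges: dict mapping frozenset({u, v}) to +1 (similar) or -1 (dissimilar)."""
    pos = sum(1 for s in edges.values() if s == +1)
    neg = len(edges) - pos
    if pos > neg:
        return [set(vertices)]       # every + edge agrees: score = pos >= len(edges)/2
    return [{v} for v in vertices]   # every - edge agrees: score = neg >= len(edges)/2

edges = {frozenset((1, 2)): +1, frozenset((2, 3)): +1, frozenset((1, 3)): -1}
print(two_approx_max_agree([1, 2, 3], edges))  # [{1, 2, 3}]
```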
26. Maximizing Agreements
- Maximum possible score: ½n²
- Goal: obtain an additive approximation of εn²
- Standard approach:
- Draw a small sample
- Guess a partition of the sample
- Compute a partition of the remainder
- Running time: doubly exponential in 1/ε, or singly exponential with a bad exponent
27. Experimental Results [Wellner & McCallum '03]

| Algorithm                            | Dataset 1 | Dataset 2 | Dataset 3 |
| Best-previous-match                  | 90.98     | 88.83     | 70.41     |
| Single-link-threshold                | 91.65     | 88.90     | 60.83     |
| Correlation clustering               | 93.96     | 91.59     | 73.42     |
| % error reduction over previous best | 28        | 24        | 10        |

(Numbers are % accuracy of classification)
28. Future Directions
- Better combinatorial approximation
- A good iterative approximation: on few changes to the graph, quickly recompute a good clustering
- Minimizing correlation: (number of agreements) − (number of disagreements)
- A log-approximation is known; can we get a constant factor approximation?
29. Questions?
30. Future Directions
- Clustering with small clusters:
- Given that all clusters in OPT have size at most k, find a good approximation; is this NP-hard?
- Different from finding the best clustering with small clusters, without a guarantee on OPT
- Clustering with few clusters:
- Given that OPT has at most k clusters, find an approximation
- Maximizing correlation:
- (number of agreements) − (number of disagreements)
- Can we get a constant factor approximation?
31. Lower Bounding Idea: Erroneous Triangles
- If there are several edge-disjoint erroneous triangles, then any clustering makes a mistake on each one
[Figure: vertices 1-5 with two edge-disjoint erroneous triangles, (1,2,3) and (1,4,5); the clustering shown makes 3 mistakes]
- D_OPT ≥ maximum fractional packing of erroneous triangles
32. Open Problems
- Clustering with small clusters:
- In most applications, clusters are very small
- Given that all clusters in OPT have size at most k, find a good approximation
- Different from finding the best clustering with small clusters, without a guarantee on OPT
- Optimal solution for unweighted graphs? A possible approach:
- Any two vertices in the same cluster in OPT are neighbors or share a common neighbor
- We can find a list of O(n2k) clusters such that all of OPT's clusters are in this list
- When k is small, there are only polynomially many choices to pick from
33. Open Problems
- Clustering with few clusters:
- Given that OPT has at most k clusters, find an approximation
- Consensus clustering:
- Given a sum of k clusterings, find the best consensus clustering
- Easy 2-approximation; can we get a PTAS?
- Maximizing correlation:
- (number of agreements) − (number of disagreements)
- Bad case: the disagreements are a constant fraction of the total weight
- Charikar & Wirth obtained a constant factor approximation
- Can we get a PTAS in unweighted graphs?
34. Overview of results

|                              | Min Disagree | Max Agree |
| Unweighted (complete) graphs | O(1) [Bansal Blum C 02]; 4 [Charikar Guruswami Wirth 03]; APX-hard [CGW 03] | PTAS [Bansal Blum C 02] |
| Weighted graphs              | O(log n) [Charikar Guruswami Wirth 03; Emanuel Fiat 03; Immorlica Demaine 03]; 29/28 [CGW 03] | 1.3048 [CGW 03]; 1.3044 [Swamy 04]; 1.0087 [CGW 03] |
35. Typical characteristics
- No well-defined similarity metric
- Inconsistencies in beliefs
- Number of clusters is unknown
- No predefined topics
- Desirable to figure them out as part of the algorithm
- Fuzzy boundaries: how to cluster may depend on the given set of objects