1
Correlation Clustering
  • Shuchi Chawla
  • Carnegie Mellon University
  • Joint work with
  • Nikhil Bansal and Avrim Blum

2
Document Clustering
  • Given a bunch of documents, classify them into
    salient topics
  • Typical characteristics
  • No well-defined similarity metric
  • Number of clusters is unknown
  • No predefined topics; desirable to figure them
    out as part of the algorithm

3
Research Communities
  • Given data on research papers, divide researchers
    into communities by co-authorship
  • Typical characteristics
  • How to divide really depends on the given set of
    researchers
  • Fuzzy boundaries

4
Traditional Approaches to Clustering
  • Approximation algorithms
  • k-means, k-median, k-min sum
  • Matrix methods
  • Spectral Clustering
  • AI techniques
  • EM, classification algorithms

5
Problems with traditional approaches
  • Dependence on underlying metric
  • Objective functions are meaningless without a
    metric, e.g., k-means
  • Algorithm works only on specific metrics (such as
    Euclidean), e.g., spectral methods

6
Problems with traditional approaches
  • Fixed number of clusters
  • Meaningless without prespecified number of
    clusters
  • e.g., for k-means or k-median, if k is
    unspecified, it is best to put each item in
    its own cluster

7
Problems with traditional approaches
  • No clean notion of quality of clustering
  • Objective functions do not directly translate to
    how many items have been grouped wrongly
  • Heuristic approaches
  • Objective functions derived from generative models

8
Cohen, McCallum and Richman's idea
  • Learn a similarity measure on documents
  • may not be a metric!
  • f(x,y) = amount of similarity between x and y
  • Use labeled data to train up this function
  • Classify all pairs with the learned function
  • Find the most consistent clustering
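
A minimal sketch of this pipeline, assuming some learned
pairwise scorer f (the helper name and the 0.5 threshold
are illustrative assumptions, not from the talk):

    from itertools import combinations

    def label_edges(items, f, threshold=0.5):
        """Classify every unordered pair with the learned function f,
        producing the +/- edge labels used in the rest of the talk.
        Keys are sorted pairs so lookups are order-independent."""
        labels = {}
        for x, y in combinations(items, 2):
            labels[tuple(sorted((x, y)))] = '+' if f(x, y) >= threshold else '-'
        return labels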

9
An example
[Figure: four nodes labeled Harry B., Harry Bovik,
H. Bovik, and Tom X., joined by + and - edges]
  • Consistent clustering:
  • + edges inside clusters
  • - edges between clusters

10
An example
[Same figure; edge legend: + = Same, - = Different]
11
An example
[Same figure, with the four nodes grouped into clusters]
  • Task: Find the most consistent clustering
  • or, fewest possible disagreements
  • equivalently, maximum possible agreements

12
Correlation clustering
  • Given a complete graph
  • Each edge labeled + or -
  • Our measure of a clustering: how many labels
    does it agree with? (see the sketch below)
  • Number of clusters depends on the edge labels
  • NP-complete, so we consider approximations
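
A small sketch of this measure in code (the helper and
the dictionary encoding are illustrative assumptions):

    def disagreements(labels, cluster_of):
        """Count labels the clustering disagrees with: a '+' edge
        between different clusters, or a '-' edge inside one cluster.

        labels:     {(x, y): '+' or '-'} over all unordered pairs
        cluster_of: {item: cluster id}
        """
        cost = 0
        for (x, y), sign in labels.items():
            same = cluster_of[x] == cluster_of[y]
            if (sign == '+') != same:
                cost += 1
        return cost

    # The slides' example, under one assumed reading of the figure:
    # one pair of Bovik aliases is labeled '-', so even the natural
    # clustering disagrees with one edge (a "bad triangle" at work).
    labels = {
        ('Harry B.', 'Harry Bovik'): '+',
        ('H. Bovik', 'Harry B.'):    '+',
        ('H. Bovik', 'Harry Bovik'): '-',
        ('H. Bovik', 'Tom X.'):      '-',
        ('Harry B.', 'Tom X.'):      '-',
        ('Harry Bovik', 'Tom X.'):   '-',
    }
    clustering = {'Harry B.': 0, 'Harry Bovik': 0, 'H. Bovik': 0, 'Tom X.': 1}
    print(disagreements(labels, clustering))  # -> 1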

13
Compared to traditional approaches
  • Do not have to specify k
  • No condition on weights; they can be arbitrary
  • Clean notion of quality of clustering: the number
    of examples where the clustering differs from f
  • If a good (perfect) clustering exists, it is easy
    to find

14
Some machine learning justification
  • Noise Removal
  • There is some true classification function f
  • But there are a few errors in the data
  • We want to find the true function
  • Agnostic Learning
  • There is no inherent clustering
  • Try to find the best representation using a
    hypothesis with limited expressivity

15
Our results
  • Constant factor approximation for minimizing
    disagreements
  • PTAS for maximizing agreements
  • Results for the random noise case

16
Minimizing Disagreements
  • Goal: a constant-factor approximation
  • Problem:
  • Even if we find a cluster as good as one in
    OPT, we are headed towards a log n approximation
    (a set-cover-like bound)
  • Idea: lower-bound D_OPT, the number of
    disagreements made by OPT

17
Lower Bounding Idea: Bad Triangles
Consider a bad triangle: two + edges and one - edge.

[Figure: a triangle with edge labels +, +, -]

We know any clustering has to disagree with at
least one of these edges.
18
Lower Bounding Idea: Bad Triangles

If there are several edge-disjoint bad triangles,
then any clustering makes a mistake on each one.

[Figure: five nodes, 1-5, containing 2 edge-disjoint
bad triangles: (1,2,3) and (1,4,5)]

D_OPT >= number of edge-disjoint bad triangles
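
A greedy sketch of this bound (illustrative code; any
edge-disjoint family works, greedy just finds one):

    from itertools import combinations

    def bad_triangle_lower_bound(items, labels):
        """Greedily collect edge-disjoint bad triangles (two '+' edges,
        one '-' edge). Every clustering errs on at least one edge of
        each, so the count lower-bounds D_OPT. Assumes labels is keyed
        by sorted pairs."""
        used = set()   # edges already claimed by a chosen triangle
        count = 0
        for a, b, c in combinations(items, 3):
            edges = [tuple(sorted(p)) for p in ((a, b), (b, c), (a, c))]
            if any(e in used for e in edges):
                continue
            if [labels[e] for e in edges].count('-') == 1:   # bad triangle
                used.update(edges)
                count += 1
        return count
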
19
Using the lower bound
  • δ-clean cluster: a cluster C where each node has
    fewer than δ|C| bad edges (see the sketch below)
  • δ-clean clusters have few bad triangles ⇒ few
    mistakes
  • Possible solution: find a δ-clean clustering
  • Caveat: it may not exist
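
One possible reading of this definition in code (a sketch;
the paper phrases the condition in terms of + neighbors
inside and outside C, which this folds into a single
bad-edge count):

    def is_delta_clean(cluster, all_items, labels, delta):
        """Check the slide's definition: cluster C is delta-clean if
        every node of C has fewer than delta*|C| bad edges, where a
        bad edge at v is a '-' edge into C or a '+' edge leaving C.
        Assumes labels is keyed by sorted pairs."""
        C = set(cluster)
        for v in C:
            bad = 0
            for u in all_items:
                if u == v:
                    continue
                sign = labels[tuple(sorted((u, v)))]
                if (u in C) == (sign == '-'):   # '-' inside or '+' outside
                    bad += 1
            if bad >= delta * len(C):
                return False
        return True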

20
Using the lower bound
  • Caveat: a δ-clean clustering may not exist
  • We show ∃ a clustering whose clusters are either
    δ-clean or singletons
  • Further, it has few mistakes
  • Its nice structure helps us find it easily

21
Maximizing Agreements
  • Easy to obtain a 2-approximation
  • If #(+ edges) > #(- edges):
  • put everything in one cluster
  • Otherwise, n singleton clusters
  • Gets at least half the edges correct
  • Max possible score = total number of edges
  • ⇒ a 2-approximation! (sketched below)
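
The argument above as a direct sketch (helper name
illustrative):

    def max_agree_2approx(items, labels):
        """The slide's trivial 2-approximation: one big cluster agrees
        with every '+' edge, all-singletons with every '-' edge; the
        better of the two scores at least half of all edges, while no
        clustering can score more than all of them."""
        plus = sum(1 for sign in labels.values() if sign == '+')
        if plus >= len(labels) - plus:
            return {x: 0 for x in items}             # one cluster
        return {x: i for i, x in enumerate(items)}   # n singletons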

22
Maximizing Agreements
  • Max possible score: ½n²
  • Goal: obtain an additive approximation of εn²
  • Standard approach (sketched below):
  • Draw a small sample
  • Guess a partition of the sample
  • Compute a partition of the remainder
  • Running time: doubly exponential in 1/ε, or
    singly exponential with a bad exponent
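
A rough sketch of this sample-and-guess framework (the
sample size, group count, and greedy extension below are
illustrative stand-ins, not the paper's parameter choices):

    import random
    from itertools import product

    def sample_and_guess(items, labels, sample_size=8, num_groups=4):
        """Try every partition of a small random sample into num_groups
        groups, greedily extend each guess to the remaining items, and
        keep the clustering with the most agreements. Assumes labels is
        keyed by sorted pairs."""
        def agreements(cluster_of):
            return sum((sign == '+') == (cluster_of[x] == cluster_of[y])
                       for (x, y), sign in labels.items())

        sample = random.sample(items, min(sample_size, len(items)))
        rest = [x for x in items if x not in sample]
        best, best_score = None, -1
        for guess in product(range(num_groups), repeat=len(sample)):
            cluster_of = dict(zip(sample, guess))
            for y in rest:
                # place y where it agrees most with its edges to the sample
                cluster_of[y] = max(
                    range(num_groups),
                    key=lambda g: sum(
                        (labels[tuple(sorted((x, y)))] == '+') == (cluster_of[x] == g)
                        for x in sample))
            score = agreements(cluster_of)
            if score > best_score:
                best, best_score = dict(cluster_of), score
        return best, best_score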

23
Extensions and Open Problems
  • Weighted edges or incomplete graph
  • Recent work by Bartal et al.
  • A log-factor approximation based on multiway cut
  • Better constant for unweighted case
  • Can we use bad triangles (or polygons) more
    directly for a tighter bound?
  • Experimental performance

24
Other problems I have worked on
  • Game Theory and Mechanism Design
  • Approximations for Orienteering and related problems
  • Online search algorithms based on Machine
    Learning approaches
  • Theoretical properties of Power Law graphs
  • Currently working on Privacy with Cynthia

25
  • Thanks!

26
Using the lower bound: δ-clean clusters
  • Proof that the cost of a δ-clean clustering is
    at most 4·OPT

27
Proof outline
  • δ-clean ≤ 4·OPT
  • But a δ-clean clustering may not exist!
  • Show that there is a clustering that is close to
    δ-clean: clusters are either δ-clean or singletons
  • Such a clustering exists close to OPT
  • We will try to find this clustering

28
Existence of OPT(δ)
  • Proof

29
Algorithm
  • Pictorial outline of the algorithm
  • Brief outline of how it finds this clustering

30
Bounding the cost
  • Clusters containing OPT(δ)'s clusters are
    ¼-clean
  • The rest have at most as many mistakes as OPT(δ)

31
Random noise?