MAFIA: Adaptive Grids for Clustering Massive Data Sets

Provided by: cseOhi

1
MAFIA: Adaptive Grids for Clustering Massive Data Sets
  • Harsha Nagesh, Sanjay Goil, Alok Choudhary
  • -Udeepta Bordoloi

2
Clustering Algorithms
  • BIRCH
  • ROCK
  • CLIQUE
    • Inputs: grid size and density threshold
    • Prunes subspaces
  • MAFIA
    • Adaptive grid size
    • Input: density threshold
    • No pruning of subspaces

3
Grids: the CLIQUE way
  • Along each dimension:
    • Divide the whole range into intervals (windows)
      of a size given by the user.
    • Threshold the number of points in each interval
      by the user-input density to get clusters.
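
A minimal sketch of the fixed-grid step along one dimension, assuming numeric data and a threshold given as a point count. The function name and the parameters `width` and `min_points` are illustrative only; they are not taken from the CLIQUE paper or the slides.

```python
import math
from collections import Counter

def dense_intervals_uniform(values, lo, hi, width, min_points):
    """Split [lo, hi] into equal intervals of the given width and
    return (start, end, count) for intervals with >= min_points values."""
    n_bins = math.ceil((hi - lo) / width)
    counts = Counter()
    for v in values:
        # Clamp so that v == hi lands in the last interval.
        idx = min(int((v - lo) // width), n_bins - 1)
        counts[idx] += 1
    return [(lo + i * width, lo + (i + 1) * width, counts[i])
            for i in range(n_bins) if counts[i] >= min_points]

data = [1, 1, 2, 2, 2, 3, 9, 14, 14, 15]
print(dense_intervals_uniform(data, lo=0, hi=15, width=5, min_points=3))
# [(0, 5, 6), (10, 15, 3)]
```

The interval width is the same everywhere, so the result depends on how well the user-chosen width happens to line up with the data.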

4
Grids: the MAFIA way
  • Along each dimension:
    • Divide the whole range into many small windows.
    • Compute a histogram (assuming discrete data
      here).
    • E.g., we can divide the range of natural numbers
      (1-15) into 5 windows (1-3, 4-6, ..., 13-15).
    • Value of a window = max(histogram value within
      the window).
    • E.g., if there are three 1s, zero 2s, and five
      3s, then the value of the first window (1-3) is
      five.
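
The window-value rule can be sketched as follows, assuming discrete integer data as the slide does. The function name `window_values` and its parameters are hypothetical.

```python
from collections import Counter

def window_values(values, lo, hi, window_size):
    """Return (start, end, value) per window, where value is the
    largest histogram count of any single data value in the window."""
    hist = Counter(values)  # missing values count as 0
    result = []
    for start in range(lo, hi + 1, window_size):
        end = min(start + window_size - 1, hi)
        peak = max(hist[v] for v in range(start, end + 1))
        result.append((start, end, peak))
    return result

# Three 1s, zero 2s, five 3s, plus a few scattered points.
data = [1, 1, 1, 3, 3, 3, 3, 3, 5, 14]
print(window_values(data, lo=1, hi=15, window_size=3))
# [(1, 3, 5), (4, 6, 1), (7, 9, 0), (10, 12, 0), (13, 15, 1)]
```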

5
Grids: the MAFIA way
  • Along each dimension (contd.):
    • From left to right, merge adjacent windows whose
      values differ by less than a threshold β.
    • β can be made a user input, but the authors
      hard-coded it (25-75).
    • What if no partition can be detected?
      • Divide the range equally.
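
A rough sketch of the left-to-right merge with the equal-split fallback. Several details are assumptions, not taken from the slides: β is treated as a relative difference (a fraction of the larger value), a merged window keeps the larger of the two values, and the fallback splits into a hypothetical `fallback_parts` pieces.

```python
def merge_windows(windows, beta):
    """windows: list of (start, end, value), sorted left to right.
    Merge neighbours whose values differ by less than beta (relative)."""
    merged = [windows[0]]
    for start, end, value in windows[1:]:
        m_start, m_end, m_value = merged[-1]
        scale = max(m_value, value)
        if scale == 0 or abs(m_value - value) / scale < beta:
            # Assumption: the merged window keeps the larger value.
            merged[-1] = (m_start, end, max(m_value, value))
        else:
            merged.append((start, end, value))
    return merged

def adaptive_windows(windows, beta, fallback_parts=5):
    merged = merge_windows(windows, beta)
    if len(merged) > 1:
        return merged
    # No partition detected: divide the range equally.
    lo, hi, value = merged[0]
    step = -(-(hi - lo + 1) // fallback_parts)  # ceiling division
    return [(s, min(s + step - 1, hi), value) for s in range(lo, hi + 1, step)]

fine = [(1, 3, 5), (4, 6, 5), (7, 9, 1), (10, 12, 1), (13, 15, 1)]
print(adaptive_windows(fine, beta=0.35))
# [(1, 6, 5), (7, 15, 1)]
```

With a uniform histogram every window merges into one, and the fallback returns an equal-width grid instead.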

6
Compare
[Figure panels: MAFIA, CLIQUE]

7
Which windows are cluster candidates?
  • CLIQUE: uses the user-input threshold.
  • MAFIA: uses the user-input threshold normalized to
    the window size.
  • Cluster dominance factor α
  • Reports clusters as DNF expressions.
  • Cluster candidates are henceforth referred to as
    Candidate Dense Units (CDUs).
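
One way to read "normalized to window size" is that a window must hold more than the dominance factor times the number of points it would receive under a uniform distribution. The sketch below assumes that reading; the formula, function names, and parameters are illustrative and should be checked against the paper.

```python
def normalized_threshold(alpha, n_points, window_width, dim_range):
    """Assumed rule: alpha times the expected count of a window of
    this width if n_points were spread uniformly over dim_range."""
    return alpha * n_points * window_width / dim_range

def candidate_windows(windows, alpha, n_points, dim_range):
    """windows: list of (start, end, count) with inclusive integer bounds."""
    out = []
    for start, end, count in windows:
        width = end - start + 1
        if count > normalized_threshold(alpha, n_points, width, dim_range):
            out.append((start, end))
    return out

# 100 points over the range 1-15; thresholds are 30, 30 and 90.
windows = [(1, 3, 40), (4, 6, 30), (7, 15, 30)]
print(candidate_windows(windows, alpha=1.5, n_points=100, dim_range=15))
# [(1, 3)]
```

A wide window needs proportionally more points than a narrow one, so a single user value can be applied to windows of different sizes.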

8
Algorithm: Initialization
  • B = number of records that fit into memory.
  • Read the data in chunks of B records and build a
    histogram for each dimension.
  • Determine the adaptive windows for each
    dimension, and the normalized thresholds for each
    window.
  • Get the candidate windows in each dimension.
  • Set the working-dimension variable k = 1.
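
The chunked histogram pass can be sketched like this, assuming records arrive as an iterable of equal-length tuples of discrete values. `chunk_size` plays the role of B; the function name is hypothetical.

```python
from collections import Counter
from itertools import islice

def histograms_in_chunks(records, n_dims, chunk_size):
    """Build one histogram per dimension while holding at most
    chunk_size records in memory at a time."""
    hists = [Counter() for _ in range(n_dims)]
    it = iter(records)
    while True:
        chunk = list(islice(it, chunk_size))  # next B records
        if not chunk:
            break
        for rec in chunk:
            for d in range(n_dims):
                hists[d][rec[d]] += 1
    return hists

records = [(1, 10), (1, 11), (2, 10), (3, 10), (3, 12)]
hists = histograms_in_chunks(records, n_dims=2, chunk_size=2)
print(dict(hists[0]))  # {1: 2, 2: 1, 3: 2}
print(dict(hists[1]))  # {10: 3, 11: 1, 12: 1}
```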

9
Main Loop
  • Repeat
    • k = k + 1
    • Find candidate dense units (by combining
      dimensions).
    • Read through the data to find how many points lie
      in each of these CDUs.
    • Find the true dense units.
  • Until no more dense units are found.
  • Report the true dense units as clusters.
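
The counting pass inside the loop might look like the sketch below. A CDU is represented as a frozenset of (dimension, window index) pairs, and a CDU is kept when its count reaches the largest threshold among its windows. Both the representation and that density rule are assumptions made for illustration.

```python
from bisect import bisect_right
from collections import Counter

def window_index(starts, value):
    """starts: sorted left edges of the windows in one dimension."""
    return bisect_right(starts, value) - 1

def true_dense_units(records, window_starts, cdus, thresholds):
    """Count the records falling in each CDU and keep the dense ones.
    thresholds maps (dim, window index) to that window's threshold."""
    counts = Counter()
    for rec in records:
        # The set of windows this record falls into, one per dimension.
        cell = {(d, window_index(starts, rec[d]))
                for d, starts in enumerate(window_starts)}
        for cdu in cdus:
            if cdu <= cell:  # record lies inside every window of the CDU
                counts[cdu] += 1
    return {cdu: counts[cdu] for cdu in cdus
            if counts[cdu] >= max(thresholds[w] for w in cdu)}

window_starts = [[1, 7], [1, 4]]  # dim 0: windows at 1 and 7; dim 1: at 1 and 4
records = [(2, 5), (3, 6), (8, 5), (1, 2), (2, 4)]
cdus = [frozenset({(0, 0), (1, 1)}), frozenset({(0, 1), (1, 1)})]
thresholds = {(0, 0): 2, (0, 1): 2, (1, 0): 2, (1, 1): 3}
print(true_dense_units(records, window_starts, cdus, thresholds))
```

Here three records fall in the first CDU and one in the second, so only the first survives.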

10
Building CDUs
  • CDUs in k dimensions:
    • Merge two dense units of (k-1) dimensions
      such that they share any (k-2) dimensions.
    • Each dense unit of (k-1) dimensions has to be
      compared with every other dense unit.
    • This can lead to duplicate CDUs, so every CDU is
      compared with every other CDU.
  • Dense units which cannot be combined are a
    potential cluster (in a subspace).
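
A small sketch of the pairwise join, with dense units again written as frozensets of (dimension, window index) pairs (an assumed representation). Note one deliberate substitution: duplicates are removed by collecting CDUs in a Python set rather than by a second pairwise comparison pass as the slides describe.

```python
from itertools import combinations

def build_cdus(dense_units):
    """Join (k-1)-dimensional dense units that agree on k-2 dimensions.
    Returns (set of k-dimensional CDUs, list of units that joined nothing)."""
    units = list(dense_units)
    cdus, combined = set(), set()
    for a, b in combinations(units, 2):
        merged = a | b
        dims = {dim for dim, _ in merged}
        # Exactly one new pair was added, and no dimension appears twice.
        if len(merged) == len(a) + 1 and len(dims) == len(merged):
            cdus.add(merged)  # the set silently drops repeated CDUs
            combined.update((a, b))
    uncombined = [u for u in units if u not in combined]
    return cdus, uncombined

units_2d = [
    frozenset({(0, 1), (1, 2)}),
    frozenset({(1, 2), (2, 0)}),
    frozenset({(0, 1), (2, 0)}),
    frozenset({(3, 4), (4, 4)}),
]
cdus, leftover = build_cdus(units_2d)
print(len(cdus), leftover)
```

The first three 2D units pair up three different ways but all produce the same 3D CDU, which is the duplication the slides warn about; the fourth unit joins nothing and remains a potential cluster in its own subspace.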

11
Building CDUs: example (2D → 3D)
  • We can get repeated CDUs.
  • Two passes are required:
    • to combine two 2D units into one 3D unit;
    • to eliminate repeated CDUs.

12
Variables (Recap)
  • Cluster dominance factor, α
    • High α gives strong clusters, and vice versa.
    • Usual value: 1.5
  • Window merging threshold, β
    • High β gives fine windows, and vice versa.

13
MAFIA vs. CLIQUE (speedup)
  • CLIQUE was used
    • without pruning;
    • with 10 bins for each dimension;
    • with different thresholds?

14
MAFIA vs. CLIQUE (number of CDUs computed)
  • Single 7D cluster in a 10D data space
  • CLIQUE: 75 6D clusters, 546 7D clusters

15
MAFIA vs. CLIQUE (quality)
  • Two 4D clusters in a 10D data space
  • CLIQUE: cluster boundaries are very unreliable.
  • When using a variable number of (fixed-size) bins
    in each dimension (how?), it misses one cluster.

16
MAFIA (scalability)