Jeremy Tantrum, - PowerPoint PPT Presentation

About This Presentation
Title:

Jeremy Tantrum,

Description:

Hierarchical Model-Based Clustering of Large Datasets Through Fractionation and Refractionation Jeremy Tantrum, Department of Statistics, University of Washington – PowerPoint PPT presentation

Slides: 32
Provided by: Mscc
Learn more at: https://stat.uw.edu

Transcript and Presenter's Notes

1
Hierarchical Model-Based Clustering of Large
Datasets Through Fractionation and Refractionation
  • Jeremy Tantrum,
  • Department of Statistics,
  • University of Washington
  • joint work with
  • Alejandro Murua, Insightful Corporation
  • Werner Stuetzle, University of Washington

This work has been supported by NSA grant 62-1942
2
Motivating Example
  • Consider clustering documents
  • Topic Detection and Tracking corpus
  • 15,863 news stories for one year from Reuters
    and CNN
  • 25,000 unique words
  • Possibly many topics
  • Large numbers of observations
  • High dimensions
  • Many groups

3
Goal of Clustering
Detect that there are 5 or 6 groups.
Assign observations to groups.
4
NonParametric Clustering
  • Premise
  • Observations are sampled from a density p(x)
  • Groups correspond to modes of p(x)

5
NonParametric Clustering
Fitting: Estimate p(x) nonparametrically and
find significant modes of the estimate.
6
Model Based Clustering
  • Premise
  • Observations are sampled from a mixture density
  • $p(x) = \sum_g \pi_g \, p_g(x)$
  • Groups correspond to mixture components

7
Model Based Clustering
Fitting: Estimate the mixing proportions $\pi_g$ and the parameters of $p_g(x)$
8
Model Based Clustering
  • Fitting a Mixture of Gaussians
  • Use the EM algorithm to maximize the log
    likelihood
  • Estimates the probabilities of each observation
    belonging to each group
  • Maximizes likelihood given these probabilities
  • Requires a good starting point
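
The two EM steps listed above can be sketched for a one-dimensional Gaussian mixture. This is a minimal illustration rather than the authors' implementation: the function names, the fixed iteration count, and the small variance floor are assumptions added here, and the starting values must be supplied by the caller, which is where the need for a good starting point shows up.

```python
import math

def normal_pdf(x, mean, var):
    """Density of a 1-D Gaussian with the given mean and variance."""
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def em_gmm_1d(xs, means, variances, weights, n_iter=50):
    """Run EM for a 1-D Gaussian mixture from the given starting parameters."""
    n, g = len(xs), len(means)
    means, variances, weights = list(means), list(variances), list(weights)
    for _ in range(n_iter):
        # E-step: probability of each observation belonging to each group.
        resp = []
        for x in xs:
            dens = [weights[k] * normal_pdf(x, means[k], variances[k])
                    for k in range(g)]
            total = sum(dens)
            resp.append([d / total for d in dens])
        # M-step: maximize the likelihood given these probabilities.
        for k in range(g):
            nk = sum(r[k] for r in resp)
            weights[k] = nk / n
            means[k] = sum(r[k] * x for r, x in zip(resp, xs)) / nk
            var = sum(r[k] * (x - means[k]) ** 2 for r, x in zip(resp, xs)) / nk
            variances[k] = max(var, 1e-6)  # assumed floor, avoids zero variance
    return weights, means, variances

# Hypothetical data: two well-separated groups.
xs = [0.0, 0.2, -0.1, 0.1, 5.0, 5.2, 4.9, 5.1]
weights, means, variances = em_gmm_1d(xs, [1.0, 4.0], [1.0, 1.0], [0.5, 0.5])
```

With a poor start (for example both means placed at the same value) the two components stay identical, which is why the slides turn to hierarchical clustering for initialization.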

9
Model Based Clustering
  • Hierarchical Clustering
  • Provides a good starting point for EM algorithm
  • Start with every point being its own cluster
  • Merge the two closest clusters
  • Measured by the decrease in likelihood when those
    two clusters are merged
  • Uses the Classification Likelihood not the
    Mixture Likelihood
  • Algorithm is quadratic in the number of
    observations
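
The merge loop described above can be sketched under one simplifying assumption that is not stated on the slide: all clusters share a single spherical covariance. Under that model, the merge that least decreases the classification likelihood is the one with the smallest increase in within-cluster sum of squares (Ward's criterion). The function names are illustrative, and this naive pair search is cubic overall, not the quadratic algorithm the slide refers to.

```python
def merge_cost(a, b):
    """Increase in within-cluster sum of squares if clusters a and b merge.

    Each cluster is a pair (n, mean), where mean is a list of coordinates.
    """
    (na, ma), (nb, mb) = a, b
    d2 = sum((x - y) ** 2 for x, y in zip(ma, mb))
    return na * nb / (na + nb) * d2

def agglomerate(points, n_clusters):
    """Start with every point as its own cluster; merge the closest pair."""
    clusters = [(1, list(p)) for p in points]
    members = [[i] for i in range(len(points))]
    while len(clusters) > n_clusters:
        # Find the pair whose merge costs least (naive search over all pairs).
        _, i, j = min((merge_cost(clusters[i], clusters[j]), i, j)
                      for i in range(len(clusters))
                      for j in range(i + 1, len(clusters)))
        (na, ma), (nb, mb) = clusters[i], clusters[j]
        n = na + nb
        clusters[i] = (n, [(na * x + nb * y) / n for x, y in zip(ma, mb)])
        members[i] += members[j]
        del clusters[j], members[j]  # j > i, so index i is unaffected
    return members

# Hypothetical data: two tight groups of three points each.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
groups = agglomerate(points, 2)
```

The resulting partition can then be turned into starting means, variances, and proportions for EM.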

10
Likelihood Distance
[Figure: densities p(x), p1(x), and p2(x)]
11
Bayesian Information Criterion
  • Choose number of clusters by maximizing the
    Bayesian Information Criterion
  • r is the number of parameters
  • n is the number of observations
  • Log likelihood penalized for complexity
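
The formula itself did not survive in the transcript. A common convention in the model-based clustering literature, assumed here, is BIC = 2 log L - r log n, where larger values are better. The sketch below uses that convention; the function names and the fitted values in the example are hypothetical.

```python
import math

def bic(log_likelihood, n_params, n_obs):
    """BIC under the assumed convention 2*logL - r*log(n); larger is better."""
    return 2.0 * log_likelihood - n_params * math.log(n_obs)

def choose_n_clusters(fits, n_obs):
    """Pick the number of clusters with the largest BIC.

    fits maps a number of clusters to a pair
    (maximized log likelihood, number of parameters r).
    """
    return max(fits, key=lambda g: bic(fits[g][0], fits[g][1], n_obs))

# Hypothetical fits for 1, 2, and 3 clusters on n = 200 observations.
fits = {1: (-500.0, 2), 2: (-420.0, 5), 3: (-418.0, 8)}
best = choose_n_clusters(fits, 200)
```

In this made-up example the third cluster raises the log likelihood by only 2 while adding 3 parameters, so the penalty outweighs the gain and 2 clusters are chosen.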

12
Fractionation
Invented by Cutting, Karger, Pedersen and Tukey
for nonparametric clustering of large datasets.
M is the largest number of observations for which
a hierarchical $O(M^2)$ algorithm is computationally
feasible.
13
Fractionation
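  • $\alpha n$ meta-observations after the first round (each fraction of M observations is reduced to $\alpha M$ meta-observations, $\alpha < 1$)
  • $\alpha^2 n$ meta-observations after the second round
  • $\alpha^i n$ meta-observations after the ith round
  • For the ith pass, we have $\alpha^{i-1} n / M$ fractions
    taking $O(M^2)$ operations each
  • Total number of operations is $\sum_i (\alpha^{i-1} n / M) \, O(M^2) = O(nM / (1 - \alpha))$
  • Total running time is linear in n!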
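
The linear growth can be checked with a small operation counter. This is a sketch under stated assumptions: each fraction of at most M items is charged M^2 operations, each fraction yields exactly k meta-observations (so k/M plays the role of the reduction factor), and a final clustering of the remaining items is charged M^2. The function name and these accounting choices are illustrative.

```python
import math

def fractionation_ops(n, M, k):
    """Count operations for fractionation with fraction size M and
    k meta-observations kept per fraction (assumed cost model: M*M per fraction)."""
    assert 0 < k <= M // 2, "each round must shrink the data"
    count, ops = n, 0
    while count > M:
        fractions = math.ceil(count / M)
        ops += fractions * M * M   # cluster every fraction once
        count = fractions * k      # meta-observations passed to the next round
    return ops + M * M             # final clustering of at most M items

ops_1m = fractionation_ops(1_000_000, 1000, 100)
ops_2m = fractionation_ops(2_000_000, 1000, 100)
ratio = ops_2m / ops_1m
```

Doubling n roughly doubles the count, and the total stays close to nM / (1 - k/M), the sum of the geometric series over rounds.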

14
Model Based Fractionation
  • Use model based clustering
  • Meta-observations contain all sufficient
    statistics $(n_i, \mu_i, \Sigma_i)$
  • $n_i$ is the number of observations (size)
  • $\mu_i$ is the mean (location)
  • $\Sigma_i$ is the covariance matrix (shape and volume)
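
Because a meta-observation carries its count, mean, and covariance, two meta-observations can be merged without revisiting the underlying data. The sketch below shows that pooling step, assuming the covariance uses divisor n (the maximum likelihood form); the function names are illustrative and not taken from the talk.

```python
def meta_from_points(points):
    """Summarize raw points as (n, mean, covariance with divisor n)."""
    n, d = len(points), len(points[0])
    mean = [sum(p[i] for p in points) / n for i in range(d)]
    cov = [[sum((p[i] - mean[i]) * (p[j] - mean[j]) for p in points) / n
            for j in range(d)] for i in range(d)]
    return n, mean, cov

def merge_meta(a, b):
    """Combine two meta-observations (n, mean, cov) into one."""
    (na, ma, Sa), (nb, mb, Sb) = a, b
    n, d = na + nb, len(ma)
    mean = [(na * ma[i] + nb * mb[i]) / n for i in range(d)]
    cov = [[0.0] * d for _ in range(d)]
    for i in range(d):
        for j in range(d):
            # Scatter within each part, plus the shift of its mean to the pooled mean.
            sa = Sa[i][j] + (ma[i] - mean[i]) * (ma[j] - mean[j])
            sb = Sb[i][j] + (mb[i] - mean[i]) * (mb[j] - mean[j])
            cov[i][j] = (na * sa + nb * sb) / n
    return n, mean, cov

# Hypothetical points: merging summaries matches summarizing the union.
a_pts = [(0.0, 0.0), (2.0, 0.0), (0.0, 2.0)]
b_pts = [(5.0, 5.0), (7.0, 6.0)]
merged = merge_meta(meta_from_points(a_pts), meta_from_points(b_pts))
direct = meta_from_points(a_pts + b_pts)
```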

15
Model Based Fractionation
16
Example 2
17
Refractionation
  • Problem
  • If the number of meta-observations generated from
    a fraction is less than the number of groups in
    that fraction then two or more groups will be
    merged.
  • Once observations from two groups are merged they
    can never be split again.
  • Solution
  • Apply fractionation repeatedly.
  • Use meta-observations from the previous pass of
    fractionation to create better fractions.

18
Example 2 Continued
19
Example 2 Pass 2
20
Example 2 Pass 3
21
Realistic Example
  • 1100 documents from the TDT corpus partitioned by
    people into 19 topics
  • Transformed into 50 dimensional space using
    Latent Semantic Indexing

[Figure: projection of the data onto a plane; colors represent topics]
22
Realistic Example
Want to create a dataset with more observations
and more groups.
Idea: Replace each group with a scaled and
transformed version of the entire data set.
23
Realistic Example
Want to create a dataset with more observations
and more groups.
Idea: Replace each group with a scaled and
transformed version of the entire data set.
24
Realistic Example
  • To measure similarity of clusters to groups
  • Fowlkes-Mallows index
  • Geometric average of
  • Probability of 2 randomly chosen observations
    from the same cluster being in the same group
  • Probability of 2 randomly chosen observations
    from the same group being in the same cluster
  • A Fowlkes-Mallows index near 1 means the clusters
    are good estimates of the groups
  • Clustering the 1100 documents gives a
    Fowlkes-Mallows index of 0.76, our gold
    standard
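
The index as defined in the bullets above can be computed by counting pairs. This is a minimal sketch with illustrative names; it assumes at least one cluster and one group contain two or more observations, otherwise the conditional probabilities are undefined.

```python
import math
from collections import Counter

def pairs(m):
    """Number of unordered pairs among m items."""
    return m * (m - 1) // 2

def fowlkes_mallows(groups, clusters):
    """Geometric mean of Pr(same group | same cluster) and
    Pr(same cluster | same group), from two label sequences."""
    same_both = sum(pairs(c) for c in Counter(zip(groups, clusters)).values())
    same_group = sum(pairs(c) for c in Counter(groups).values())
    same_cluster = sum(pairs(c) for c in Counter(clusters).values())
    p_group_given_cluster = same_both / same_cluster
    p_cluster_given_group = same_both / same_group
    return math.sqrt(p_group_given_cluster * p_cluster_given_group)

# Hypothetical labels: one observation from group 0 lands in cluster 1.
fm = fowlkes_mallows([0, 0, 0, 1, 1, 1], [0, 0, 1, 1, 1, 1])
```

In this example 4 pairs agree in both labelings, 7 pairs share a cluster, and 6 pairs share a group, so the index is sqrt((4/7)(4/6)).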

25
Realistic Example
  • 19 × 19 = 361 clusters, 19 × 1100 = 20,900 observations in
    50 dimensions
  • Fraction size ≈ 1000 with 100 meta-observations per
    fraction
  • 4 passes of fractionation choosing 361 clusters

Distribution of the number of groups per fraction
(nf = number of fractions):

Pass  Min  Median  Max  nf
1     270  289     296  20
2     18   88      150  18
3     18   19      60   17
4     19   19      58   16
26
Realistic Example
  • 19 × 19 = 361 clusters, 19 × 1100 = 20,900 observations in
    50 dimensions
  • Fraction size ≈ 1000 with 100 meta-observations per
    fraction
  • 4 passes of fractionation choosing 361 clusters
  • Purity of the clusters: the sum of the number of
    groups represented in each cluster
  • 361 is perfect

Pass  Fowlkes-Mallows  Purity of the clusters
1     0.325            1729
2     0.554            908
3     0.616            671
4     0.613            651
27
Realistic Example
  • 19 × 19 = 361 clusters, 19 × 1100 = 20,900 observations in
    50 dimensions
  • Fraction size ≈ 1000 with 100 meta-observations per
    fraction
  • 4 passes of fractionation choosing 361 clusters
  • Refractionation
  • Purifies fractions
  • Successfully deals with the case where the
    number of groups is greater than $\alpha M$, the number
    of meta-observations

28
Contributions
  • Model Based Fractionation
  • Extended fractionation idea to parametric setting
  • Incorporates information about size, shape and
    volume of clusters
  • Chooses number of clusters
  • Still linear in n
  • Model Based ReFractionation
  • Extended fractionation to handle larger number of
    groups

29
Extensions
  • Extend to 100,000s of observations and 1000s of
    groups
  • Currently the number of groups must be less than
    M
  • Extend to a more flexible class of models
  • With small groups in high dimensions, we need a
    more constrained model (fewer parameters) than
    the full covariance model
  • Mixture of Factor Analyzers

31
Fowlkes-Mallows Index
              Clusters
True groups   1     2     ...   I     Total
1             n11   n12   ...   n1I   n1.
2             n21   n22   ...   n2I   n2.
...
J             nJ1   nJ2   ...   nJI   nJ.
Total         n.1   n.2   ...   n.I   n

Pr(2 documents are in the same group | they are in
the same cluster)
Pr(2 documents are in the same cluster | they are
in the same group)
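
The formulas for these two probabilities did not survive in the transcript. In the standard pair-counting form, assumed here, with $n_{ji}$ the number of documents in group $j$ and cluster $i$, $n_{j\cdot}$ the row totals, and $n_{\cdot i}$ the column totals, they can be written as:

```latex
\Pr(\text{same group} \mid \text{same cluster})
  = \frac{\sum_{j=1}^{J} \sum_{i=1}^{I} \binom{n_{ji}}{2}}{\sum_{i=1}^{I} \binom{n_{\cdot i}}{2}},
\qquad
\Pr(\text{same cluster} \mid \text{same group})
  = \frac{\sum_{j=1}^{J} \sum_{i=1}^{I} \binom{n_{ji}}{2}}{\sum_{j=1}^{J} \binom{n_{j\cdot}}{2}}

\mathrm{FM} = \sqrt{\Pr(\text{same group} \mid \text{same cluster}) \cdot
                    \Pr(\text{same cluster} \mid \text{same group})}
```

The shared numerator counts pairs of documents that fall in the same group and the same cluster; each denominator counts the pairs that satisfy the conditioning event.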