SCHISM: A new approach for interesting subspace mining

1
SCHISM: A new approach for interesting subspace mining
  • Karlton Sequeira, Mohammed Zaki
  • Computer Science Department
  • RPI, Troy, New York 12180

2
Motivation
  • Many clustering algorithms fail on high-dimensional
    datasets due to:
      • Sparsity, which causes many distance functions to
        provide little contrast between nearest and
        farthest neighbors (Beyer 99)
      • Some dimensions being irrelevant
  • Feature selection searches only a single subspace
    for clusters
  • KLT/PCA may be unsuitable (Agrawal 98) because:
      • it assumes data concentration in mutually orthogonal
        linear directions
      • its output is hard to interpret
  • Data may lie in different, sometimes overlapping,
    lower-dimensional subspaces

3
Preliminaries
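  • Let $U = A_1 \times A_2 \times \dots \times A_d$ be the high-dimensional space.
  • $DB = \{ r_i \mid i \in [1, n],\ r_i \in U \}$.
  • A subspace is a grid-aligned hyper-rectangle
    $[l_1, h_1] \times [l_2, h_2] \times \dots \times [l_d, h_d] \subseteq U$, with $[l_i, h_i] \subseteq A_i$.
  • A dimension $A_i$ is constrained if $[l_i, h_i] \subset A_i$.
  • $S_p$: a subspace constrained in $p$ dimensions, i.e., a
    p-subspace.
  • $n(S_p)$: the number of points in $S_p$.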

$S_p$ is interesting if $n(S_p)$ is statistically significantly higher than
expected, assuming the dimensions are independently distributed.
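
The count n(Sp) defined above can be illustrated with a minimal sketch. It assumes the database is a NumPy array with one row per point and that a subspace is given as a dictionary from constrained dimension index to an inclusive (low, high) range; the function name `count_in_subspace` and this representation are illustrative choices, not taken from the slides.

```python
import numpy as np

def count_in_subspace(db, bounds):
    """Return n(S): the number of rows of db inside the hyper-rectangle.

    bounds maps a constrained dimension index to an inclusive (low, high)
    pair; dimensions absent from bounds are unconstrained (assumed layout).
    """
    inside = np.ones(len(db), dtype=bool)
    for dim, (low, high) in bounds.items():
        inside &= (db[:, dim] >= low) & (db[:, dim] <= high)
    return int(inside.sum())

# Three points in a 3-dimensional space; the subspace constrains dimensions 0 and 1.
db = np.array([[0.1, 0.9, 0.5],
               [0.2, 0.8, 0.1],
               [0.7, 0.2, 0.4]])
s2 = {0: (0.0, 0.5), 1: (0.5, 1.0)}
print(count_in_subspace(db, s2))
```

Leaving a dimension out of `bounds` corresponds to an unconstrained dimension, so an empty dictionary counts every point.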
4
Drawbacks of Apriori-based methods
  • Apriori-based methods:
      • discretize each $A_i$ into $\xi$ intervals
      • $\forall p$: $n(S_p) > \text{thresh} \Rightarrow S_p$ is interesting,
        regardless of $S_p$'s volume!
  • So thresh is set very low to find interesting
    subspaces, especially high-dimensional ones

[Figure: Thresh(p) plotted against p for SCHISM and CLIQUE]
5
Interestingness Measure
  • Let $X_p$ be the r.v. for $n(S_p)$; $\tau \ll 1$ is
    user-specified
  • $\Pr(X_p \ge n(S_p)) \le \tau \Rightarrow S_p$ is an interesting
    subspace

[Figure: $E[X_p]$ and $E[X_p] + nt$ marked on the axis]

6
Chernoff-Hoeffding bound
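  • If $Y_i$, $i = 1, \dots, n$, are independently distributed r.v.s with
    $0 \le Y_i \le 1$ and $\mathrm{Var}[Y_i] < \infty$, then for $Y = \sum_{i=1}^{n} Y_i$ and $t > 0$:
    $\Pr[Y \ge E[Y] + nt] \le \exp(-2nt^2)$
  • For a given $S_p$, let $Y_i = 1$ if $r_i \in S_p$, and $0$
    otherwise
  • Then $Y = X_p$
  • $S_p$ is interesting $\Leftarrow \Pr(X_p \ge n(S_p)) \le \exp(-2nt^2) \le \tau$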
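
The step from the bound to a threshold on the point count can be written out explicitly. The derivation below is a sketch that writes τ for the user-specified significance level and sets t to the normalized excess of the observed count over its expectation, which is an assumption about how the bound is applied.

```latex
\begin{align*}
t &= \frac{n(S_p) - E[X_p]}{n} > 0
  \;\Rightarrow\;
  \Pr\bigl(X_p \ge n(S_p)\bigr)
  = \Pr\bigl(X_p \ge E[X_p] + nt\bigr)
  \le e^{-2nt^2}, \\
e^{-2nt^2} \le \tau
  &\iff t \ge \sqrt{\frac{\ln(1/\tau)}{2n}}
  \iff \frac{n(S_p)}{n} \ge \frac{E[X_p]}{n} + \sqrt{\frac{-\ln \tau}{2n}}.
\end{align*}
```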

7
Interestingness Measure
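  • $S_p$ is interesting $\Leftarrow \dfrac{n(S_p)}{n} \ge \dfrac{E[X_p]}{n} + \sqrt{\dfrac{-\ln(\tau)}{2n}}$
  • If each $A_i$ is uniformly distributed and $S_p$ is
    constrained to a single interval in each of $p$ dimensions, then
    $S_p$ is interesting $\Leftarrow \dfrac{n(S_p)}{n} \ge \dfrac{1}{\xi^p} + \sqrt{\dfrac{-\ln(\tau)}{2n}}$
  • This threshold is nonlinear and non-increasing in $p$
  • For comparison:
      • CLIQUE: $n(S_p)/n \ge s$
      • MAFIA: $n(S_p) \ge \alpha E[X_p]$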
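
To see how the threshold behaves as p grows, here is a small sketch under the uniform-distribution assumption. It writes xi for the number of intervals per dimension and tau for the significance level; the function names and the sample parameter values are hypothetical, chosen only for illustration.

```python
import math

def schism_threshold(p, n, xi, tau):
    """Minimum fraction n(S_p)/n for S_p to be flagged as interesting,
    assuming uniform dimensions and one interval per constrained dimension."""
    return 1.0 / xi ** p + math.sqrt(-math.log(tau) / (2.0 * n))

def clique_threshold(p, s):
    """CLIQUE-style support threshold: the same constant s for every p."""
    return s

# Illustrative parameter values (not from the slides).
n, xi, tau = 10_000, 10, 0.01
for p in range(1, 6):
    print(p, schism_threshold(p, n, xi, tau), clique_threshold(p, 0.05))
```

The first term shrinks geometrically with p while the square-root term does not depend on p, so the threshold levels off at a floor set by n and tau instead of staying constant at every dimensionality.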
8
SCHISM
  • SCHISM(DB, $\xi$, $\tau$):
      • DDB = Discretize(DB, $\xi$)
      • VDB = HorizontalToVertical(DDB)
      • MIS = MineSubspaces(VDB, $\xi$, $\tau$)
      • AssignPoints(DB, MIS)
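
The first two steps of the pipeline might look like the following sketch. It assumes equi-width discretization into xi intervals per dimension and a vertical layout that maps each (dimension, interval) pair to the set of row ids falling in it; the slides do not specify either detail, so both are assumptions, as are the snake_case function names.

```python
from collections import defaultdict
import numpy as np

def discretize(db, xi):
    """Map each value to an equi-width interval index in [0, xi) per dimension."""
    lo = db.min(axis=0)
    span = db.max(axis=0) - lo
    span[span == 0] = 1.0  # constant columns all fall in interval 0
    idx = np.floor((db - lo) / span * xi).astype(int)
    return np.minimum(idx, xi - 1)  # the column maximum belongs to the last interval

def horizontal_to_vertical(ddb):
    """Return {(dimension, interval): set of row ids}, one id set per interval."""
    vdb = defaultdict(set)
    for row_id, row in enumerate(ddb):
        for dim, interval in enumerate(row):
            vdb[(dim, int(interval))].add(row_id)
    return dict(vdb)

db = np.array([[0.0, 10.0],
               [0.5, 20.0],
               [1.0, 30.0]])
ddb = discretize(db, 2)
vdb = horizontal_to_vertical(ddb)
print(ddb.tolist())
print(vdb)
```

With the vertical layout, the point count of a candidate subspace is the size of the intersection of the id sets of its intervals, which avoids rescanning the database.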

9
MineSubspaces - Example
Depth-first search
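
A depth-first enumeration over a vertical database could be sketched as follows. This is not the authors' MineSubspaces; it is a simplified illustration that assumes the input is a dictionary {(dimension, interval): set of row ids}, uses the uniform-case threshold 1/xi^p + sqrt(-ln(tau)/(2n)), and stops extending a branch as soon as a candidate fails that threshold.

```python
import math

def mine_subspaces(vdb, n, xi, tau):
    """Depth-first enumeration of interesting subspaces (simplified sketch)."""
    items = sorted(vdb)  # (dimension, interval) pairs in ascending order
    floor = math.sqrt(-math.log(tau) / (2.0 * n))
    found = []

    def is_interesting(count, p):
        return count / n >= 1.0 / xi ** p + floor

    def extend(prefix, tids, start):
        for k in range(start, len(items)):
            dim, _ = items[k]
            if prefix and dim <= prefix[-1][0]:
                continue  # at most one interval per dimension
            new_tids = tids & vdb[items[k]] if prefix else vdb[items[k]]
            candidate = prefix + [items[k]]
            if is_interesting(len(new_tids), len(candidate)):
                found.append((tuple(candidate), frozenset(new_tids)))
                extend(candidate, new_tids, k + 1)  # go deeper before moving on

    extend([], set(), 0)
    return found

# Toy vertical database: 5 points, 2 dimensions, 2 intervals each.
vdb = {(0, 0): {0, 1, 2, 3}, (0, 1): {4},
       (1, 0): {0, 1, 2, 3}, (1, 1): {4}}
for subspace, tids in mine_subspaces(vdb, n=5, xi=2, tau=0.5):
    print(subspace, sorted(tids))
```

Each recursive call intersects id sets, so the count of a (p+1)-subspace is obtained from its p-dimensional parent without touching the original data.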
10
Assigning points to subspaces
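  • MergeSubspaces(MIS, $S_p$):
      • If $\max_j \mathrm{Sim}(S_p, MIS_j) >$ ThresholdSim:
        $MIS = (MIS \cup (MIS_j \cup S_p)) \setminus MIS_j$
  • AssignPoints(DB, MIS):
      • For each point $r_i \in DB$:
          • if $\max_j \mathrm{Sim}(r_i, MIS_j) >$ ThresholdSim,
            assign $r_i$ to $MIS_{\arg\max_j \mathrm{Sim}(r_i, MIS_j)}$
          • else $r_i$ is an outlier
  • $\mathrm{Sim}(A, B) = \sum_{i=1}^{d} \dfrac{|A_i \cap B_i|}{|A_i \cup B_i|}$, which is compared against ThresholdSim

Allows merging of adjacent intervals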
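
The similarity and assignment steps can be sketched as below, reading Sim as a sum of per-dimension Jaccard ratios. The sketch assumes a subspace is a list with one set of interval indices per dimension (an unconstrained dimension holds every interval) and that a point is the list of singleton sets of its own intervals; these representations and the threshold values are assumptions for illustration.

```python
def sim(a, b):
    """Sum over dimensions of |A_i & B_i| / |A_i | B_i|."""
    total = 0.0
    for a_i, b_i in zip(a, b):
        union = a_i | b_i
        if union:
            total += len(a_i & b_i) / len(union)
    return total

def assign_points(ddb, mis, threshold_sim):
    """Label each discretized point with the index of its most similar
    subspace, or None when no similarity exceeds threshold_sim (outlier)."""
    labels = []
    for row in ddb:
        point = [{interval} for interval in row]
        scores = [sim(point, s) for s in mis]
        best = max(range(len(mis)), key=scores.__getitem__) if mis else None
        is_member = best is not None and scores[best] > threshold_sim
        labels.append(best if is_member else None)
    return labels

# Two subspaces over 2 dimensions with 2 intervals each.
mis = [[{0}, {0, 1}],   # constrained in dimension 0 only
       [{1}, {1}]]      # constrained in both dimensions
ddb = [[0, 1], [1, 1], [0, 0]]
print(assign_points(ddb, mis, threshold_sim=1.2))
print(assign_points(ddb, mis, threshold_sim=1.8))
```

Raising the threshold turns weaker matches into outliers, which is the trade-off the coverage metric is meant to expose.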
11
Evaluation Metrics
  • Clustering $C$ is evaluated using:
      • Entropy: $E(C) = -\sum_{C_j} \left( \dfrac{n_j}{n} \sum_i p_{ij} \log(p_{ij}) \right)$, where $p_{ij} = n_{ij}/n_j$ and
        $n_{ij}$ is the number of points in the $j$th cluster $C_j$
        of $C$ that actually belong to subspace $i$. Ideally,
        $E(C) = 0$.
      • Coverage: the fraction of DB correctly labeled as
        not being outliers. Ideally, $\mathrm{Coverage}(C) = 1.0$.
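
Both metrics are straightforward to compute from ground-truth and predicted labels. The sketch below takes p_ij = n_ij / n_j (the within-cluster proportion, so that a pure cluster contributes zero entropy), marks outliers with None, and reads coverage as the share of all points that truly belong to a subspace and were assigned to some cluster; these conventions are assumptions rather than details given on the slide.

```python
import math
from collections import Counter

def clustering_entropy(true_labels, cluster_labels):
    """E(C) = -sum_j (n_j / n) * sum_i p_ij * log(p_ij), with p_ij = n_ij / n_j."""
    n = len(true_labels)
    clusters = {}
    for truth, cluster in zip(true_labels, cluster_labels):
        if cluster is not None:  # outliers belong to no cluster
            clusters.setdefault(cluster, Counter())[truth] += 1
    entropy = 0.0
    for counts in clusters.values():
        n_j = sum(counts.values())
        inner = sum((c / n_j) * math.log(c / n_j) for c in counts.values())
        entropy -= (n_j / n) * inner
    return entropy

def coverage(true_labels, cluster_labels):
    """Fraction of points that belong to a true subspace and were not
    labeled as outliers (assumed reading of the definition)."""
    hits = sum(1 for t, c in zip(true_labels, cluster_labels)
               if t is not None and c is not None)
    return hits / len(true_labels)

truth = ["a", "a", "b", "b"]
print(clustering_entropy(truth, [0, 0, 1, 1]), coverage(truth, [0, 0, 1, 1]))
print(clustering_entropy(truth, [0, 0, 0, 0]), coverage(truth, [0, None, 1, 1]))
```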

12
Experiments (Synthetic datasets)
[Figure panels: Performance v/s n; Performance v/s d; Performance v/s p; Performance v/s Sp's dataset coverage]
13
Experiments
[Figure panels: Effect of overlapping among subspace dimensions; Effect of constraining subspaces to share points; Performance on embedded hyper-rectangular subspaces]
14
Experiments
[Figure panels: Effect of k; SCHISM v/s CLIQUE]
15
Experiments (Real Datasets)
  • Pendigits DB has 7494 16-dim vectors
  • Microarray DB has 2884 17-dim vectors
16
Related Work
  • Density-based: run Apriori on interesting subspaces
      • CLIQUE uses a constant support threshold $s$
      • ENCLUS uses a constant entropy threshold
      • MAFIA merges adjacent intervals having similar
        distributions; thresh $= \alpha E[X_p]$
      • SUBCLU uses core objects
  • Projection-based: PROCLUS, ORCLUS, DOC

17
Contributions/Future Work
  • Provides absolute guarantees of interestingness
  • Easier to set parameters
  • Application to finding similarities in datasets
  • Improve association rule mining
  • Find subspaces using non-linear dimensionality
    reduction techniques