Title: SCHISM: A new approach for interesting subspace mining
1. SCHISM: A new approach for interesting subspace mining
- Karlton Sequeira, Mohammed Zaki
- Computer Science Department
- RPI, Troy, New York 12180
2. Motivation
- Many clustering algorithms fail on high-dimensional datasets due to
  - sparsity, which causes many distance functions to provide little contrast between the nearest and farthest neighbors (Beyer 99)
  - irrelevant dimensions
- Feature selection searches only a single subspace for clusters
- KLT/PCA may be unsuitable (Agrawal 98) because
  - they assume the data is concentrated along mutually orthogonal linear directions
  - their output is hard to interpret
- Data may lie in different, and sometimes overlapping, lower-dimensional subspaces
3. Preliminaries
- Let U = A1 × A2 × ... × Ad be the high-dimensional space
- DB = {r_i | i ∈ [1, n], r_i ∈ U}
- A subspace is a grid-aligned hyper-rectangle
  [l1, h1] × [l2, h2] × ... × [ld, hd] ⊆ U, with [li, hi] ⊆ Ai
- A dimension i is constrained if [li, hi] ⊂ Ai
- S_p = a subspace constrained in p dimensions, i.e., a p-subspace
- n(S_p) = the number of points in S_p (a minimal sketch follows below)
- S_p is interesting if n(S_p) is statistically significantly higher than expected, assuming the dimensions are independently distributed
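To make these definitions concrete, here is a minimal sketch (the representation and all names are mine, not from the slides): a p-subspace can be stored as a map from its constrained dimensions to their [l_i, h_i] bounds, and n(S_p) is simply the number of points falling inside all of them.

```python
def n_points_in_subspace(db, subspace):
    """Count n(S_p): points whose coordinates fall inside every
    constrained dimension of the subspace.

    db       : list of d-dimensional points (tuples of floats)
    subspace : dict mapping dimension index -> (l_i, h_i) bounds;
               dimensions absent from the dict are unconstrained.
    """
    return sum(
        1 for r in db
        if all(lo <= r[i] <= hi for i, (lo, hi) in subspace.items())
    )

# Example: a 2-subspace of a 3-dimensional space (dimensions 0 and 2 constrained).
db = [(0.1, 0.5, 0.9), (0.2, 0.8, 0.95), (0.7, 0.4, 0.1)]
s2 = {0: (0.0, 0.3), 2: (0.8, 1.0)}
print(n_points_in_subspace(db, s2))   # -> 2
```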
4. Drawbacks of Apriori-based methods
- Apriori-based methods
  - discretize Ai into ξ intervals
  - ∀p, n(S_p) > thresh ⇒ S_p is interesting, regardless of S_p's volume!
- So thresh must be set very low to find interesting subspaces, especially high-dimensional ones (see the worked example below)
(Figure: threshold Thresh(p) as a function of p for SCHISM and CLIQUE)
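A small worked calculation of this volume effect (the numbers ξ = 10, p = 5, and s = 0.01 are illustrative assumptions, not taken from the slides):

```latex
% Expected fraction of points in one cell of a 5-subspace, under uniform,
% independent dimensions with \xi = 10 intervals per dimension:
\frac{E[X_p]}{n} \;=\; \xi^{-p} \;=\; 10^{-5}.
% A constant support threshold s = 10^{-2} therefore reports this cell only
% if it is 10^{3} times denser than expected, while lowering s toward 10^{-5}
% also admits barely over-dense low-dimensional cells.
```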
5. Interestingness Measure
- Let X_p be the random variable for n(S_p); τ << 1 is user-specified
- Pr(X_p ≥ n(S_p)) ≤ τ ⇒ S_p is an interesting subspace
(Figure: distribution of X_p, with mean E[X_p], cutoff E[X_p] + nt, and tail mass τ)
6. Chernoff-Hoeffding bound
- If Y_i, i = 1..n, are independently distributed r.v. with 0 ≤ Y_i ≤ 1, Var[Y_i] < ∞, Y = Σ_{i=1..n} Y_i, and t > 0, then
  Pr(Y ≥ E[Y] + nt) ≤ exp(-2nt²)
- For a given S_p, let Y_i = 1 if r_i ∈ S_p, and 0 otherwise
- Then Y = X_p
- S_p is interesting ⇐ Pr(X_p ≥ n(S_p)) ≤ exp(-2nt²) ≤ τ, taking n(S_p) ≥ E[X_p] + nt (derivation below)
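Filling in the short algebra step between this bound and the closed-form test on the next slide:

```latex
% Choose t so that the bound itself is at most \tau:
\exp(-2nt^{2}) \le \tau
\;\Longleftrightarrow\;
t \ge \sqrt{\frac{-\ln\tau}{2n}} .
% With t = \sqrt{-\ln\tau/(2n)}, the sufficient condition becomes
n(S_p) \;\ge\; E[X_p] + n\sqrt{\frac{-\ln\tau}{2n}}
\;\Longrightarrow\;
\Pr\big(X_p \ge n(S_p)\big) \le \tau .
```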
7. Interestingness Measure
- If each Ai is uniformly distributed and S_p is constrained to a single interval in each of its p dimensions, then
  S_p is interesting if n(S_p)/n ≥ E[X_p]/n + √(-ln(τ)/(2n)) = 1/ξ^p + √(-ln(τ)/(2n))
- Unlike CLIQUE (n(S_p)/n ≥ s) and MAFIA (n(S_p) ≥ α·E[X_p]), this threshold is nonlinear and strictly non-increasing in p (compared numerically below)
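A minimal Python sketch comparing the three density thresholds on this slide as a function of p (function names and parameter values are mine and purely illustrative, not from the paper's experiments):

```python
import math

def schism_min_fraction(p, xi, n, tau):
    """SCHISM: minimum fraction of points a p-subspace must hold to be
    interesting, assuming uniform, independent dimensions."""
    return xi ** (-p) + math.sqrt(-math.log(tau) / (2 * n))

def clique_min_fraction(p, s):
    """CLIQUE: constant support threshold, independent of p."""
    return s

def mafia_min_fraction(p, xi, alpha):
    """MAFIA: alpha times the expected fraction xi^(-p)."""
    return alpha * xi ** (-p)

if __name__ == "__main__":
    # Illustrative parameters.
    n, xi, tau, s, alpha = 10_000, 10, 0.001, 0.01, 1.5
    for p in range(1, 8):
        print(p,
              round(schism_min_fraction(p, xi, n, tau), 5),
              clique_min_fraction(p, s),
              round(mafia_min_fraction(p, xi, alpha), 7))
```

With these numbers the SCHISM threshold decays toward the constant √(-ln(τ)/(2n)) term as p grows, whereas CLIQUE's s stays fixed and MAFIA's α·ξ^(-p) vanishes.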
8. SCHISM
- SCHISM(DB, ξ, τ)
  - DDB = Discretize(DB, ξ)
  - VDB = HorizontalToVertical(DDB)
  - MIS = MineSubspaces(VDB, ξ, τ)
  - AssignPoints(DB, MIS)
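A minimal Python skeleton of the first two steps on this slide, assuming equal-width discretization and a tidset-based vertical format (the equal-width choice and the helper names other than those on the slide are my assumptions):

```python
from collections import defaultdict

def discretize(db, xi):
    """Map each value to one of xi equal-width intervals per dimension."""
    d = len(db[0])
    lo = [min(r[i] for r in db) for i in range(d)]
    hi = [max(r[i] for r in db) for i in range(d)]
    def bucket(v, i):
        if hi[i] == lo[i]:
            return 0
        return min(int((v - lo[i]) / (hi[i] - lo[i]) * xi), xi - 1)
    return [tuple(bucket(r[i], i) for i in range(d)) for r in db]

def horizontal_to_vertical(ddb):
    """Vertical format: (dimension, interval) -> set of point ids (tidset)."""
    vdb = defaultdict(set)
    for rid, row in enumerate(ddb):
        for dim, interval in enumerate(row):
            vdb[(dim, interval)].add(rid)
    return vdb
```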
9. MineSubspaces - Example
(Figure: depth-first search example)
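A heavily simplified sketch of what a depth-first MineSubspaces pass could look like over the vertical database built above (this is my reconstruction, not the paper's algorithm; in particular, because the interestingness threshold decreases with p, the sketch prunes only empty candidates, while the actual algorithm uses additional pruning):

```python
import math

def mine_subspaces(vdb, n, xi, tau):
    """Depth-first enumeration of interesting subspaces (simplified sketch).

    vdb maps (dimension, interval) -> tidset, as built by
    horizontal_to_vertical above.  A candidate is reported when its
    support clears the depth-dependent SCHISM threshold; children extend
    it with items from strictly later dimensions, so each subspace is
    generated exactly once.
    """
    items = sorted(vdb)                      # (dimension, interval) pairs
    results = []

    def threshold(p):
        return xi ** (-p) + math.sqrt(-math.log(tau) / (2 * n))

    def dfs(prefix, tids, start):
        p = len(prefix)
        if p > 0 and len(tids) / n >= threshold(p):
            results.append((tuple(prefix), frozenset(tids)))
        for j in range(start, len(items)):
            dim, interval = items[j]
            if prefix and dim <= prefix[-1][0]:
                continue                     # at most one interval per dimension
            new_tids = tids & vdb[(dim, interval)]
            if new_tids:
                dfs(prefix + [(dim, interval)], new_tids, j + 1)

    dfs([], set(range(n)), 0)
    return results
```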
10. Assigning points to subspaces
- MergeSubspaces(MIS, S_p)   (allows merging of adjacent intervals)
  - If max_j Sim(S_p, MIS_j) > ThresholdSim
    - MIS = (MIS ∪ (MIS_j ∪ S_p)) \ MIS_j
- AssignPoints(DB, MIS)
  - For each point r_i ∈ DB
    - if max_j Sim(r_i, MIS_j) > ThresholdSim
      - assign r_i to MIS_{argmax_j Sim(r_i, MIS_j)}
    - else r_i is an outlier
- Sim(A, B) = Σ_{i=1..d} |A_i ∩ B_i| / |A_i ∪ B_i|
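A minimal sketch of Sim and AssignPoints under one concrete representation (a subspace as a dict mapping each constrained dimension to its set of discretized intervals; treating unconstrained dimensions as contributing 0 is my simplification, and all names are mine):

```python
def sim(a, b, d):
    """Similarity from the slide: sum over dimensions of
    |A_i ∩ B_i| / |A_i ∪ B_i|, where A_i, B_i are sets of discretized
    intervals in dimension i (empty set = unconstrained, contributes 0)."""
    total = 0.0
    for i in range(d):
        ai, bi = a.get(i, set()), b.get(i, set())
        union = ai | bi
        if union:
            total += len(ai & bi) / len(union)
    return total

def assign_points(ddb, mis, d, threshold_sim):
    """Assign each discretized point to the most similar mined subspace,
    or label it an outlier (-1) if no subspace is similar enough."""
    labels = []
    for row in ddb:
        point = {i: {row[i]} for i in range(d)}   # point as a degenerate subspace
        scores = [sim(point, s, d) for s in mis]
        best = max(range(len(mis)), key=lambda j: scores[j]) if mis else None
        labels.append(best if best is not None and scores[best] > threshold_sim else -1)
    return labels
```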
11. Evaluation Metrics
- Clustering C is evaluated using
  - Entropy: E(C) = -Σ_{C_j} (n_j/n) Σ_i p_ij log(p_ij), with p_ij = n_ij/n_j, where n_ij is the number of points in the jth cluster C_j of C actually belonging to subspace i. Ideally, E(C) = 0
  - Coverage: the fraction of DB correctly labeled as not being outliers. Ideally, Coverage(C) = 1.0
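A minimal sketch of these two metrics (the data-structure choices and the reading of "correctly labeled" are my assumptions):

```python
import math

def entropy(clusters, n):
    """E(C) = -sum_j (n_j/n) * sum_i p_ij * log(p_ij), with p_ij = n_ij/n_j.
    `clusters` maps cluster id -> list of true subspace labels of its points
    (outliers excluded); `n` is the total number of points in DB."""
    e = 0.0
    for members in clusters.values():
        nj = len(members)
        counts = {}
        for label in members:
            counts[label] = counts.get(label, 0) + 1
        for nij in counts.values():
            pij = nij / nj
            e -= (nj / n) * pij * math.log(pij)
    return e

def coverage(labels, true_outlier_flags):
    """Fraction of DB correctly labeled as not being outliers: points that
    truly belong to some subspace and were assigned to a cluster."""
    correct = sum(1 for lab, is_out in zip(labels, true_outlier_flags)
                  if lab != -1 and not is_out)
    return correct / len(labels)
```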
12. Experiments (Synthetic datasets)
(Figures: performance vs. n, vs. d, vs. p, and vs. the dataset coverage of the embedded subspaces S_p)
13. Experiments
(Figures: effect of overlap among subspace dimensions, effect of constraining subspaces to share points, and performance on embedded hyper-rectangular subspaces)
14. Experiments
(Figures: effect of k, effect of other parameters, and SCHISM vs. CLIQUE)
15. Experiments (Real Datasets)
- The Pendigits DB has 7494 16-dimensional vectors
- The Microarray DB has 2884 17-dimensional vectors
16. Related Work
- Density-based
  - Run Apriori on interesting subspaces
  - CLIQUE uses a support threshold, i.e., a constant s
  - ENCLUS uses an entropy threshold, i.e., a constant
  - MAFIA merges adjacent intervals having similar distributions, thresh = α·E[X_p]
  - SUBCLU uses core objects
- Projection-based: PROCLUS, ORCLUS, DOC
17. Contributions / Future Work
- Provides absolute guarantees of interestingness
- Easier to set parameters
- Application to finding similarities in datasets
- Improve association rule mining
- Find subspaces using non-linear dimensionality reduction techniques