Title: SCHISM: A new approach for interesting subspace mining
1. SCHISM: A new approach for interesting subspace mining
- Karlton Sequeira, Mohammed Zaki
- Computer Science Department
- RPI, Troy, New York 12180
2. Motivation
- Many clustering algorithms fail on high-dimensional datasets due to
  - sparsity, which causes many distance functions to provide little contrast between the nearest and farthest neighbors (Beyer 99)
  - irrelevant dimensions
- Feature selection searches only a single subspace for clusters
- KLT/PCA may be unsuitable (Agrawal 98) because
  - they assume the data is concentrated along mutually orthogonal linear directions
  - their output is hard to interpret
- Data may lie in different, and sometimes overlapping, lower-dimensional subspaces
3. Preliminaries
- Let U = A1 × A2 × ... × Ad be the high-dimensional space
- DB = {r_i | i ∈ [1, n], r_i ∈ U}
- A subspace is a grid-aligned hyper-rectangle
  [l1, h1] × [l2, h2] × ... × [ld, hd] ⊆ U, with [li, hi] ⊆ Ai
- A dimension i is constrained if [li, hi] ⊂ Ai
- S_p = a subspace constrained in p dimensions, i.e., a p-subspace
- n(S_p) = the number of points in S_p (a minimal sketch follows below)
- S_p is interesting if n(S_p) is statistically significantly higher than expected, assuming the dimensions are independently distributed
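To make these definitions concrete, here is a minimal sketch (the representation and all names are mine, not from the slides): a p-subspace can be stored as a map from its constrained dimensions to their [l_i, h_i] bounds, and n(S_p) is simply the number of points falling inside all of them.

```python
def n_points_in_subspace(db, subspace):
    """Count n(S_p): points whose coordinates fall inside every
    constrained dimension of the subspace.

    db       : list of d-dimensional points (tuples of floats)
    subspace : dict mapping dimension index -> (l_i, h_i) bounds;
               dimensions absent from the dict are unconstrained.
    """
    return sum(
        1 for r in db
        if all(lo <= r[i] <= hi for i, (lo, hi) in subspace.items())
    )

# Example: a 2-subspace of a 3-dimensional space (dimensions 0 and 2 constrained).
db = [(0.1, 0.5, 0.9), (0.2, 0.8, 0.95), (0.7, 0.4, 0.1)]
s2 = {0: (0.0, 0.3), 2: (0.8, 1.0)}
print(n_points_in_subspace(db, s2))   # -> 2
```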
4. Drawbacks of Apriori-based methods
- Apriori-based methods
  - discretize Ai into ξ intervals
  - ∀p, n(S_p) > thresh ⇒ S_p is interesting, regardless of S_p's volume!
- So thresh must be set very low to find interesting subspaces, especially high-dimensional ones (see the worked example below)
(Figure: threshold Thresh(p) as a function of p for SCHISM and CLIQUE)
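A small worked calculation of this volume effect (the numbers ξ = 10, p = 5, and s = 0.01 are illustrative assumptions, not taken from the slides):

```latex
% Expected fraction of points in one cell of a 5-subspace, under uniform,
% independent dimensions with \xi = 10 intervals per dimension:
\frac{E[X_p]}{n} \;=\; \xi^{-p} \;=\; 10^{-5}.
% A constant support threshold s = 10^{-2} therefore reports this cell only
% if it is 10^{3} times denser than expected, while lowering s toward 10^{-5}
% also admits barely over-dense low-dimensional cells.
```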
5. Interestingness Measure
- Let X_p be the random variable for n(S_p); τ << 1 is user-specified
- Pr(X_p ≥ n(S_p)) ≤ τ ⇒ S_p is an interesting subspace
(Figure: distribution of X_p, with mean E[X_p], cutoff E[X_p] + nt, and tail mass τ)
6. Chernoff-Hoeffding bound
- If Y_i, i = 1..n, are independently distributed r.v. with 0 ≤ Y_i ≤ 1, Var[Y_i] < ∞, Y = Σ_{i=1..n} Y_i, and t > 0, then
  Pr(Y ≥ E[Y] + nt) ≤ exp(-2nt²)
- For a given S_p, let Y_i = 1 if r_i ∈ S_p, and 0 otherwise
- Then Y = X_p
- S_p is interesting ⇐ Pr(X_p ≥ n(S_p)) ≤ exp(-2nt²) ≤ τ, taking n(S_p) ≥ E[X_p] + nt (derivation below)
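Filling in the short algebra step between this bound and the closed-form test on the next slide:

```latex
% Choose t so that the bound itself is at most \tau:
\exp(-2nt^{2}) \le \tau
\;\Longleftrightarrow\;
t \ge \sqrt{\frac{-\ln\tau}{2n}} .
% With t = \sqrt{-\ln\tau/(2n)}, the sufficient condition becomes
n(S_p) \;\ge\; E[X_p] + n\sqrt{\frac{-\ln\tau}{2n}}
\;\Longrightarrow\;
\Pr\big(X_p \ge n(S_p)\big) \le \tau .
```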
7. Interestingness Measure
- If each Ai is uniformly distributed and S_p is constrained to a single interval in each of its p dimensions, then
  S_p is interesting if n(S_p)/n ≥ E[X_p]/n + √(-ln(τ)/(2n)) = 1/ξ^p + √(-ln(τ)/(2n))
- Unlike CLIQUE (n(S_p)/n ≥ s) and MAFIA (n(S_p) ≥ α·E[X_p]), this threshold is nonlinear and strictly non-increasing in p (compared numerically below)
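A minimal Python sketch comparing the three density thresholds on this slide as a function of p (function names and parameter values are mine and purely illustrative, not from the paper's experiments):

```python
import math

def schism_min_fraction(p, xi, n, tau):
    """SCHISM: minimum fraction of points a p-subspace must hold to be
    interesting, assuming uniform, independent dimensions."""
    return xi ** (-p) + math.sqrt(-math.log(tau) / (2 * n))

def clique_min_fraction(p, s):
    """CLIQUE: constant support threshold, independent of p."""
    return s

def mafia_min_fraction(p, xi, alpha):
    """MAFIA: alpha times the expected fraction xi^(-p)."""
    return alpha * xi ** (-p)

if __name__ == "__main__":
    # Illustrative parameters.
    n, xi, tau, s, alpha = 10_000, 10, 0.001, 0.01, 1.5
    for p in range(1, 8):
        print(p,
              round(schism_min_fraction(p, xi, n, tau), 5),
              clique_min_fraction(p, s),
              round(mafia_min_fraction(p, xi, alpha), 7))
```

With these numbers the SCHISM threshold decays toward the constant √(-ln(τ)/(2n)) term as p grows, whereas CLIQUE's s stays fixed and MAFIA's α·ξ^(-p) vanishes.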
8. SCHISM
- SCHISM(DB, ξ, τ)
  - DDB = Discretize(DB, ξ)
  - VDB = HorizontalToVertical(DDB)
  - MIS = MineSubspaces(VDB, ξ, τ)
  - AssignPoints(DB, MIS)
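A minimal Python skeleton of the first two steps on this slide, assuming equal-width discretization and a tidset-based vertical format (the equal-width choice and the helper names other than those on the slide are my assumptions):

```python
from collections import defaultdict

def discretize(db, xi):
    """Map each value to one of xi equal-width intervals per dimension."""
    d = len(db[0])
    lo = [min(r[i] for r in db) for i in range(d)]
    hi = [max(r[i] for r in db) for i in range(d)]
    def bucket(v, i):
        if hi[i] == lo[i]:
            return 0
        return min(int((v - lo[i]) / (hi[i] - lo[i]) * xi), xi - 1)
    return [tuple(bucket(r[i], i) for i in range(d)) for r in db]

def horizontal_to_vertical(ddb):
    """Vertical format: (dimension, interval) -> set of point ids (tidset)."""
    vdb = defaultdict(set)
    for rid, row in enumerate(ddb):
        for dim, interval in enumerate(row):
            vdb[(dim, interval)].add(rid)
    return vdb
```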
9. MineSubspaces - Example
(Figure: depth-first search example)
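A heavily simplified sketch of what a depth-first MineSubspaces pass could look like over the vertical database built above (this is my reconstruction, not the paper's algorithm; in particular, because the interestingness threshold decreases with p, the sketch prunes only empty candidates, while the actual algorithm uses additional pruning):

```python
import math

def mine_subspaces(vdb, n, xi, tau):
    """Depth-first enumeration of interesting subspaces (simplified sketch).

    vdb maps (dimension, interval) -> tidset, as built by
    horizontal_to_vertical above.  A candidate is reported when its
    support clears the depth-dependent SCHISM threshold; children extend
    it with items from strictly later dimensions, so each subspace is
    generated exactly once.
    """
    items = sorted(vdb)                      # (dimension, interval) pairs
    results = []

    def threshold(p):
        return xi ** (-p) + math.sqrt(-math.log(tau) / (2 * n))

    def dfs(prefix, tids, start):
        p = len(prefix)
        if p > 0 and len(tids) / n >= threshold(p):
            results.append((tuple(prefix), frozenset(tids)))
        for j in range(start, len(items)):
            dim, interval = items[j]
            if prefix and dim <= prefix[-1][0]:
                continue                     # at most one interval per dimension
            new_tids = tids & vdb[(dim, interval)]
            if new_tids:
                dfs(prefix + [(dim, interval)], new_tids, j + 1)

    dfs([], set(range(n)), 0)
    return results
```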
10. Assigning points to subspaces
- MergeSubspaces(MIS, S_p)   (allows merging of adjacent intervals)
  - If max_j Sim(S_p, MIS_j) > ThresholdSim
    - MIS = (MIS ∪ (MIS_j ∪ S_p)) \ MIS_j
- AssignPoints(DB, MIS)
  - For each point r_i ∈ DB
    - if max_j Sim(r_i, MIS_j) > ThresholdSim
      - assign r_i to MIS_{argmax_j Sim(r_i, MIS_j)}
    - else r_i is an outlier
- Sim(A, B) = Σ_{i=1..d} |A_i ∩ B_i| / |A_i ∪ B_i|
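A minimal sketch of Sim and AssignPoints under one concrete representation (a subspace as a dict mapping each constrained dimension to its set of discretized intervals; treating unconstrained dimensions as contributing 0 is my simplification, and all names are mine):

```python
def sim(a, b, d):
    """Similarity from the slide: sum over dimensions of
    |A_i ∩ B_i| / |A_i ∪ B_i|, where A_i, B_i are sets of discretized
    intervals in dimension i (empty set = unconstrained, contributes 0)."""
    total = 0.0
    for i in range(d):
        ai, bi = a.get(i, set()), b.get(i, set())
        union = ai | bi
        if union:
            total += len(ai & bi) / len(union)
    return total

def assign_points(ddb, mis, d, threshold_sim):
    """Assign each discretized point to the most similar mined subspace,
    or label it an outlier (-1) if no subspace is similar enough."""
    labels = []
    for row in ddb:
        point = {i: {row[i]} for i in range(d)}   # point as a degenerate subspace
        scores = [sim(point, s, d) for s in mis]
        best = max(range(len(mis)), key=lambda j: scores[j]) if mis else None
        labels.append(best if best is not None and scores[best] > threshold_sim else -1)
    return labels
```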
11. Evaluation Metrics
- Clustering C is evaluated using
  - Entropy: E(C) = -Σ_{C_j} (n_j/n) Σ_i p_ij log(p_ij), with p_ij = n_ij/n_j, where n_ij is the number of points in the jth cluster C_j of C actually belonging to subspace i. Ideally, E(C) = 0
  - Coverage: the fraction of DB correctly labeled as not being outliers. Ideally, Coverage(C) = 1.0
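A minimal sketch of these two metrics (the data-structure choices and the reading of "correctly labeled" are my assumptions):

```python
import math

def entropy(clusters, n):
    """E(C) = -sum_j (n_j/n) * sum_i p_ij * log(p_ij), with p_ij = n_ij/n_j.
    `clusters` maps cluster id -> list of true subspace labels of its points
    (outliers excluded); `n` is the total number of points in DB."""
    e = 0.0
    for members in clusters.values():
        nj = len(members)
        counts = {}
        for label in members:
            counts[label] = counts.get(label, 0) + 1
        for nij in counts.values():
            pij = nij / nj
            e -= (nj / n) * pij * math.log(pij)
    return e

def coverage(labels, true_outlier_flags):
    """Fraction of DB correctly labeled as not being outliers: points that
    truly belong to some subspace and were assigned to a cluster."""
    correct = sum(1 for lab, is_out in zip(labels, true_outlier_flags)
                  if lab != -1 and not is_out)
    return correct / len(labels)
```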
12. Experiments (Synthetic datasets)
(Figures: performance vs. n, vs. d, vs. p, and vs. the dataset coverage of the embedded subspaces S_p)
13. Experiments
(Figures: effect of overlap among subspace dimensions, effect of constraining subspaces to share points, and performance on embedded hyper-rectangular subspaces)
14. Experiments
(Figures: effect of k, effect of other parameters, and SCHISM vs. CLIQUE)
15. Experiments (Real Datasets)
- The Pendigits DB has 7494 16-dimensional vectors
- The Microarray DB has 2884 17-dimensional vectors
16. Related Work
- Density-based
  - Run Apriori on interesting subspaces
  - CLIQUE uses a support threshold, i.e., a constant s
  - ENCLUS uses an entropy threshold, i.e., a constant
  - MAFIA merges adjacent intervals having similar distributions, thresh = α·E[X_p]
  - SUBCLU uses core objects
- Projection-based: PROCLUS, ORCLUS, DOC
17. Contributions / Future Work
- Provides absolute guarantees of interestingness
- Easier to set parameters
- Application to finding similarities in datasets
- Improve association rule mining
- Find subspaces using non-linear dimensionality reduction techniques