Exploring similarities in high-dimensional datasets


Provided by: csr22

1
Exploring similarities in high-dimensional
datasets

Karlton Sequeira
Computer Science Department
RPI, Troy, New York

2
Outline

Motivation
Related Work
Contributions

3
Motivation

Doctors sharing similar patient information
Identifying common segments across different
markets
Schema matching
Protein alignment
Dataset evolution

4
Challenges

Transformations: rotation, translation,
permutation, dilation
High dimensionality: sparsity, irrelevant dimensions
Heterogeneity: different schemas, nominal/continuous
variables
Computational complexity
Interpretability

5
Proposed Thesis

We propose to detect similarities across datasets
by
representing each dataset in a condensed form and identifying
similarities between these condensed
representations.

6
Preliminaries

Let $U = A_1 \times A_2 \times \dots \times A_d$ be the
high-dimensional space.
$DB = \{r_i \mid i \in [1, n],\ r_i \in U\}$.
A subspace is a grid-aligned hyper-rectangle
$[l_1, h_1] \times [l_2, h_2] \times \dots \times [l_d, h_d] \subseteq U$, $[l_i, h_i] \subseteq A_i$
A dimension is constrained if $[l_i, h_i] \subset A_i$
$S_p$: subspace constrained in $p$ dimensions, i.e.,
$p$-subspace
$n(S_p)$: number of points in $S_p$
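
A minimal sketch of these definitions, assuming the dataset is held as a NumPy array with one row per point. The helper name `count_points` and the dictionary encoding of a subspace (constrained dimension index mapped to its interval) are illustrative choices, not part of the slides.

```python
import numpy as np

def count_points(db, bounds):
    """Return n(S_p): the number of rows of db inside the hyper-rectangle.

    bounds maps a constrained dimension index i to its interval (l_i, h_i);
    dimensions absent from bounds are unconstrained (they span all of A_i).
    """
    inside = np.ones(len(db), dtype=bool)
    for dim, (low, high) in bounds.items():
        inside &= (db[:, dim] >= low) & (db[:, dim] <= high)
    return int(inside.sum())

# Toy DB: n = 5 points in a d = 3 dimensional space.
db = np.array([
    [0.10, 0.50, 0.9],
    [0.20, 0.40, 0.1],
    [0.80, 0.50, 0.5],
    [0.15, 0.90, 0.3],
    [0.25, 0.45, 0.7],
])

# A 2-subspace: dimensions 0 and 1 are constrained, dimension 2 is free.
s2 = {0: (0.0, 0.3), 1: (0.3, 0.6)}
n_s2 = count_points(db, s2)
```

Here `len(s2)` plays the role of p, and an empty dictionary describes the whole space U.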

7
Outline

Motivation
Related Work
Contributions

8
Condensed representation of dataset

Find components
Clustering, decision trees, PCA, etc.
Data lies in different, overlapping p-subspaces
($p \ll d$), i.e., subspace mining
Density-based (CLIQUE, ENCLUS, MAFIA, ProDenClu):
impose grid over d-space, find dense
projections, merge them using Apriori
Projected (PROCLUS, ORCLUS, DOC)
Find inter-component relationships
Information theory: Kullback-Leibler distance
Clustering comparison: Rand Index, Jaccard Index,
VI metric
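
As a rough illustration of the clustering-comparison measures named above, the sketch below computes the Rand and Jaccard indices by counting point pairs. It uses the standard pair-counting definitions; the function names are hypothetical and the quadratic loop is only suitable for small inputs.

```python
from itertools import combinations

def pair_counts(labels_a, labels_b):
    """Count point pairs by whether each clustering puts them together."""
    both = only_a = only_b = neither = 0
    for i, j in combinations(range(len(labels_a)), 2):
        same_a = labels_a[i] == labels_a[j]
        same_b = labels_b[i] == labels_b[j]
        if same_a and same_b:
            both += 1
        elif same_a:
            only_a += 1
        elif same_b:
            only_b += 1
        else:
            neither += 1
    return both, only_a, only_b, neither

def rand_index(labels_a, labels_b):
    """Fraction of pairs on which the two clusterings agree."""
    both, only_a, only_b, neither = pair_counts(labels_a, labels_b)
    return (both + neither) / (both + only_a + only_b + neither)

def jaccard_index(labels_a, labels_b):
    """Like the Rand index, but ignoring pairs separated by both clusterings."""
    both, only_a, only_b, _ = pair_counts(labels_a, labels_b)
    return both / (both + only_a + only_b)

a = [0, 0, 1, 1]
b = [0, 0, 1, 2]
```

Note that `jaccard_index` is undefined (division by zero) when neither clustering groups any pair together.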

9
Identifying similarities between condensed
representations

Dataset similarity: FOCUS, Li et al.,
Parthasarathy et al., etc.
Graph matching
Search: Ullmann, graph edit distance, maximal
common subgraph, entropy (consistent subgraphs)
Optimization: max p(matching | attributes), NN, GA,
tabu search, LP, SA, etc.
Flow-based: SimFlood, Blondel et al., Van Wyk et
al.

Assume identical schema, dataset access
Labeled/discrete attributed datasets
10
Outline

Motivation
Related Work
Contributions
Condensed Representation
Finding subspaces in a dataset
Finding relationships between subspaces
Similarities between Condensed Representations

11
Drawbacks of density-based subspace mining methods

$S_p$ is interesting $\iff n(S_p) > sn$,
regardless of $S_p$'s volume!
So the threshold $s$ is set very low to find
interesting subspaces, especially
high-dimensional ones

$S_p$ is interesting if $n(S_p)$ is statistically
significantly higher than expected, assuming the
dimensions are independently distributed
12
Interestingness Measure

For given $S_p$, let $X_p$ be the r.v. for $n(S_p)$; $\tau \ll 1$ is
user-specified
$\Pr(X_p \ge n(S_p)) \le \tau \Rightarrow S_p$ is an interesting
subspace
[Figure labels: $E[X_p]$, $E[X_p] + nt$, $\tau$]

13
Chernoff-Hoeffding bound

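If $Y_i$, $i = 1, \dots, n$, are independently distributed r.v.,
$0 \le Y_i \le 1$, $\mathrm{Var}[Y_i] < \infty$, $Y = \sum_{i=1}^{n} Y_i$, $t > 0$, then
$\Pr[Y \ge E[Y] + nt] \le \exp(-2nt^2)$
For a given $S_p$, let $Y_i = 1$ if $r_i \in S_p$,
$0$
otherwise
Then $Y = X_p$; set $n(S_p) = E[X_p] + nt$
$S_p$ is interesting $\Leftarrow \Pr(X_p \ge n(S_p)) \le \exp(-2nt^2) \le \tau$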
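
The test on this slide can be sketched as follows, assuming the expected count E[X_p] is already known. The names `hoeffding_tail_bound`, `is_interesting` and `tau` (for the user-specified tail threshold) are chosen here for illustration.

```python
import math

def hoeffding_tail_bound(n, t):
    """Upper bound exp(-2 n t^2) on Pr[Y >= E[Y] + n t]."""
    return math.exp(-2.0 * n * t * t)

def is_interesting(n_sp, expected, n, tau):
    """True if the bound on Pr(X_p >= n(S_p)) is at most tau."""
    t = (n_sp - expected) / n  # from n(S_p) = E[X_p] + n t
    if t <= 0:
        return False  # the count does not exceed its expectation
    return hoeffding_tail_bound(n, t) <= tau

n = 10_000
expected = 100.0  # E[X_p], assumed known for this example
tau = 1e-4        # user-specified, much smaller than 1

verdicts = [is_interesting(c, expected, n, tau) for c in (50, 300, 400)]
```

With these numbers a count of 300 is not enough (the bound is exp(-8), larger than `tau`), while a count of 400 passes (the bound is exp(-18)).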

14
Interestingness Measure

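$S_p$ is interesting $\Leftarrow \frac{n(S_p)}{n} \ge \frac{E[X_p]}{n} + \sqrt{\frac{-\ln(\tau)}{2n}}$
CLIQUE: $n(S_p) \ge ns$
MAFIA: $n(S_p) \ge \alpha E[X_p]$
E.g., if each $A_i$ is uniformly distributed over $\xi$ intervals and $S_p$ is
constrained to a
single interval in $p$ dimensions,
$S_p$ is interesting $\Leftarrow \frac{n(S_p)}{n} \ge \frac{1}{\xi^p} + \sqrt{\frac{-\ln(\tau)}{2n}}$
(threshold is nonlinear and non-increasing in $p$)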
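
To make the shape of this threshold concrete, here is a small sketch that evaluates it for increasing p under the uniform assumption, next to a fixed CLIQUE-style fraction and a MAFIA-style multiple of the expected fraction. The parameter names `xi` (intervals per dimension), `tau` (tail threshold), `s` and `alpha` are assumptions made for this example.

```python
import math

def schism_threshold(p, n, xi, tau):
    """Minimum fraction n(S_p)/n: expected fraction plus the Hoeffding slack."""
    return (1.0 / xi) ** p + math.sqrt(-math.log(tau) / (2.0 * n))

def clique_threshold(s):
    """A fixed fraction s, the same for every p."""
    return s

def mafia_threshold(p, xi, alpha):
    """alpha times the expected fraction (1/xi)^p under uniformity."""
    return alpha * (1.0 / xi) ** p

n, xi, tau = 10_000, 10, 1e-4
thresholds = [schism_threshold(p, n, xi, tau) for p in range(1, 6)]
slack = math.sqrt(-math.log(tau) / (2.0 * n))
```

The first threshold drops quickly with p and then flattens out at `slack`, so it never reaches zero the way a pure multiple of the expected fraction does.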
15
SCHISM

SCHISM(DB, $\xi$, $\tau$)
DDB = Discretise(DB, $\xi$)
VDB = HorizontalToVertical(DDB)
MIS = MineSubspaces(VDB, $\xi$, $\tau$)
AssignPoints(DB, MIS)
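
The first two steps might look like the sketch below. It assumes equal-width intervals per dimension and a vertical layout that maps each (dimension, interval) pair to the set of row ids falling in it; both are assumptions for illustration, and the mining step itself is not shown.

```python
import numpy as np
from collections import defaultdict

def discretise(db, xi):
    """Map each value to one of xi equal-width interval indices per dimension."""
    lo = db.min(axis=0)
    width = (db.max(axis=0) - lo) / xi
    width[width == 0] = 1.0  # constant column: everything falls in interval 0
    cells = np.floor((db - lo) / width).astype(int)
    return np.clip(cells, 0, xi - 1)  # the maximum value belongs to the last interval

def horizontal_to_vertical(ddb):
    """Map each (dimension, interval) pair to the set of row ids it contains."""
    vdb = defaultdict(set)
    for row_id, row in enumerate(ddb):
        for dim, interval in enumerate(row):
            vdb[(dim, int(interval))].add(row_id)
    return dict(vdb)

db = np.array([[0.0, 10.0], [0.4, 20.0], [0.9, 30.0], [1.0, 10.0]])
ddb = discretise(db, xi=2)
vdb = horizontal_to_vertical(ddb)

# Support of a 2-subspace = size of the intersection of its row-id sets.
support = len(vdb[(0, 0)] & vdb[(1, 0)])
```

In a vertical layout like this, the count for a candidate subspace is obtained by intersecting row-id sets rather than rescanning the data.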

16
Synthetic datasets

Multivariate Gaussian clusters
$\mu$ (constrained dim) $\sim U[0, 1000]$
$\sigma$ (constrained dim) $= 20$
$p = P(\text{dimension is constrained}) = 0.5$
$o = P(\text{dimension in adjacent clusters is constrained}) = 0.5$
Cluster size ratio $= \max_{x \in X} n(x) / \min_{x \in X} n(x)$, where $X$ is the set of embedded
subspaces
5% noise, i.e., multivariate $U[0, 1000]$
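
A simplified generator along these lines is sketched below. It covers only part of the recipe: constrained dimensions are Gaussian with a uniformly drawn mean and standard deviation 20, and noise is uniform. Several choices are assumptions of this sketch rather than details from the slides: unconstrained dimensions are filled uniformly, all clusters have equal size, overlap between adjacent clusters is not modelled, and the noise fraction is taken relative to the number of clustered points.

```python
import numpy as np

def make_cluster(rng, n_points, d, p_constrained=0.5, sigma=20.0):
    """One cluster: Gaussian in constrained dims, uniform elsewhere (assumed)."""
    constrained = rng.random(d) < p_constrained
    means = rng.uniform(0.0, 1000.0, size=d)
    points = rng.uniform(0.0, 1000.0, size=(n_points, d))
    gaussian = rng.normal(means, sigma, size=(n_points, d))
    points[:, constrained] = gaussian[:, constrained]
    return points, constrained

def make_dataset(seed, n_clusters, n_points, d, noise_fraction=0.05):
    """Stack equal-sized clusters plus uniform noise; label noise as -1."""
    rng = np.random.default_rng(seed)
    parts, labels = [], []
    for k in range(n_clusters):
        pts, _ = make_cluster(rng, n_points, d)
        parts.append(pts)
        labels += [k] * n_points
    n_noise = round(noise_fraction * n_clusters * n_points)
    parts.append(rng.uniform(0.0, 1000.0, size=(n_noise, d)))
    labels += [-1] * n_noise
    return np.vstack(parts), np.array(labels)

data, labels = make_dataset(seed=0, n_clusters=3, n_points=200, d=10)
```

Keeping the true labels alongside the data is what makes the entropy and coverage metrics computable later.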

17
MineSubspaces - Example
Depth-first search
18
Assigning points to subspaces

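MergeSubspaces(MIS, $S_p$)
  If $\max_j \rho(S_p, MIS_j) >$ ThresholdSim
    $MIS_{\arg\max_j \rho(S_p, MIS_j)} = MIS_{\arg\max_j \rho(S_p, MIS_j)} \cup S_p$
    (allows merging intervals in a dimension)
AssignPoints(DB, MIS)
  For each point $r_i \in DB$
    if $\max_j \rho(r_i, MIS_j) > \frac{d}{\xi} + \sqrt{\frac{-d \ln(\tau)}{2}}$
      $r_i \in MIS_{\arg\max_j \rho(r_i, MIS_j)}$
    else $r_i$ is an outlier
$\rho(A, B) = \sum_{i=1}^{d} \frac{|A_i \cap B_i|}{|A_i \cup B_i|}$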
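
The assignment step can be sketched as below, under one possible encoding: a subspace is a list of d sets of interval indices (an unconstrained dimension holds all intervals), and a point is the list of interval indices it falls into. The cut-off d/xi + sqrt(-d ln(tau) / 2) is a reading of the garbled formula on this slide, and the names `similarity`, `threshold_sim`, `merge`, `xi` and `tau` are assumptions of this sketch.

```python
import math

def similarity(a, b):
    """Sum over dimensions of |A_i & B_i| / |A_i | B_i|."""
    return sum(len(ai & bi) / len(ai | bi) for ai, bi in zip(a, b))

def threshold_sim(d, xi, tau):
    """Assumed cut-off: d/xi + sqrt(-d ln(tau) / 2)."""
    return d / xi + math.sqrt(-d * math.log(tau) / 2.0)

def merge(a, b):
    """Union per dimension, which may merge several intervals in one dimension."""
    return [ai | bi for ai, bi in zip(a, b)]

def assign_points(points, subspaces, xi, tau):
    """Per point: index of the most similar subspace, or None for an outlier."""
    cutoff = threshold_sim(len(points[0]), xi, tau)
    assignment = []
    for point in points:
        cell = [{v} for v in point]
        sims = [similarity(cell, s) for s in subspaces]
        best = max(range(len(subspaces)), key=sims.__getitem__)
        assignment.append(best if sims[best] > cutoff else None)
    return assignment

xi, tau = 4, 0.05
full = set(range(xi))
s0 = [{0} for _ in range(6)] + [full, full]  # constrained in 6 of 8 dims
s1 = [{3} for _ in range(6)] + [full, full]
points = [
    [0, 0, 0, 0, 0, 0, 1, 2],
    [3, 3, 3, 3, 3, 3, 0, 0],
    [0, 3, 1, 2, 0, 3, 1, 1],
]
assignment = assign_points(points, [s0, s1], xi, tau)
```

Because the cut-off grows with sqrt(d), a very small `tau` combined with few dimensions can push it above the maximum attainable similarity d, in which case every point is reported as an outlier.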
19
Evaluation Metrics

Clustering $C$ is evaluated using:
Running time (minutes)
Entropy: $E(C) = -\sum_{C_j} \left( \frac{n_j}{n} \sum_i p_{ij} \log p_{ij} \right)$,
where $p_{ij} = n_{ij}/n_j$ and $n_{ij}$ is the number of points in
the $j$th cluster $C_j$ of $C$ actually belonging to
subspace $i$.
Coverage: the fraction of DB correctly labeled as
not being outliers.
Ideally, $E(C) = 0$ and $\mathrm{Coverage}(C) = 1.0$
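
Both quality metrics can be computed from true and predicted labels, as in the sketch below. Two assumptions are made here: p_ij is normalised by the cluster size n_j so that a pure cluster contributes zero entropy, and coverage is measured over the points that truly belong to some embedded subspace. The label conventions (`-1` for true noise, `None` for predicted outliers) are also choices of this example.

```python
import math
from collections import Counter

def clustering_entropy(true_labels, cluster_labels):
    """E(C) = -sum_j (n_j / n) * sum_i p_ij log p_ij, with p_ij = n_ij / n_j."""
    n = len(true_labels)
    total = 0.0
    for cluster in set(cluster_labels):
        members = [t for t, c in zip(true_labels, cluster_labels) if c == cluster]
        n_j = len(members)
        h = -sum((n_ij / n_j) * math.log(n_ij / n_j)
                 for n_ij in Counter(members).values())
        total += (n_j / n) * h
    return total

def coverage(true_labels, cluster_labels, noise=-1, outlier=None):
    """Fraction of true non-noise points that were assigned to some cluster."""
    real = [c for t, c in zip(true_labels, cluster_labels) if t != noise]
    return sum(c != outlier for c in real) / len(real)

truth = [0, 0, 0, 1, 1, 1]
found = [0, 0, 0, 1, 1, 0]  # one point of subspace 1 lands in cluster 0
entropy = clustering_entropy(truth, found)
cov = coverage([0, 0, 1, -1], [0, None, 1, None])
```

A clustering that reproduces the true labels exactly gives an entropy of 0, matching the ideal stated on the slide.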

20
Experiments (Synthetic datasets)
[Figure panels: Performance vs. n; Performance vs. d; Performance vs. ?; Performance vs. p]
21
Experiments
[Figure panels: Effect of overlapping among subspace dimensions; Effect of constraining subspaces to share points; Effect of ?; Performance on embedded hyper-rectangular subspaces]
22
Experiments
[Figure panels: Effect of k; Effect of ?; Effect of ?; SCHISM vs. CLIQUE]
23
Experiments (Real Datasets)
Pendigits DB has 7494 16-dim vectors
Microarray DB has 2884 17-dim vectors
24
Outline