Title: Exploring similarities in high-dimensional datasets
1 Exploring similarities in high-dimensional datasets
- Karlton Sequeira
- Computer Science Department
- RPI, Troy, New York
2 Outline
- Motivation
- Related Work
- Contributions
3 Motivation
- Doctors sharing similar patient information
- Identifying common segments across different markets
- Schema matching
- Protein alignment
- Dataset evolution
4 Challenges
- Transformations: rotation, translation, permutation, dilation
- High dimensions: sparsity, irrelevant dimensions
- Heterogeneity: different schema, nominal/continuous variables
- Computational complexity
- Interpretability
5 Proposed Thesis
- We propose to detect similarities across datasets by
- representing them in condensed form, and
- identifying similarities between these condensed representations.
6 Preliminaries
- Let $U = A_1 \times A_2 \times \ldots \times A_d$ be the high-dimensional space.
- $DB = \{r_i \mid i \in [1, n],\ r_i \in U\}$.
- A subspace is a grid-aligned hyper-rectangle, $[l_1, h_1] \times [l_2, h_2] \times \ldots \times [l_d, h_d] \subseteq U$, $[l_i, h_i] \subseteq A_i$
- A dimension is constrained if $[l_i, h_i] \subset A_i$
- $S_p$: subspace constrained in $p$ dimensions, i.e., a $p$-subspace
- $n(S_p)$: number of points in $S_p$
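The definitions above can be made concrete with a minimal sketch. The encoding below (one closed `(low, high)` pair per dimension, `None` for an unconstrained dimension) and the names `num_constrained` and `count_points` are assumptions made for illustration, not part of the original slides.

```python
from typing import List, Optional, Tuple

# Hypothetical encoding: one (low, high) pair per dimension, or None when the
# dimension is unconstrained. Closed intervals are an assumption of this sketch.
Interval = Optional[Tuple[float, float]]

def num_constrained(subspace: List[Interval]) -> int:
    """Return p, the number of constrained dimensions of the subspace."""
    return sum(iv is not None for iv in subspace)

def count_points(db: List[List[float]], subspace: List[Interval]) -> int:
    """Return n(S_p), the number of records of DB lying inside the subspace."""
    def inside(record: List[float]) -> bool:
        return all(iv is None or iv[0] <= x <= iv[1]
                   for x, iv in zip(record, subspace))
    return sum(1 for record in db if inside(record))

db = [[1.0, 5.0, 9.0], [2.0, 6.0, 1.0], [8.0, 5.5, 3.0]]
s2 = [(0.0, 3.0), (4.0, 7.0), None]  # a 2-subspace of a 3-dimensional space
print(num_constrained(s2), count_points(db, s2))
```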
7 Outline
- Motivation
- Related Work
- Contributions
8 Condensed representation of dataset
- Find components
- Clustering, decision trees, PCA, etc.
- Data lies in different, overlapping $p$-subspaces ($p \ll d$), i.e., Subspace Mining
- Density-based (CLIQUE, ENCLUS, MAFIA, ProDenClu): impose grid over d-space, find dense projections, merge them using Apriori
- Projected (PROCLUS, ORCLUS, DOC)
- Find inter-component relationships
- Information theory: Kullback-Leibler distance
- Clustering comparison: Rand Index, Jaccard Index, VI Metric
9 Identifying similarities between condensed representations
- Dataset Similarity: FOCUS, Li et al., Parthasarathy et al., etc.
- Graph Matching
- Search: Ullman, graph edit distance, maximal common subgraph, entropy, (consistent subgraphs)
- Optimization: max p(matching | attributes), NN, GA, tabu search, LP, SA, etc.
- Flow-based: SimFlood, Blondel et al., Van Wyk et al.
Assume identical schema, dataset access
Labeled/discrete attributed datasets
10 Outline
- Motivation
- Related Work
- Contributions
- Condensed Representation
- Finding subspaces in a dataset
- Finding relationships between subspaces
- Similarities between Condensed Representations
11 Drawbacks of density-based subspace mining methods
- $S_p$ is interesting if $n(S_p) > sn$
- regardless of $S_p$'s volume!
- So, the threshold is set very low to find interesting subspaces, especially high-dimensional ones
$S_p$ is interesting if $n(S_p)$ is statistically significantly higher than expected, assuming the dimensions are independently distributed
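12 Interestingness Measure
- For given $S_p$, let $X_p$ be the r.v. for $n(S_p)$; $\tau \ll 1$ is user-specified
- $\Pr(X_p \ge n(S_p)) \le \tau \Rightarrow S_p$ is an interesting subspace
[Figure: distribution of $X_p$, with $E[X_p]$, $E[X_p] + nt$, and $\tau$ marked]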
13 Chernoff-Hoeffding bound
- If $Y_i$, $i = 1 \ldots n$, are independently distributed r.v.,
- $0 \le Y_i \le 1$, $Var[Y_i] < \infty$, $Y = \sum_{i=1}^{n} Y_i$, $t > 0$,
- $\Pr[Y \ge E[Y] + nt] \le \exp(-2nt^2)$
- For a given $S_p$, let $Y_i = 1$ if $r_i \in S_p$, and $0$ otherwise
- Then $Y = X_p$
- $S_p$ is interesting if $\Pr(X_p \ge n(S_p)) \le \exp(-2nt^2) \le \tau$
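As a worked illustration of the bound, solving $\exp(-2nt^2) \le \tau$ for $t$ gives $t \ge \sqrt{-\ln(\tau)/(2n)}$, where $\tau$ is the user-specified bound on the tail probability. The sketch below evaluates this; the function names and the example numbers are hypothetical.

```python
import math

def hoeffding_tail_bound(n: int, t: float) -> float:
    """Upper bound exp(-2 n t^2) on Pr[Y >= E[Y] + n t]."""
    return math.exp(-2.0 * n * t * t)

def smallest_deviation(n: int, tau: float) -> float:
    """Smallest t for which exp(-2 n t^2) <= tau."""
    return math.sqrt(-math.log(tau) / (2.0 * n))

def is_interesting(count: int, expected: float, n: int, tau: float) -> bool:
    """True if the observed count exceeds its expectation by at least n * t."""
    return count >= expected + n * smallest_deviation(n, tau)

n, tau = 10000, 0.01          # illustrative values only
t = smallest_deviation(n, tau)
print(t, hoeffding_tail_bound(n, t))
print(is_interesting(300, 100.0, n, tau), is_interesting(200, 100.0, n, tau))
```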
14 Interestingness Measure
- $S_p$ is interesting if $\frac{n(S_p)}{n} \ge \frac{E[X_p]}{n} + \sqrt{\frac{-\ln(\tau)}{2n}}$
- CLIQUE: $n(S_p) \ge ns$
- MAFIA: $n(S_p) \ge \alpha E[X_p]$
- E.g., if $A_i$ is uniformly distributed and $S_p$ is constrained to a single interval in $p$ dimensions, $S_p$ is interesting if $\frac{n(S_p)}{n} \ge \frac{1}{\xi^p} + \sqrt{\frac{-\ln(\tau)}{2n}}$ (nonlinear, strictly non-increasing in $p$)
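The three thresholds can be compared side by side as fractions of $n$. This is a rough sketch assuming uniformly distributed dimensions split into a fixed number of intervals (`xi`), so that $E[X_p]/n = 1/\xi^p$; the function names and parameter values are illustrative, not taken from the slides.

```python
import math

def clique_threshold(s: float, p: int) -> float:
    """Fixed support fraction s, independent of p."""
    return s

def mafia_threshold(alpha: float, xi: int, p: int) -> float:
    """alpha times the expected fraction under the uniform assumption."""
    return alpha * (1.0 / xi) ** p

def chernoff_threshold(xi: int, tau: float, n: int, p: int) -> float:
    """Expected fraction plus the Chernoff-Hoeffding deviation term."""
    return (1.0 / xi) ** p + math.sqrt(-math.log(tau) / (2.0 * n))

xi, tau, n = 10, 0.01, 10000   # illustrative values only
values = [chernoff_threshold(xi, tau, n, p) for p in range(1, 6)]
for p, v in enumerate(values, start=1):
    print(p, clique_threshold(0.05, p), mafia_threshold(1.5, xi, p), v)
```

The last column decreases with $p$ but never falls below the deviation term, whereas the fixed threshold ignores $p$ and the purely multiplicative one shrinks toward zero.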
15 SCHISM
- SCHISM(DB, $\xi$, $\tau$)
- DDB = Discretise(DB, $\xi$)
- VDB = HorizontalToVertical(DDB)
- MIS = MineSubspaces(VDB, $\xi$, $\tau$)
- AssignPoints(DB, MIS)
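The first two steps of the pipeline can be sketched as follows. Equi-width binning over a known range and the `(dimension, interval)` item encoding are assumptions of this sketch; the slides do not specify how Discretise or HorizontalToVertical are implemented.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple

def discretise(db: List[List[float]], xi: int, lo: float, hi: float) -> List[List[int]]:
    """Map every value to one of xi equi-width interval ids in [0, xi)."""
    width = (hi - lo) / xi
    return [[min(int((x - lo) / width), xi - 1) for x in record] for record in db]

def horizontal_to_vertical(ddb: List[List[int]]) -> Dict[Tuple[int, int], Set[int]]:
    """Map each (dimension, interval) item to the ids of the records containing it."""
    vdb = defaultdict(set)
    for rid, record in enumerate(ddb):
        for dim, interval in enumerate(record):
            vdb[(dim, interval)].add(rid)
    return dict(vdb)

db = [[5.0, 95.0], [12.0, 90.0], [55.0, 10.0]]
ddb = discretise(db, xi=10, lo=0.0, hi=100.0)
vdb = horizontal_to_vertical(ddb)
# The support of a 2-subspace is the size of the intersection of its items' id sets.
support = len(vdb[(0, 0)] & vdb[(1, 9)])
print(ddb, support)
```

With the vertical layout, the count of any candidate subspace is obtained by intersecting id sets, which suits a depth-first search over subspaces.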
16 Synthetic datasets
- Multivariate Gaussian clusters
- $\mu$ (constrained dim) $\sim U[0, 1000]$
- $\sigma$ (constrained dim) $= 20$
- $p$ = P(dimension is constrained) = .5
- $o$ = P(dimension in adjacent clusters is constrained) = .5
- Imbalance ratio $\max_{x \in X} n(x) / \min_{x \in X} n(x)$, where $X$ is the set of embedded subspaces
- 5% noise, i.e., multivariate $U[0, 1000]$
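A generator along these lines might look like the sketch below. It covers only a single cluster plus uniform noise and ignores the overlap between adjacent clusters and the cluster-size imbalance; the function names, the seed, and the choice of 95 cluster points with 5 noise points are assumptions for illustration.

```python
import random
from typing import List, Tuple

def make_cluster(n_points: int, d: int, rng: random.Random,
                 p_constrained: float = 0.5, sigma: float = 20.0,
                 lo: float = 0.0, hi: float = 1000.0) -> Tuple[List[List[float]], List[bool]]:
    """Gaussian in constrained dimensions, uniform over [lo, hi] elsewhere."""
    constrained = [rng.random() < p_constrained for _ in range(d)]
    means = [rng.uniform(lo, hi) for _ in range(d)]
    points = [[rng.gauss(means[j], sigma) if constrained[j] else rng.uniform(lo, hi)
               for j in range(d)]
              for _ in range(n_points)]
    return points, constrained

def make_noise(n_points: int, d: int, rng: random.Random,
               lo: float = 0.0, hi: float = 1000.0) -> List[List[float]]:
    """Noise points drawn uniformly in every dimension."""
    return [[rng.uniform(lo, hi) for _ in range(d)] for _ in range(n_points)]

rng = random.Random(42)                 # arbitrary seed for repeatability
cluster, constrained = make_cluster(95, 8, rng)
noise = make_noise(5, 8, rng)           # 5% of the 100 points
dataset = cluster + noise
print(len(dataset), sum(constrained))
```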
17 MineSubspaces - Example
Depth-first search
18 Assigning points to subspaces
- MergeSubspaces(MIS, $S_p$)
- If $\max_j \rho(S_p, MIS_j) > ThresholdSim$
- $MIS_{\arg\max_j \rho(S_p, MIS_j)} = MIS_{\arg\max_j \rho(S_p, MIS_j)} \cup S_p$ (allows merging intervals in a dim)
- AssignPoints(DB, MIS)
- For each point $r_i \in DB$
- if $\max_j \rho(r_i, MIS_j) > \frac{d}{\xi} + \sqrt{\frac{-d\ln(\tau)}{2}}$
- $r_i \in MIS_{\arg\max_j \rho(r_i, MIS_j)}$
- else $r_i$ is an outlier
- $\rho(A, B) = \sum_{i=1}^{d} \frac{|A_i \cap B_i|}{|A_i \cup B_i|}$
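The assignment step can be sketched as below, assuming the similarity is the per-dimension overlap ratio summed over dimensions and that the assignment threshold is $d/\xi + \sqrt{-d\ln(\tau)/2}$. Representing a subspace as one set of interval ids per dimension (all `xi` ids when unconstrained) and the names `rho`, `threshold_sim`, and `assign_points` are choices made for this sketch.

```python
import math
from typing import List, Optional, Set

def rho(a: List[Set[int]], b: List[Set[int]]) -> float:
    """Sum over dimensions of |A_i intersect B_i| / |A_i union B_i|."""
    return sum(len(x & y) / len(x | y) for x, y in zip(a, b))

def threshold_sim(d: int, xi: int, tau: float) -> float:
    """Assumed threshold: chance-level similarity d/xi plus a deviation term."""
    return d / xi + math.sqrt(-d * math.log(tau) / 2.0)

def assign_points(ddb: List[List[int]], mis: List[List[Set[int]]],
                  xi: int, tau: float) -> List[Optional[int]]:
    """Label each discretised point with its most similar subspace, or None (outlier)."""
    thr = threshold_sim(len(ddb[0]), xi, tau)
    labels: List[Optional[int]] = []
    for record in ddb:
        point = [{v} for v in record]
        scores = [rho(point, s) for s in mis]
        best = max(range(len(mis)), key=lambda j: scores[j])
        labels.append(best if scores[best] > thr else None)
    return labels

full = set(range(10))                                  # unconstrained dimension, xi = 10
mis = [[{1}, {2}, {3}, full], [{7}, {8}, full, {9}]]
ddb = [[1, 2, 3, 5], [7, 8, 0, 9], [1, 2, 9, 9]]
labels = assign_points(ddb, mis, xi=10, tau=0.05)
print(labels)
```

An unconstrained dimension contributes $1/\xi$ to a point's similarity, which is why $d/\xi$ acts as the chance-level baseline in the threshold.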
19 Evaluation Metrics
- Clustering C is evaluated using
- Running time (minutes)
- Entropy: $E(C) = -\sum_{C_j} \left( \frac{n_j}{n} \sum_i p_{ij} \log(p_{ij}) \right)$, where $p_{ij} = n_{ij}/n_j$ and $n_{ij}$ is the number of points in the $j$th cluster $C_j$ of C actually belonging to subspace $i$.
- Coverage: the fraction of DB correctly labeled as not being outliers.
- Ideally, $E(C) = 0$, $Coverage(C) = 1.0$
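Both metrics are straightforward to compute from predicted and true labels. The sketch below assumes $p_{ij}$ is the within-cluster proportion $n_{ij}/n_j$, takes $n$ as the number of points assigned to some cluster, and reads coverage as the share of true non-noise points that are not labeled outliers; these readings and the function names are assumptions.

```python
import math
from collections import Counter
from typing import Hashable, List, Optional

def clustering_entropy(assigned: List[Optional[int]],
                       truth: List[Optional[Hashable]]) -> float:
    """Size-weighted entropy of the true labels inside each found cluster."""
    clusters = {}
    for a, t in zip(assigned, truth):
        if a is not None:
            clusters.setdefault(a, Counter())[t] += 1
    n = sum(sum(c.values()) for c in clusters.values())
    total = 0.0
    for counts in clusters.values():
        n_j = sum(counts.values())
        h = -sum((c / n_j) * math.log(c / n_j) for c in counts.values())
        total += (n_j / n) * h
    return total

def coverage(assigned: List[Optional[int]], truth: List[Optional[Hashable]]) -> float:
    """Fraction of true non-noise points that were assigned to some cluster."""
    real = [a for a, t in zip(assigned, truth) if t is not None]
    return sum(a is not None for a in real) / len(real)

pure = clustering_entropy([0, 0, 1, 1], ["a", "a", "b", "b"])
mixed = clustering_entropy([0, 0], ["a", "b"])
cov = coverage([0, None, 1, None], ["a", "a", "b", None])
print(pure, mixed, cov)
```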
20 Experiments (Synthetic datasets)
[Figure panels: performance v/s n; performance v/s d; performance v/s p]
21 Experiments
[Figure panels: effect of overlapping among subspace dimensions; effect of constraining subspaces to share points; performance on embedded hyper-rectangular subspaces]
22 Experiments
[Figure panels: effect of k; SCHISM v/s CLIQUE]
23 Experiments (Real Datasets)
- Pendigits DB has 7494 16-dim vectors
- Microarray DB has 2884 17-dim vectors
24 Outline
- Motivation
- Related Work
- Contributions
- Condensed Representation
- Finding subspaces in a dataset
- Finding inter-component relationships
- Similarities between Condensed Representations
25 Inter-component relationships for structural matching
- Each subspace is represented as a set of histograms
- Dot product: $sim(a,b) = ab$
- Gaussian weighted: $sim(a,b) = \exp(-(a-b)^2/(2s^2))$
- Increasing weighted: $sim(a,b) = 1/(1 + |a-b|/s)$
- Sigmoidal weighted: $sim(a,b) = 2\log(\cosh(\pi|a-b|/s))$
$W(u,v) = \frac{\sum_{r \in u} \rho(r,v)}{|u|} + \frac{\sum_{r \in v} \rho(r,u)}{|v|}$
$W(u,v) = \frac{1}{d\xi} \sum_{i=1}^{d} \sum_{j=1}^{\xi} sim(u(i,j), v(i,j))$
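The histogram-based weight can be sketched as an average of bin-wise similarities. The sketch assumes each subspace is stored as `d` histograms with the same number of bins, and it implements only the Gaussian and increasing weighting functions; the names and the default scale `s = 1.0` are illustrative.

```python
import math
from typing import Callable, List

def sim_gaussian(a: float, b: float, s: float = 1.0) -> float:
    """Gaussian-weighted similarity exp(-(a - b)^2 / (2 s^2))."""
    return math.exp(-((a - b) ** 2) / (2.0 * s * s))

def sim_increasing(a: float, b: float, s: float = 1.0) -> float:
    """Increasing-weighted similarity 1 / (1 + |a - b| / s)."""
    return 1.0 / (1.0 + abs(a - b) / s)

def histogram_weight(u: List[List[float]], v: List[List[float]],
                     sim: Callable[[float, float], float]) -> float:
    """Average of sim(u(i, j), v(i, j)) over all d dimensions and all bins."""
    d, bins = len(u), len(u[0])
    total = sum(sim(u[i][j], v[i][j]) for i in range(d) for j in range(bins))
    return total / (d * bins)

u = [[0.5, 0.5], [1.0, 0.0]]
same = histogram_weight(u, u, sim_gaussian)
diff = histogram_weight([[0.5, 0.5]], [[1.0, 0.0]], sim_increasing)
print(same, diff)
```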
26 Outline
- Motivation
- Related Work
- Contributions
- Condensed Representation
- Similarities between Condensed Representations
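27 Similarities between condensed representations
- $\forall u, u' \in V_a$, $\forall v, v' \in V_b$,
- $W((u,v),(u',v')) = sim(w(u,u'), w(v,v'))$ if $sim(w(u,u'), w(v,v')) > \epsilon$, and $0$ otherwise
- PairwiseAlignGraphs($G_a$, $G_b$, $\epsilon$, $k$)
- create $G = (V_a \times V_b,\ E \subseteq (V_a \times V_b) \times (V_a \times V_b),\ W)$
- set $Sim_0$ to $\mathbf{1}$
- for $i = 1 \ldots k$: $Sim_i = W\,Sim_{i-1}$
- output HungarianMatch($Sim$)
[Figure: example graphs $G_a$ and $G_b$]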
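The alignment procedure can be sketched on two tiny weighted graphs. Several choices here are assumptions rather than details from the slides: a Gaussian edge-weight similarity, rescaling the score vector after every multiplication, skipping pairs that share a vertex, and a brute-force search over one-to-one matchings standing in for the Hungarian method (feasible only for very small graphs).

```python
import math
from itertools import permutations
from typing import List

def gaussian_sim(a: float, b: float, s: float = 0.1) -> float:
    return math.exp(-((a - b) ** 2) / (2.0 * s * s))

def pairwise_align_graphs(wa: List[List[float]], wb: List[List[float]],
                          eps: float, k: int) -> List[int]:
    """Return match[u] = vertex of G_b aligned with vertex u of G_a (needs |Va| <= |Vb|)."""
    na, nb = len(wa), len(wb)
    pairs = [(u, v) for u in range(na) for v in range(nb)]

    def weight(p, q):
        (u, v), (u2, v2) = p, q
        if u == u2 or v == v2:          # no edge between pairs sharing a vertex
            return 0.0
        s = gaussian_sim(wa[u][u2], wb[v][v2])
        return s if s > eps else 0.0

    W = {p: {q: weight(p, q) for q in pairs} for p in pairs}
    sim = {p: 1.0 for p in pairs}       # Sim_0 = 1
    for _ in range(k):                  # Sim_i = W Sim_{i-1}, rescaled (assumption)
        new = {p: sum(W[p][q] * sim[q] for q in pairs) for p in pairs}
        top = max(new.values()) or 1.0
        sim = {p: x / top for p, x in new.items()}
    # Brute-force stand-in for HungarianMatch: maximise the total matched score.
    best = max(permutations(range(nb), na),
               key=lambda perm: sum(sim[(u, perm[u])] for u in range(na)))
    return list(best)

wa = [[0.0, 0.9, 0.1], [0.9, 0.0, 0.5], [0.1, 0.5, 0.0]]
# G_b is G_a with its vertices relabelled 0 -> 2, 1 -> 0, 2 -> 1.
wb = [[0.0, 0.5, 0.9], [0.5, 0.0, 0.1], [0.9, 0.1, 0.0]]
match = pairwise_align_graphs(wa, wb, eps=0.01, k=5)
print(match)
```

Pairs that agree with the true relabelling reinforce each other at every iteration, so their scores come to dominate before the matching step.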
28 Evaluation Metrics
- Clustering C is evaluated using
- Running time (seconds)
- Z-score: ideally very negative
- Number of matches
29 Experiments (Synthetic datasets)
30 Experiments
31 Solving for multiple graphs
- Which set $F \subseteq E$ of edges, having minimum sum of weights, must be removed from the k-partite graph $G$, while ensuring that each resulting connected component contains vertices from distinct underlying graphs?
- Consider the Minimum Multicut problem (GVY 93)
- Given an undirected graph $G = (V, E)$, $t$ pairs of vertices $(r_i, s_i)$, $i = 1 \ldots t$, and costs $c_e \ge 0$ for each edge $e \in E$, find a minimum-cost set $F \subseteq E$ such that $\forall i$, $r_i$ and $s_i$ are in different connected components of $G' = (V, E - F)$
- IP: $\min \sum_{e \in E} Sim(e)\,x_e$ s.t.
- $\sum_{e \in P} x_e \ge 1 \quad \forall P \in P_{uv},\ \forall u \in V_i,\ v \in V \setminus V_i,\ i \in [1, t]$
- $x_e \in \{0, 1\}$
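For intuition, the Minimum Multicut problem can be solved exhaustively on a toy graph. The sketch below enumerates every edge subset, so it is exponential in the number of edges and only illustrates the problem statement; it is not the integer program above nor an approximation algorithm, and the function names and example costs are hypothetical.

```python
from itertools import combinations
from typing import Dict, List, Set, Tuple

Edge = Tuple[int, int]

def connected(n: int, edges: List[Edge], src: int, dst: int) -> bool:
    """Depth-first search reachability in an undirected graph on vertices 0..n-1."""
    adj = {i: [] for i in range(n)}
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)
    seen, stack = {src}, [src]
    while stack:
        x = stack.pop()
        for y in adj[x]:
            if y not in seen:
                seen.add(y)
                stack.append(y)
    return dst in seen

def min_multicut(n: int, cost: Dict[Edge, float],
                 pairs: List[Tuple[int, int]]) -> Tuple[Set[Edge], float]:
    """Cheapest edge set whose removal separates every (r_i, s_i) pair."""
    edges = list(cost)
    best, best_cost = set(), float("inf")
    for r in range(len(edges) + 1):
        for removed in combinations(edges, r):
            kept = [e for e in edges if e not in removed]
            if all(not connected(n, kept, a, b) for a, b in pairs):
                c = sum(cost[e] for e in removed)
                if c < best_cost:
                    best, best_cost = set(removed), c
    return best, best_cost

# A 4-cycle 0-1-2-3-0; vertices 0 and 2 must end up in different components.
cost = {(0, 1): 3.0, (1, 2): 1.0, (2, 3): 2.0, (0, 3): 4.0}
cut, cut_cost = min_multicut(4, cost, [(0, 2)])
print(cut, cut_cost)
```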
32 Future Work
- Extending to multiple datasets
- Experiments on real datasets
- Defining an unusual match statistically