1
Exploring similarities in high-dimensional
datasets
  • Karlton Sequeira
  • Computer Science Department
  • RPI, Troy, New York

2
Outline
  • Motivation
  • Related Work
  • Contributions

3
Motivation
  • Doctors sharing similar patient information
  • Identifying common segments across different
    markets
  • Schema matching
  • Protein alignment
  • Dataset evolution

4
Challenges
  • Transformations: rotation, translation,
    permutation, dilation
  • High dimensionality: sparsity, irrelevant dimensions
  • Heterogeneity: different schemas, nominal/continuous
    variables
  • Computational complexity
  • Interpretability

5
Proposed Thesis
  • We propose to detect similarities across datasets
    by representing them condensely and identifying
    similarities between these condensed
    representations.

6
Preliminaries
  • Let U = A1 × A2 × … × Ad be the
    high-dimensional space.
  • DB = {ri | i ∈ [1, n]}, ri ∈ U.
  • A subspace is a grid-aligned hyper-rectangle,
    [l1, h1] × [l2, h2] × … × [ld, hd] ⊆ U, [li, hi] ⊆ Ai
  • A dimension is constrained if [li, hi] ⊂ Ai
  • Sp: a subspace constrained in p dimensions, i.e., a
    p-subspace
  • n(Sp): the number of points in Sp
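These definitions can be sketched as a small helper (hypothetical names, not from the talk): a subspace is stored as its constrained per-dimension intervals, and n(Sp) is a membership count over DB.

```python
# A minimal sketch of the preliminaries: a grid-aligned subspace as a dict of
# constrained dimensions {dim: (l, h)}; unconstrained dimensions are omitted.
def n_points(subspace, db):
    """Count the points of db lying inside every constrained interval."""
    return sum(
        all(l <= r[i] <= h for i, (l, h) in subspace.items())
        for r in db
    )

db = [(1.0, 5.0), (2.0, 9.0), (8.0, 4.0)]
s2 = {0: (0.0, 3.0), 1: (0.0, 10.0)}   # a 2-subspace
# n(S2) counts points with A1 in [0, 3] and A2 in [0, 10]
```

An empty dict represents the unconstrained subspace, so `n_points({}, db)` returns |DB|.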

7
Outline
  • Motivation
  • Related Work
  • Contributions

8
Condensed representation of dataset
  • Find components
  • Clustering, decision trees, PCA, etc.
  • Data lies in different, overlapping p-subspaces
    (p << d), i.e., Subspace Mining
  • Density-based (CLIQUE, ENCLUS, MAFIA, ProDenClu):
    impose a grid over the d-space, find dense
    projections, merge them using Apriori
  • Projected (PROCLUS, ORCLUS, DOC)
  • Find inter-component relationships
  • Information theory: Kullback-Leibler distance
  • Clustering comparison: Rand Index, Jaccard Index,
    VI metric

9
Identifying similarities between condensed
representations
  • Dataset Similarity: FOCUS, Li et al.,
    Parthasarathy et al., etc.
  • Graph Matching
  • Search: Ullman, graph edit distance, maximal
    common subgraph, entropy, (consistent subgraphs)
  • Optimization: max p(matching attributes), NN, GA,
    tabu search, LP, SA, etc.
  • Flow-based: SimFlood, Blondel et al., Van Wyk et
    al.

(these assume identical schema and dataset access;
labeled/discrete-attributed datasets)
10
Outline
  • Motivation
  • Related Work
  • Contributions
  • Condensed Representation
  • Finding subspaces in a dataset
  • Finding relationships between subspaces
  • Similarities between Condensed Representations

11
Drawbacks of density-based subspace mining methods
  • Sp is interesting ⇔ n(Sp) > s·n,
  • regardless of Sp's volume!
  • So, the threshold s is set very low to find
    interesting subspaces, especially
    high-dimensional ones

Instead, Sp is interesting if n(Sp) is statistically
significantly higher than expected, assuming the
dimensions are independently distributed
12
Interestingness Measure
  • For a given Sp, let Xp be the r.v. for n(Sp); α << 1
    is user-specified
  • Pr(Xp ≥ n(Sp)) ≤ α ⇒ Sp is an interesting
    subspace
  • Write n(Sp) = E(Xp) + nt, so the deviation is
    t = (n(Sp) − E(Xp)) / n

13
Chernoff-Hoeffding bound
  • If Yi, i = 1…n, are independently distributed r.v.s,
    0 ≤ Yi ≤ 1, Var(Yi) < ∞, Y = Σi=1…n Yi, t > 0, then
    Pr(Y ≥ E(Y) + nt) ≤ exp(−2nt²)
  • For a given Sp, let Yi = 1 if ri ∈ Sp,
    0 otherwise
  • Then Y = Xp
  • Sp is interesting ⇐ Pr(Xp ≥ n(Sp)) ≤ exp(−2nt²) ≤ α
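The resulting test can be sketched directly, assuming E(Xp), n, and α are known (`is_interesting` is a hypothetical name):

```python
import math

# Hedged sketch of the interestingness test: Sp is flagged interesting when
# the Chernoff-Hoeffding tail bound on Pr(Xp >= n(Sp)) falls below alpha.
# With n(Sp) = E(Xp) + n*t, we get t = (n(Sp) - E(Xp)) / n.
def is_interesting(n_sp, e_xp, n, alpha):
    if n_sp <= e_xp:          # no positive deviation: the bound is vacuous
        return False
    t = (n_sp - e_xp) / n
    return math.exp(-2 * n * t * t) <= alpha
```

For example, with n = 1000 points and an expected count of 10, a subspace holding 100 points is flagged at α = 0.01, while one holding 12 is not.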

14
Interestingness Measure
  • E.g., if Ai is uniformly distributed and Sp is
    constrained to a single interval in p dimensions,
  • Sp is interesting ⇔ n(Sp) ≥ E(Xp) + √(n(−ln α)/2),
    i.e., n(Sp) ≥ n(1/ξ^p + √(−ln(α)/(2n)))
  • CLIQUE: n(Sp) ≥ n·s
  • MAFIA: n(Sp) ≥ β·E(Xp)
  • SCHISM's threshold is nonlinear and strictly
    non-increasing in p
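The uniform-case threshold can be sketched as follows, assuming ξ equal-width intervals per dimension so that E(Xp) = n/ξ^p; evaluating it for growing p shows it is non-increasing, unlike CLIQUE's fixed n·s.

```python
import math

# Hedged sketch of the uniform-case threshold:
#   n(Sp) >= n * (xi**-p + sqrt(-ln(alpha) / (2*n)))
# The grid term shrinks with p, so deeper subspaces need fewer points.
def schism_threshold(n, xi, alpha, p):
    return n * (xi ** -p + math.sqrt(-math.log(alpha) / (2 * n)))

thresholds = [schism_threshold(10000, 10, 0.001, p) for p in range(1, 8)]
```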
15
SCHISM
  • SCHISM(DB, ξ, α)
  •   DDB = Discretise(DB, ξ)
  •   VDB = HorizontalToVertical(DDB)
  •   MIS = MineSubspaces(VDB, ξ, α)
  •   AssignPoints(DB, MIS)
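The first two pipeline steps can be sketched as below (grid discretisation and the usual tid-list vertical format); MineSubspaces itself, the depth-first Apriori-style search, is left abstract. The function names and the [0, 1000] attribute range are assumptions for illustration.

```python
# Hedged skeleton of the SCHISM preprocessing steps.
def discretise(db, xi, lo=0.0, hi=1000.0):
    """Map each coordinate to one of xi equal-width intervals."""
    width = (hi - lo) / xi
    return [
        tuple(min(int((x - lo) / width), xi - 1) for x in r)
        for r in db
    ]

def horizontal_to_vertical(ddb):
    """Vertical format: (dim, interval) item -> set of point ids (tid-list)."""
    vdb = {}
    for tid, r in enumerate(ddb):
        for dim, interval in enumerate(r):
            vdb.setdefault((dim, interval), set()).add(tid)
    return vdb
```

Tid-lists make the support of a candidate subspace a simple set intersection during mining.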

16
Synthetic datasets
  • Multivariate Gaussian clusters
  • μ (constrained dim) ~ U[0, 1000]
  • σ (constrained dim) = 20
  • p = P(dimension is constrained) = 0.5
  • o = P(dimension in adjacent clusters are
    constrained) = 0.5
  • κ = maxx∈X n(x) / minx∈X n(x), where X is the set
    of embedded subspaces
  • 5% noise, i.e., multivariate U[0, 1000]
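A hedged sketch of one such cluster generator, following the parameters above (`make_cluster` is a hypothetical name):

```python
import random

# Sketch of the synthetic data: each dimension of a cluster is constrained
# with probability p_c (mean ~ U[0,1000], sigma = 20); unconstrained
# dimensions, like the noise points, are drawn from U[0, 1000].
def make_cluster(n_points, d, p_c=0.5, sigma=20.0):
    constrained = [random.random() < p_c for _ in range(d)]
    means = [random.uniform(0, 1000) for _ in range(d)]
    return [
        tuple(
            random.gauss(means[i], sigma) if constrained[i]
            else random.uniform(0, 1000)
            for i in range(d)
        )
        for _ in range(n_points)
    ]
```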

17
MineSubspaces - Example
Depth-first search
18
Assigning points to subspaces
  • MergeSubspaces(MIS, Sp)
  •   if maxj ρ(Sp, MISj) > ThresholdSim
  •     MISargmaxj ρ(Sp,MISj) = MISargmaxj ρ(Sp,MISj) ∪ Sp
  • AssignPoints(DB, MIS)
  •   for each point ri ∈ DB
  •     if maxj ρ(ri, MISj) > threshold
  •       ri → MISargmaxj ρ(ri,MISj)
  •     else ri is an outlier
  • ρ(A,B) = Σi=1…d |Ai ∩ Bi| / |Ai ∪ Bi|

(allows merging intervals in a dim)
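The similarity ρ can be sketched as a per-dimension Jaccard overlap of interval sets, summed over dimensions; representing a subspace as a dict of interval-id sets is an assumption for illustration.

```python
# Hedged sketch of rho(A, B): per-dimension Jaccard overlap of interval sets,
# summed over the d dimensions; unconstrained dimensions contribute nothing.
def rho(a, b, d):
    """a, b: dict {dim: set of interval ids}; d: number of dimensions."""
    total = 0.0
    for i in range(d):
        ai, bi = a.get(i, set()), b.get(i, set())
        if ai or bi:
            total += len(ai & bi) / len(ai | bi)
    return total

a = {0: {1, 2}, 1: {3}}
b = {0: {2}, 1: {3}}
# dim 0 contributes |{2}|/|{1,2}| = 0.5, dim 1 contributes 1.0
```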
19
Evaluation Metrics
  • Clustering C is evaluated using:
  • Running time (minutes)
  • Entropy: E(C) = −ΣCj (nj/n) Σi pij log(pij),
    pij = nij/nj, where nij is the number of points in
    the jth cluster Cj of C actually belonging to
    subspace i
  • Coverage: the fraction of DB correctly labeled as
    not being outliers
  • Ideally, E(C) = 0 and Coverage(C) = 1.0
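Both metrics can be computed as follows, assuming the per-cluster counts n_ij of true subspace labels are available (function names are hypothetical):

```python
import math

# Hedged sketch of the evaluation metrics: E(C) averages the entropy of the
# true-subspace label distribution within each cluster, weighted by cluster
# size; coverage is the fraction of points correctly kept as non-outliers.
def entropy(clusters, n):
    """clusters: list of dicts {true_subspace_id: n_ij}; n: total points."""
    e = 0.0
    for counts in clusters:
        nj = sum(counts.values())
        e -= (nj / n) * sum(
            (nij / nj) * math.log(nij / nj) for nij in counts.values()
        )
    return e

def coverage(n_non_outliers_correct, n):
    return n_non_outliers_correct / n
```

Pure clusters (each drawn from a single embedded subspace) give E(C) = 0.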

20
Experiments (Synthetic datasets)
(figure panels: Performance vs. n; Performance vs. d;
Performance vs. κ; Performance vs. p)
21
Experiments
(figure panels: Effect of overlapping among subspace
dimensions; Effect of constraining subspaces to share
points; Effect of σ; Performance on embedded
hyper-rectangular subspaces)
22
Experiments
(figure panels: Effect of k; Effect of ξ; Effect of α;
SCHISM vs. CLIQUE)
23
Experiments (Real Datasets)
Pendigits DB has 7494 16-dim vectors; Microarray
DB has 2884 17-dim vectors
24
Outline
  • Motivation
  • Related Work
  • Contributions
  • Condensed Representation
  • Finding subspaces in a dataset
  • Finding inter-component relationships
  • Similarities between Condensed Representations

25
Inter-component relationships
for structural matching
  • Each subspace is represented as a set of
    histograms
  • Dot product: sim(a,b) = a·b
  • Gaussian weighted: sim(a,b) = exp(−(a−b)²/(2s²))
  • Increasing weighted: sim(a,b) = 1/(1 + |a−b|/s)
  • Sigmoidal weighted: sim(a,b) = 2 − log(cosh(γ|a−b|/s))

W(u,v) = (Σr∈u ρ(r,v) + Σr∈v ρ(r,u)) / (|u| + |v|)
W(u,v) = (1/(dξ)) Σi=1…d Σj=1…ξ sim(u(i,j), v(i,j))
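The four bin-similarity kernels can be sketched directly (s and γ are scale parameters; the sigmoidal form here assumes sim(a,b) = 2 − log(cosh(γ|a−b|/s)), a reconstruction):

```python
import math

# Hedged sketch of the bin-similarity kernels applied to matching
# histogram-bin counts a and b; all peak when a == b.
def sim_dot(a, b):
    return a * b

def sim_gaussian(a, b, s=1.0):
    return math.exp(-((a - b) ** 2) / (2 * s * s))

def sim_increasing(a, b, s=1.0):
    return 1.0 / (1.0 + abs(a - b) / s)

def sim_sigmoidal(a, b, s=1.0, gamma=1.0):
    return 2.0 - math.log(math.cosh(gamma * abs(a - b) / s))
```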
26
Outline
  • Motivation
  • Related Work
  • Contributions
  • Condensed Representation
  • Similarities between Condensed Representations

27
Similarities between condensed representations
  • ∀u,u′ ∈ Va, ∀v,v′ ∈ Vb,
  • W((u,v),(u′,v′)) = sim(w(u,u′), w′(v,v′))
    if sim(w(u,u′), w′(v,v′)) > ε,
    and 0 otherwise
  • PairwiseAlignGraphs(Ga, Gb, ε, k)
  •   create G = (Va × Vb, E ⊆ (Va × Vb) × (Va × Vb), W)
  •   set Sim0 to 1
  •   for i = 1…k: Simi = W·Simi−1
  •   output HungarianMatch(Simk)

(figure: graphs Ga and Gb)
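A hedged sketch of PairwiseAlignGraphs on small edge-weight matrices: iterate Simi = W·Simi−1 on the pair graph, then extract a one-to-one matching. A greedy extraction stands in for the Hungarian matching, and per-iteration normalisation is an added assumption to keep scores bounded.

```python
# Sketch of pairwise graph alignment over node pairs (u, v) in Va x Vb.
def align(w_a, w_b, sim_fn, k=5, eps=0.0):
    """w_a, w_b: edge-weight matrices of Ga and Gb; returns a node matching."""
    na, nb = len(w_a), len(w_b)
    pairs = [(u, v) for u in range(na) for v in range(nb)]
    sim = {p: 1.0 for p in pairs}              # Sim0 = 1
    for _ in range(k):                         # Simi = W . Simi-1
        nxt = {}
        for (u, v) in pairs:
            total = 0.0
            for (u2, v2) in pairs:
                w = sim_fn(w_a[u][u2], w_b[v][v2])
                if w > eps:                    # threshold epsilon on W entries
                    total += w * sim[(u2, v2)]
            nxt[(u, v)] = total
        norm = max(nxt.values()) or 1.0        # keep scores bounded
        sim = {p: s / norm for p, s in nxt.items()}
    # greedy one-to-one extraction (the slide uses Hungarian matching here)
    match, used_a, used_b = {}, set(), set()
    for (u, v) in sorted(pairs, key=lambda p: -sim[p]):
        if u not in used_a and v not in used_b:
            match[u] = v
            used_a.add(u)
            used_b.add(v)
    return match
```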
28
Evaluation Metrics
  • Clustering C is evaluated using:
  • Running time (seconds)
  • Z-score (ideally very negative)
  • Number of matches

29
Experiments (Synthetic datasets)
30
Experiments
31
Solving for multiple graphs
  • Which set F ⊆ E of edges, having minimum sum of
    weights, must be removed from the k-partite graph
    G, while ensuring that each resulting connected
    component contains vertices from distinct
    underlying graphs?
  • Consider the Minimum Multicut problem (GVY 93):
  • Given an undirected graph G = (V,E), t pairs of
    vertices (ri, si), i = 1…t, and costs ce ≥ 0 for
    each edge e ∈ E, find a minimum-cost set F ⊆ E
    such that ∀i, ri and si are in different connected
    components of G′ = (V, E−F)
  • IP: min Σe∈E Sim(e)·xe
    s.t. Σe∈P xe ≥ 1, ∀P ∈ Puv, ∀u ∈ Vi, v ∈ V\Vi,
    i ∈ [1, t]
    xe ∈ {0, 1}
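The feasibility condition on a removal set F can be checked with a small union-find sketch (a hypothetical helper, not from the talk): after deleting F, no connected component may contain two vertices from the same underlying graph.

```python
# Hedged feasibility check for the k-partite cut: union the surviving edges,
# then verify each component holds at most one vertex per underlying graph.
def is_feasible(vertices, edges, part, removed):
    """part: vertex -> graph index; removed: set of edges (u, v)."""
    parent = {v: v for v in vertices}

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]  # path halving
            v = parent[v]
        return v

    for (u, v) in edges:
        if (u, v) not in removed and (v, u) not in removed:
            parent[find(u)] = find(v)

    seen = {}  # component root -> graph indices present in that component
    for v in vertices:
        root = find(v)
        if part[v] in seen.setdefault(root, set()):
            return False
        seen[root].add(part[v])
    return True
```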

32
Future Work
  • Extending to multiple datasets
  • Experiments on real datasets
  • Defining an unusual match statistically