Clustering, Jan 2003, Yoon, 1 - PowerPoint PPT Presentation

1
Fast Clustering for XML Bitmap Indexes
  • J. Yoon
  • University of Louisiana at Lafayette
  • Center for Advanced Computer Studies

2
  • Motivating Examples
  • Related Work
  • Bitmap indexing: 3-dim, 2-dim, multi-dim; space
    problem (compression: saves space, no retrieval;
    clustering: uniform bits → weighted bits)
  • Clustering: distance-based (k-means, k-nearest
    neighbor), density-based (k-median),
    entropy-based () / semantic-based (LSI),
    topology-based (TOIS-Stanford, ST-based,
    Jagadish)
  • Pass: 1-pass clustering → data streams
  • Preliminaries
  • XML bitmap index
  • Weighted bits: popularity pop(), security sec()
  • Radius of a cluster: rad()
  • Thresholds: similarity, popularity, radius
  • Divide-and-conquer clustering
  • K-means → Hamming distance
  • clusters → find centroid
  • (-) takes long, O(nk) / compare with LSI in
    quality
  • Weighted-bit-based divide-and-conquer clustering
  • Popular bits / unpopular bits
  • Similarity → find centroid: similar on popular
    bits and not dissimilar on unpopular bits
  • One-pass clustering: fast centroid, O(1) / compare
    with LSI in quality and in speed

3
  • Motivating Examples
  • XML document retrieval on XML bitmap indexes is
    fast, taking almost constant time, but space
    consumption is a problem → shuffle bits to
    eliminate 0-planes → cluster: group a very large
    XML bitmap index into many smaller indexes, each
    of which is still relevant within itself.
  • Related Work
  • Bitmap indexing: 3-dim, 2-dim, multi-dim; space
    problem (compression: saves space, no retrieval;
    clustering: uniform bits → weighted bits)
  • Clustering: distance-based (k-means, k-nearest
    neighbor), density-based (k-median),
    entropy-based () / semantic-based (LSI),
    topology-based (TOIS-Stanford, ST-based,
    Jagadish)
  • Pass: 1-pass clustering → data streams
  • Preliminaries
  • XML bitmap index
  • Weighted bits: popularity pop(), security sec()
  • Radius of a cluster: rad()
  • Thresholds: similarity, popularity, radius
  • Divide-and-conquer clustering
  • K-means → Hamming distance
  • clusters → find centroid
  • (-) takes long, O(nk) / compare with LSI in
    quality
  • Weighted-XPath-based divide-and-conquer
    clustering
  • Popular bits / unpopular bits
  • Similarity → find centroid: similar on popular
    bits and not dissimilar on unpopular bits
  • Fast Clustering on Weighted Bits

4
<paper> <title>W0</title> <author>W1
<affiliate>W2</affiliate> </author>
<section>economy W3 </section> </paper>
<paper> <title>W0</title> <author>W1
<affiliate>W2</affiliate> </author>
<section>W3 <section>economy</section>
</section> </paper>
<paper> <title>economy</title> <author>W1
<affiliate>W2</affiliate> </author>
<section>W3 <section>W4</section>
</section> </paper>
<paper> <title>W0</title> <author>W1
<affiliate>economy</affiliate> </author>
<section>W3 <section>W4</section>
</section> </paper>
If full text file: economy W1 W2 W3 W4.
Similarity-based centroid is found by
Consider the bitmaps 01011100, 11011000, and 01111111.
Same words in different structures
Same words in the same structures
Same structure containing different words
Different words in different structures, but linking to the same documents
5
K-means
ALGORITHM k-Means Bitmap Clustering
INPUT bitmap index BI, number of clusters k, similarity threshold s,
radius threshold r
OUTPUT a number of smaller bitmap indexes
METHOD
(1) Select k cluster centroids ci from cluster sets C1, C2, ..., Ck
(2) For each row b (∈ BI), assign b to ci if sim(ci, b) ≥ s
(3) For each cluster, re-compute the center
(4) Continue from (1) until ∀ Ci, radius(Ci) ≤ r
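The loop above can be sketched in a few lines. This is a simplified illustration, not the deck's implementation: it assumes similarity is the fraction of agreeing bit positions (Hamming agreement), seeds the centers with the first k rows, recomputes a center by bitwise majority vote, and leaves out the thresholds s and r. The function names are hypothetical.

```python
from typing import List


def hamming_sim(a: str, b: str) -> float:
    """Fraction of bit positions on which two equal-length bitmaps agree."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def majority_center(rows: List[str]) -> str:
    """Bitwise majority vote over the rows of one cluster (ties give 1)."""
    n = len(rows)
    return "".join(
        "1" if 2 * sum(r[i] == "1" for r in rows) >= n else "0"
        for i in range(len(rows[0]))
    )


def kmeans_bitmaps(rows: List[str], k: int, rounds: int = 10) -> List[List[str]]:
    centers = rows[:k]  # assumption: seed with the first k rows
    clusters: List[List[str]] = []
    for _ in range(rounds):
        clusters = [[] for _ in centers]
        for b in rows:
            # Assign each row to the most similar center.
            best = max(range(len(centers)), key=lambda i: hamming_sim(centers[i], b))
            clusters[best].append(b)
        new_centers = [
            majority_center(c) if c else centers[i] for i, c in enumerate(clusters)
        ]
        if new_centers == centers:  # centers stopped moving
            break
        centers = new_centers
    return clusters


# The three bitmaps quoted on the motivating-example slide.
print(kmeans_bitmaps(["01011100", "11011000", "01111111"], 2))
```

Every round scans all rows against all k centers, which is the O(nk) cost the outline lists as the drawback of this approach.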
6
Divide-and-Conquer k-Means
ALGORITHM Bitmap-based Divide-and-Conquer Clustering using k-Means
INPUT bitmap index BI, number of clusters k, popularity threshold p,
similarity threshold s, radius threshold r
OUTPUT a number of smaller bitmap indexes
METHOD
Let I be a set of integers
Let proj(B,I) be a projected bitmap from a bitmap B, where I denotes
positions in B
(1) For each bit b, if pop(b) ≥ p, then add b to the popular-bitset P
(2) For each cluster's centroid ci in k clusters
(3) do For any two rows b (∈ BI),
(4) if sim(proj(ci,P), proj(b,P)) ≥ s, and radius(Ci) is minimum,
then assign b to Ci
(5) otherwise assign b to a new cluster Ci+1
(6) Stop if no more clusters are obtained.
(7) For each cluster Ci, if radius(Ci) ≤ r, then form a bit-centroid of Ci
(8) otherwise, invoke Bitmap-based Divide-and-Conquer Clustering
(Ci, p, s, r)
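Steps (1) and (4) rely on the popular-bitset P and on proj(). A minimal sketch of both follows, assuming pop(b) is the share of rows with a 1 in position b and sim is Hamming agreement on the projected bits. The function names and the 0-based positions are illustrative choices, not taken from the slides.

```python
from typing import List


def popular_bits(rows: List[str], p: float) -> List[int]:
    """0-based positions whose share of 1-bits across all rows is at least p."""
    n = len(rows)
    return [
        i for i in range(len(rows[0]))
        if sum(r[i] == "1" for r in rows) / n >= p
    ]


def proj(bitmap: str, positions: List[int]) -> str:
    """Projected bitmap: keep only the bits at the given positions."""
    return "".join(bitmap[i] for i in positions)


def sim_on_popular(c: str, b: str, positions: List[int]) -> float:
    """Hamming agreement between two bitmaps, restricted to the popular bits."""
    pc, pb = proj(c, positions), proj(b, positions)
    return sum(x == y for x, y in zip(pc, pb)) / len(positions)


rows = ["01011100", "11011000", "01111111"]
P = popular_bits(rows, 0.61)
print(P, [proj(r, P) for r in rows])
```

Comparing rows only on P shortens every similarity computation, which is the point of projecting before clustering.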
7
Divide-and-Conquer k-Min
ALGORITHM Bitmap-based Divide-and-Conquer Clustering with k-Min
INPUT bitmap index BI, popularity threshold p, similarity threshold s,
radius threshold r
OUTPUT a number of smaller bitmap indexes
METHOD
Let I be a set of integers
Let proj(B,I) be a projected bitmap from a bitmap B, where I denotes
positions in B
(1) For each bit b, if pop(b) ≥ p, then add b to the popular-bitset P
(2) Let the MIN number of bits out of P to satisfy the similarity
threshold s be n
(3) For each row b (∈ BI), if proj(b,P) ≥ n, then assign b to S
(4) otherwise assign b to U
(5) Stop if no more clusters are obtained.
(6) If radius(S) ≤ r, form a bit-wise centroid of S
(7) otherwise invoke Bitmap-based Divide-and-Conquer Clustering
(S, p, s, r)
(8) If radius(U) ≤ r, form a bit-wise centroid of U
(9) otherwise invoke Bitmap-based Divide-and-Conquer Clustering
(U, p, s, r)
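Steps (2) to (4) can be illustrated as follows. The sketch assumes that n is the similarity threshold times the number of popular bits, rounded up, and that "proj(b,P) ≥ n" means the projection holds at least n 1-bits. The sample rows are the popular-bit projections of the eight documents in the worked example later in the deck. The function names are hypothetical.

```python
import math
from typing import Dict, List, Tuple


def min_popular_ones(num_popular: int, s: float) -> int:
    """Smallest number of popular 1-bits that satisfies similarity threshold s."""
    return math.ceil(round(num_popular * s, 9))  # round() guards against float fuzz


def split_by_popular_ones(rows: Dict[int, str], n: int) -> Tuple[List[int], List[int]]:
    """Split document ids into S (at least n popular 1-bits) and U (the rest)."""
    S = [d for d, bits in rows.items() if bits.count("1") >= n]
    U = [d for d in rows if d not in S]
    return S, U


# Projections on the five popular bits, keyed by document number.
popular_rows = {1: "11111", 2: "10101", 3: "10111", 4: "11011",
                5: "11111", 6: "10111", 7: "11100", 8: "01000"}
n = min_popular_ones(5, 0.8)
print(n, split_by_popular_ones(popular_rows, n))
```

The split needs only one count per row and no comparison against a centroid, so no initial choice of centers is required.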
11
Scalable Bitmap Indexing for XML Document
Retrieval
  • Plain text-based document: (d, w)
  • Document d is a sequence of words w (d for
    document, w for word)
  • Vector space
  • XML document: (d, p, w)
  • Document d is a sequence of paths p, each of which
    contains a sequence of words w
  • Not yet sufficient to represent all necessary
    information, but simple enough to represent in
    bitmap indexing (i.e., BitCube)
  • Search-related information
  • frequency, structural info (topological info),
    reference info (dereference info), security info,
    etc.
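The (d, p, w) view can be made concrete with a very small sketch. It stores only the 1-bits of a conceptual three-dimensional bitmap as a set of triples; the class name `BitCube` is borrowed from the slide, but the layout and methods are assumptions for illustration, not the actual BitCube design.

```python
from typing import Iterable, List, Set, Tuple

Triple = Tuple[str, str, str]  # (document, path, word)


class BitCube:
    """Sparse 3-D bitmap: a bit is 1 when word w occurs under path p in document d."""

    def __init__(self, triples: Iterable[Triple]) -> None:
        self.ones: Set[Triple] = set(triples)

    def bit(self, d: str, p: str, w: str) -> int:
        return int((d, p, w) in self.ones)

    def docs_with(self, p: str, w: str) -> List[str]:
        """Documents that contain word w under path p (a slice along the d axis)."""
        return sorted({d for d, pp, ww in self.ones if pp == p and ww == w})


cube = BitCube([("d1", "p1", "w1"), ("d1", "p2", "w2"), ("d2", "p1", "w2"),
                ("d3", "p3", "w1"), ("d1", "p3", "w1")])
print(cube.bit("d1", "p1", "w1"), cube.docs_with("p3", "w1"))
```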

12
Scalable XML Document
  • Scalable XML Document: (d, p, w, f, t, r, s, )
  • Extension of XML Document (d, p, w)
  • Represents enough features of documents
  • Bitmap indexing requires large storage and
    becomes multidimensional.
  • Questions
  • How to use a BitCube to represent all the
    features?
  • How to query?

14
Related Work Clustering
  • Similarity-based clustering
  • Jaccard's Coefficient
  • Dice's Coefficient
  • Vector-Space Model
  • Generalized Cosine-Similarity Measure
  • Optimistic Genealogy Measure

15
Similarity Measures
  • The Set/Bag Model: Let X and Y be two collections
    of XML documents
  • Jaccard's Coefficient
  • Dice's Coefficient
  • The Vector-Space Model: Cosine-Similarity Measure
    (CSM)
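The formulas for these measures did not survive in the transcript. As a reminder, the standard textbook definitions on sets are Jaccard = |X∩Y| / |X∪Y|, Dice = 2|X∩Y| / (|X| + |Y|), and, for binary vectors, cosine = |X∩Y| / sqrt(|X|·|Y|). The sketch below applies them to the 1-bit positions of two bitmaps; the helper names are illustrative, and the slides may define the measures over different objects.

```python
import math
from typing import Set


def ones(bitmap: str) -> Set[int]:
    """Set of positions that hold a 1-bit."""
    return {i for i, c in enumerate(bitmap) if c == "1"}


def jaccard(x: Set[int], y: Set[int]) -> float:
    return len(x & y) / len(x | y)


def dice(x: Set[int], y: Set[int]) -> float:
    return 2 * len(x & y) / (len(x) + len(y))


def cosine(x: Set[int], y: Set[int]) -> float:
    """Cosine similarity of the two binary indicator vectors."""
    return len(x & y) / math.sqrt(len(x) * len(y))


x, y = ones("01011100"), ones("11011000")
print(jaccard(x, y), dice(x, y), cosine(x, y))
```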

16
Similarity Measures (2)
  • The Generalized Cosine-Similarity Measure (GCSM)
    Let X and Y be vectors and
  • where
  • Hierarchical Model
  • Why only for depth?

17
Related Work Clustering
  • Graph-based clustering
  • For an XML document collection C, the s-Graph
    sg(C) = (N, E) is a directed graph such that N is
    the set of all the elements and attributes in the
    documents in C, and (a, b) ∈ E if and only if a is
    a parent element of b in some document(s) in C (b
    can be an element or an attribute).
  • For two sets, C1 and C2, of XML documents, the
    distance between them is defined on their
    s-Graphs, where |sg(Ci)| is the number of edges
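The distance formula itself is missing from the transcript. The sketch below assumes one commonly used form, 1 - |E1 ∩ E2| / max(|E1|, |E2|), where E1 and E2 are the edge sets of the two s-Graphs; treat the formula as an assumption rather than the slide's definition. Paths are written with "/" separators, and the function names are hypothetical.

```python
from typing import Iterable, Set, Tuple

Edge = Tuple[str, str]  # (parent, child)


def s_graph_edges(paths: Iterable[str]) -> Set[Edge]:
    """Collect parent-child edges from root-to-node paths such as 'paper/title'."""
    edges: Set[Edge] = set()
    for path in paths:
        steps = path.split("/")
        edges.update(zip(steps, steps[1:]))
    return edges


def s_graph_distance(e1: Set[Edge], e2: Set[Edge]) -> float:
    """Assumed distance: 1 - shared edges / size of the larger edge set."""
    return 1 - len(e1 & e2) / max(len(e1), len(e2))


c1 = s_graph_edges(["paper/title", "paper/author/affiliate", "paper/section"])
c2 = s_graph_edges(["paper/title", "paper/section/section"])
print(s_graph_distance(c1, c2))
```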

18
Related Work Clustering
  • Matrix: singular value decomposition
  • → compare with BitClustering
  • D → Dk

19
Is it possible to do better?
  • Use Dk instead of D → calculate Dk^T q rather than
    D^T q
  • Dk is defined as the k-truncated SVD of D
  • Note that both D and Dk are w × d matrices
  • Note
  • U: matrix of left singular vectors, V: matrix of
    right singular vectors
  • U and V are orthogonal
  • Σ is real
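For reference, the standard form of the decomposition these bullets describe is given below. The symbols U_k, Σ_k, and V_k (the first k columns or singular values) are notation introduced here, not taken from the slides.

```latex
D = U \Sigma V^{T}, \qquad
D_k = U_k \Sigma_k V_k^{T}, \qquad
D_k^{T} q = V_k \Sigma_k U_k^{T} q
```

Here D is the w × d term-document matrix and q is a query vector of length w, so D_k^T q yields one score per document.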

20
Singular Value Decomposition
21
Using D-Hihat
  • Term-term comparison
  • Document-document comparison
  • Document-term comparison

22
Related Work Clustering
  • Structure
  • Suffix tree clustering
  • → compare with BitClustering
  • Optimistic Genealogy Measure (ACM TIS 2003)
  • → compare with BitClustering

23
Streaming Data
  • IEEE TKDE, Vol. 15, No. 3, 2003
  • S. Guha, A. Meyerson, N. Mishra, R. Motwani, and
    L. O'Callaghan, "Clustering Data Streams: Theory
    and Practice"
  • G. Cormode, M. Datar, P. Indyk, and S.
    Muthukrishnan, "Comparing Data Streams Using
    Hamming Norms (How to Zero In)"
  • A. Gilbert, Y. Kotidis, S. Muthukrishnan, and M.
    Strauss, "One-Pass Wavelet Decompositions of Data
    Streams"
  • P. Tucker, D. Maier, T. Sheard, and L. Fegaras,
    "Exploiting Punctuation Semantics in Continuous
    Data Streams"

24
Data Streams
  • Characteristics
  • Arrives continuously in the form of a stream
  • Needs to be processed in an on-line fashion
  • Constraints
  • The time for processing each stream element must
    be small
  • The amount of memory available to the query
    processor is limited

25
Algorithms
  • Summarize data streams in a concise but
    reasonably accurate form
  • Sampling-based
  • Histograms and wavelet methods may be good for
    static data
  • Synopsis-based: keep a small summary and update it
  • One-pass algorithms
  • Obtaining median, quantiles, and other order
    statistics
  • Correlated aggregate queries, mining

26
Streaming Data Models
A. Gilbert, Y. Kotidis, S. Muthukrishnan, and M.
Strauss, One-Pass Wavelet Decompositions of Data
Streams, IEEE TKDE 15(3), pages 541-554.
  • a0..(N-1) ∈ Z
  • Ex) <(d1, p1, w1)>, <(d1, p2, w2)>, <(d2, p1, w2)>,
    <(d1, p1, w1)>, <(d3, p3, w1)>, <(d1, p3, w1)>
  • Cash-register model: items on domain values
    (contiguous, but not ordered)
  • Ex) <(d1, p1, w1, 2)>, <(d1, p2, w2, 1)>,
    <(d2, p1, w2, 1)>, <(d3, p3, w1, 1)>,
    <(d1, p3, w1, 1)>
  • Aggregate model: items on range values (no
    particular order)
  • Ex) <(d1, w1, 3)>, <(d1, w2, 1)>, <(d2, w2, 1)>,
    <(d3, w1, 1)>
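The two summarized forms of the example stream can be reproduced with plain counting. This is only a sketch of how the counts in the examples arise, under the reading that the cash-register form counts repeated (d, p, w) items and the aggregate form counts per (d, w) after dropping the path.

```python
from collections import Counter

# The raw stream from the first example, one (document, path, word) item at a time.
stream = [("d1", "p1", "w1"), ("d1", "p2", "w2"), ("d2", "p1", "w2"),
          ("d1", "p1", "w1"), ("d3", "p3", "w1"), ("d1", "p3", "w1")]

# Cash-register style: one running count per distinct (d, p, w) item.
cash_register = Counter(stream)

# Aggregate style: drop the path and count per (d, w).
aggregate = Counter((d, w) for d, _, w in stream)

print(cash_register[("d1", "p1", "w1")])  # count for the repeated item
print(dict(aggregate))
```

Both summaries are built in a single pass over the stream, with memory proportional to the number of distinct keys.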

27
Streaming Data Models
A. Gilbert, Y. Kotidis, S. Muthukrishnan, and M.
Strauss, One-Pass Wavelet Decompositions of Data
Streams, IEEE TKDE 15(3), pages 541-554.
  • a0..(N-1) ∈ Z
  • Ex) <(d1, p1, w1)>, <(d1, p2, w2)>, <(d2, p1, w2)>,
    <(d1, p1, w1)>, <(d3, p3, w1)>, <(d1, p3, w1)>
  • (aggregation of paths) If p2 < p3,
  • Ex) <(d1, p1, w1, 2)>, <(d1, p2, {w1, w2}, 2)>,
    <(d2, p1, w2, 1)>, <(d3, p3, w1, 1)>
  • Aggregation of words
  • Ex) <(d1, {w1, w2}, 4)>, <(d2, w2, 1)>,
    <(d3, w1, 1)>

29
BitClustering High-Speed and Context-based
Clustering for XML Documents
30
Problem with Traditional Clustering
  • All ePaths are treated as equally significant.
  • Clustering is based either on content only or on
    structure (hierarchy) only
  • Simple bitmap approach: a flat bitmap
  • Our approach
  • The more used, the more significant
  • Clustering for both content and structure
  • Complex → Pipelined Bitmap Index

31
Why clustering should take the structural
information into account
  • Parse Tree (Document Tree)

e1
e2 e3 e4
e5 e5 e6
e7 e7
Example paper (title, section (para, section),
reference (paper))
  • The same word in the Introduction section is
    different from the one in the Conclusion section.
  • Tree contained

32
Significance-based Clustering
  • It is likely that not all ePaths are equally
    significant.
  • Ex) e1 = order.item, e2 = order.payment.card_number
  • e1 and e2 are not treated equally and uniformly.

<order> <item>CD <description>Compact
disk</description> <price>9.99</price>
<quantity>5</quantity> </item> <item>DVD
<description>popular appliance product</description>
<color>silver</color>
<price>150.00</price> <quantity>1</quantity>
</item> <due>199.95</due> <payment>
<method>credit card</method>
<card_number>12345</card_number>
</payment> </order>
  • Clearly, for the same reason, column-wise
    security may be popular (Oracle9i)

33
Encoding
Word encoding:
appliance 0, card 1, CD 2, check 3,
Compact 4, credit 5, disk 6, DVD 7,
Popular 8, product 9, silver 10, tomato 11,
TV 12, white 13, o1 14, o2 15
Path encoding:
order/@customer 0
order/item 1
order/item/description 2
order/item/color 3
order/item/price 4
order/item/quantity 5
order/due 6
order/payment/method 7
order/payment/card_number 8
  • Pairs (Path, Value)

<order customer="o1"> <item>CD
<description>Compact disk</description>
<price>9.99</price> <quantity>5</quantity>
</item> <item>DVD <description>popular
appliance product</description>
<color>silver</color> <price>150.00</price>
<quantity>1</quantity> </item>
<due>199.95</due> <payment> <method>credit
card</method> <card_number>12345</card_number>
</payment> </order>
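(Path, Value) pairs: 0,14 0,15 1,2 1,7 1,10 1,12 2,0
2,4 2,6 2,8 2,9 3,10 3,13 7,3 7,5
7,6
order1.xml order2.xml order3.xml
Bit values (in transcript order; the row and column layout of the
original figure is not recoverable):
1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 1 0 0 0 0
0 0 0 0 1 0 1 1 1 0 0 0 0 0 1 0 0 1 1 1 0 1 0 0
order1.xml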
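The encoding step can be sketched with the standard library XML parser. The two lookup tables are copied from the slide; matching words case-insensitively, skipping tokens that have no code (prices, quantities, card numbers), and the function name `encode` are assumptions made for this illustration. The resulting pairs need not coincide exactly with the list on the slide.

```python
import xml.etree.ElementTree as ET
from typing import List, Set, Tuple

WORDS = {w.lower(): i for i, w in enumerate(
    "appliance card CD check Compact credit disk DVD Popular product "
    "silver tomato TV white o1 o2".split())}
PATHS = {p: i for i, p in enumerate([
    "order/@customer", "order/item", "order/item/description",
    "order/item/color", "order/item/price", "order/item/quantity",
    "order/due", "order/payment/method", "order/payment/card_number"])}


def encode(xml_text: str) -> List[Tuple[int, int]]:
    """Return the sorted (path id, word id) pairs found in one document."""
    pairs: Set[Tuple[int, int]] = set()

    def add(path: str, text: str) -> None:
        for token in text.split():
            if path in PATHS and token.lower() in WORDS:
                pairs.add((PATHS[path], WORDS[token.lower()]))

    def visit(elem: ET.Element, path: str) -> None:
        for name, value in elem.attrib.items():
            add(f"{path}/@{name}", value)   # attributes become @-paths
        add(path, elem.text or "")          # text directly inside the element
        for child in elem:
            visit(child, f"{path}/{child.tag}")

    root = ET.fromstring(xml_text)
    visit(root, root.tag)
    return sorted(pairs)


ORDER1 = """<order customer="o1"> <item>CD <description>Compact disk</description>
<price>9.99</price> <quantity>5</quantity> </item> <item>DVD
<description>popular appliance product</description> <color>silver</color>
<price>150.00</price> <quantity>1</quantity> </item> <due>199.95</due>
<payment> <method>credit card</method> <card_number>12345</card_number>
</payment> </order>"""
print(encode(ORDER1))
```

Each pair then corresponds to one row of the bitmap index, with one bit per document.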
39
and n denotes the number of 1s in the path
from the root to a node.
40
Similarity Measures (3)
  • The Optimistic Genealogy Measure (OGM) Let C1
    and C2 be collections of XML documents

41
Significances of Paths
  • Prioritization based on
  • Popularity of paths in usage
  • Significance of content
  • Presetting by user definition

42
Traditional
<paper> <title>economy</title> <author>W1
<email>W3</email> </author>
<section>W3 <section>W4</section>
<figure>W3</figure> </section> </paper>
<paper> <title>economy</title> <section>W3
<section> <section>W3</section>
</section> <figure>W3</figure>
</section> </paper>
<paper> <title>economy</title> <author>
<affiliate>W2</affiliate> </author>
<section>W3 <section>W4
<figure>W3</figure> </section>
</section> </paper>
<paper> <title>economy</title> <author>W1
</author> <section> <section>W4
<section>W3</section> </section>
<figure>W3</figure> </section> </paper>
<paper> <title>economy</title> <author>W1
<email>W3</email> </author>
<section>W3 <section>W4</section>
<figure>W3</figure> </section> </paper>
<paper> <title>economy</title> <author>W1
</author> <section>W3 <section>
<section>W3</section> </section>
</section> </paper>
<paper> <title>economy</title> <section>W3
<section>W4 <section>W3</section>
</section> <figure>W3</figure>
</section> </paper>
<paper> <author>W1 <affiliate>W2</affiliate>
<email>W3</email> </author> </paper>
Bit positions: 1 title, 2 author, 3 affiliate, 4 email, 5 section,
6 subsection, 7 subsubsection, 8 figure
The don't-care operation (*-operation) can be
implemented with 2 bitmaps: 1*101 → 10101 OR 11101.
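The don't-care idea can be sketched in two ways: by expanding a pattern into the concrete bitmaps it stands for, or by matching with a pair of bitmaps. The sketch assumes "*" marks a don't-care position. Reading the "2 bitmaps" as a care mask plus a value bitmap is one plausible interpretation, not something the slide spells out, and the function names are hypothetical.

```python
from itertools import product
from typing import List, Tuple


def expand(pattern: str) -> List[str]:
    """All concrete bitmaps covered by a pattern that uses '*' as don't-care."""
    slots = [("0", "1") if c == "*" else (c,) for c in pattern]
    return ["".join(bits) for bits in product(*slots)]


def to_masks(pattern: str) -> Tuple[int, int]:
    """Two bitmaps: which positions matter, and the values required there."""
    care = int("".join("0" if c == "*" else "1" for c in pattern), 2)
    value = int(pattern.replace("*", "0"), 2)
    return care, value


def matches(pattern: str, bitmap: str) -> bool:
    care, value = to_masks(pattern)
    return int(bitmap, 2) & care == value


print(expand("1*101"))
print(matches("1*101", "11101"), matches("1*101", "10100"))
```

The mask form needs one AND and one comparison per row, whereas the expansion doubles with every don't-care position.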
45
K-Means with non-object Centroid
      d1     d2     d3
d4  14/25  12/25   7/25
d5  15/25  13/25  10/25
d6  20/25  22/25  19/25
d7  21/25  21/25  20/25
d8  24/25  22/25  17/25
d2, d6 | d1, d7, d8 | d3 d4 d5
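One way to read the table is as similarities between the remaining documents (rows) and three cluster centers (columns). The sketch below assumes that reading, assigns each row to its most similar center when the similarity reaches 0.8 (the similarity threshold stated on the following slide), and leaves it unassigned otherwise. The function name and the tie-breaking rule (first maximum wins) are assumptions.

```python
from fractions import Fraction
from typing import Dict, List, Tuple

CENTERS = ["d1", "d2", "d3"]
# Numerators over 25, copied from the table.
SIM = {"d4": [14, 12, 7], "d5": [15, 13, 10], "d6": [20, 22, 19],
       "d7": [21, 21, 20], "d8": [24, 22, 17]}


def assign(sim: Dict[str, List[int]], centers: List[str],
           threshold: Fraction) -> Tuple[Dict[str, List[str]], List[str]]:
    clusters = {c: [c] for c in centers}
    unassigned: List[str] = []
    for doc, scores in sim.items():
        best = max(range(len(centers)), key=lambda i: scores[i])  # first max wins ties
        if Fraction(scores[best], 25) >= threshold:
            clusters[centers[best]].append(doc)
        else:
            unassigned.append(doc)
    return clusters, unassigned


print(assign(SIM, CENTERS, Fraction(4, 5)))
```

Under these assumptions d6 joins d2, d7 and d8 join d1, and d4 and d5 stay unassigned, which is compatible with the grouping shown under the table.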
46
The similarity threshold is 0.8, the popularity
threshold is 0.5, and the radius threshold is 0.8.
Aggregation
Pop bitset 1,4,5,11,14,1519
47
System Architecture
Identification of Popular Bits
Form Bit-Centroids
48
Traditional
<paper> <title>economy</title> <author>W1
<email>W3</email> </author>
<section>W3 <section>W4
<section>W3</section> </section>
<figure>W3</figure> </section> </paper>
<paper> <title>economy</title> <author>W1
W3</author> <section>W3 <section>W4
W3</section> <figure>W3</figure>
</section> </paper>
<paper> <title>W6</title> <author>W7
<affiliate>W2</affiliate>
<email>W8</email> </author> <section>W3
<section>W4 <section>W3</section>
</section> <figure>W3</figure>
</section> </paper>
(d1) (d2)
(d3)
<paper> <title>religion</title> <author>W1
<email>W3</email> </author>
<section>W17 <section>W18
<section>W19</section> </section>
<figure>W20</figure> </section> </paper>
<paper> <title>economy</title> <author>W1
<email>W3</email> </author>
<section>W3 <section>W4
<section>W3</section> </section>
<figure>W3</figure> </section> </paper>
Bit positions: 1 title, 2 author, 3 affiliate, 4 email, 5 section,
6 subsection, 7 subsubsection, 8 figure
d1 and d5 are the same in structure
d1 and d2 are the same in content
d1 and d3 are similar in both structure and content
d1 and d3 are the same in content and structure on weighted-XPath
d1 and d4 are similar in content on weighted-XPath
(d4) (d5)
49
Traditional
popularity threshold 0.61, unpopularity threshold 0.3,
similarity threshold 0.8, dissimilarity threshold 0.2
centerbit:
11011101 for 1,5 → similar if 5 bits out of 6 1-bits are the same
1000111 for 2,6 → similar if 4 bits out of 4 1-bits are the same
1111 for 3,5,6 → similar if 4 bits out of 5 1-bits are the same
1111 for 4,5,6 → similar if 4 bits out of 5 1-bits are the same
1,3,4,5 2,6 7 8
The don't-care operation (*-operation) can be
implemented with 2 bitmaps: 1*101 → 10101 OR 11101.
50
A
For 8 data: 8 × 0.61 = 4.88 → 5; 8 × 0.3 = 2.4 → 2
For 5 popular bits: 5 × 0.8 = 4.0 → 4; 5 × 0.2 = 1.0 → 1
popularity threshold 0.61, unpopularity threshold 0.3,
similarity threshold 0.8, dissimilarity threshold 0.2
Distance < 1 (similar), 1 < x < 4 (mid-similar), > 4 (dissimilar)
centerbit:
111 for similar documents on popular bits
110 for mid-similar documents on popular bits
01000111 for dissimilar documents on popular bits
centerbit for original array:
111 for similar documents
1101 for mid-similar documents
01110000 for dissimilar documents
51
B
For 8 data: 8 × 0.61 = 4.88 → 5; 8 × 0.3 = 2.4 → 2
For 5 popular bits: 7 × 0.8 = 5.6 → 6; 7 × 0.2 = 1.4 → 1
popularity threshold 0.61, unpopularity threshold 0.3,
similarity threshold 0.8, dissimilarity threshold 0.2
Distance < 1 (similar), 2 < x < 5 (mid-similar), > 6 (dissimilar)
Documents in original order (bit order 12568 47 3):
Doc  12568 47 3
1    11111 10 0
2    10101 01 0
3    10111 00 1
4    11011 01 0
5    11111 10 0
6    10111 01 0
7    11100 01 0
8    01000 10 1
Documents reordered (bit order 12568 47 3):
Doc  12568 47 3
1    11111 10 0
5    11111 10 0
4    11011 01 0
6    10111 01 0
2    10101 01 0
3    10111 00 1
7    11100 01 0
8    01000 10 1
centerbit 11111100 for similar documents on
popular bits for mid-similar documents
on popular bits
centerbit for original array 11011101 for
similar documents for mid-similar
documents
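The bit grouping 12568 | 47 | 3 used in this example can be reproduced by counting 1-bits per column. The sketch assumes a bit is popular when its count reaches the popularity threshold times the number of documents, rounded up (8 × 0.61 → 5), and unpopular when its count is at most the unpopularity threshold times the number of documents, rounded down (8 × 0.3 → 2). The rows are the eight documents in the slide's column order, and `classify_bits` is a hypothetical name.

```python
import math
from typing import Dict, List

# Rows for documents 1..8, with columns in the slide's order 1 2 5 6 8 | 4 7 | 3.
ROWS = ["11111100", "10101010", "10111001", "11011010",
        "11111100", "10111010", "11100010", "01000101"]
BIT_LABELS = [1, 2, 5, 6, 8, 4, 7, 3]


def classify_bits(rows: List[str], labels: List[int],
                  pop: float, unpop: float) -> Dict[str, List[int]]:
    n = len(rows)
    hi = math.ceil(round(n * pop, 9))     # minimum count for a popular bit
    lo = math.floor(round(n * unpop, 9))  # maximum count for an unpopular bit
    groups: Dict[str, List[int]] = {"popular": [], "mid-popular": [], "unpopular": []}
    for col, label in enumerate(labels):
        count = sum(r[col] == "1" for r in rows)
        if count >= hi:
            groups["popular"].append(label)
        elif count <= lo:
            groups["unpopular"].append(label)
        else:
            groups["mid-popular"].append(label)
    return groups


print(classify_bits(ROWS, BIT_LABELS, 0.61, 0.3))
```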
52
A
popularity threshold 0.61 → 5 × 0.61 = 3.05 → 4
unpopularity threshold 0.3 → 5 × 0.3 = 1.5 → 1
similarity threshold 0.8 → 5 × 0.8 = 4
dissimilarity threshold 0.2 → 5 × 0.2 = 1
diameter threshold 0.8
centerbit:
111 for similar documents on popular bits
110 for mid-similar documents on popular bits
01000111 for dissimilar documents on popular bits
Since the diameter for popular documents is 3/5 < 0.8, more clustering
Doc  12568 47 3
1    11111 10 0
3    10111 00 1
4    11011 01 0
5    11111 10 0
6    10111 01 0
2    10101 01 0
7    11100 01 0
8    01000 10 1
Distance < 1 (similar), 1 < x < 3 (mid-similar)
centerbit for original array:
111 for popular documents
1101 for mid-popular documents
01110010 for mid-popular documents
Bit groups: popular bits | mid-popular bits | unpopular bits
53
Documents 1 2 3 4 5 6 7 8:
Doc  12568 47 3
1    11111 10 0
2    10101 01 0
3    10111 00 1
4    11011 01 0
5    11111 10 0
6    10111 01 0
7    11100 01 0
8    01000 10 1
Documents 1 3 4 5 6 2 7 8:
Doc  12568 47 3
1    11111 10 0
3    10111 00 1
4    11011 01 0
5    11111 10 0
6    10111 01 0
2    10101 01 0
7    11100 01 0
8    01000 10 1
Documents 1 3 4 5 6:
Doc  12568 47 3
1    11111 10 0
3    10111 00 1
4    11011 01 0
5    11111 10 0
6    10111 00 0
54
For 6 data: 6 × 0.61 = 3.66 → 4; 6 × 0.3 = 1.8 → 1
For 5 popular bits: 4 × 0.8 = 3.2 → 4; 4 × 0.2 = 0.8 → 0
B
popularity threshold 0.61, unpopularity threshold 0.3,
similarity threshold 0.8, dissimilarity threshold 0.2,
diameter threshold 0.8
1,5,2,6, 3,4,7,8
Distance < 0 (similar), 1 < x < 3 (mid-similar), > 4 (dissimilar)
Doc  12568 47 3
1    11111 10 0
5    11111 10 0
4    11011 01 0
6    10111 01 0
2    10101 01 0
3    10111 00 1
7    11100 01 0
8    01000 10 1
Doc  1587 2364
6    1111 0010
2    1111 0000
3    1110 0110
4    1011 1010
7    1101 1000
8    0000 1101
centerbit 11111100 for similar documents on
popular bits for mid-similar documents
on popular bits
centerbit for original array 11011101 for
similar documents for mid-similar
documents
55
A
If popular bits are considered: 1,3,5,6, 4, 2,7, 8
Distance < 1 (similar), 1 < x < 3 (mid-similar)
Doc  1568 247 3
1    1111 110 0
3    1111 000 1
5    1111 110 0
6    1111 000 0
4    1011 101 0
Bit groups: popular bits | mid-popular bits | unpopular bits
centerbit for original array:
111 for popular documents
1101 for mid-popular documents
01110010 for mid-popular documents
56
Coverage
  • cover(x) is a set of objects that satisfy x.
  • cover(x) = { o | o satisfies x }
  • y is x if cover(x) ? cover(y)

57
Similarity in both Structure and Content
58
Bitmap index for paths
Bitmap index for pairs (path, value)
59
Semantics in Hierarchies
  • Topologies
  • Order in sibling
  • Order in depth (not just the depth number in ACM
    TIS 03)
  • Ex) toxics in pharmacy vs. toxics in weapon

60
Encoding Hierarchies
Path encoding:
order/@customer 0
order/item 1
order/item/description 2
order/item/color 3
order/item/price 4
order/item/quantity 5
order/due 6
order/payment/method 7
order/payment/card_number 8
  • Pairs (Path, Value)

<order customer="o1"> <item>CD
<description>Compact disk</description>
<price>9.99</price> <quantity>5</quantity>
</item> <item>DVD <description>popular
appliance product</description>
<color>silver</color> <price>150.00</price>
<quantity>1</quantity> </item>
<due>199.95</due> <payment> <method>credit
card</method> <card_number>12345</card_number>
</payment> </order>
Paths: 0 1 2 3 4 5 6 7 8
order1.xml order2.xml order3.xml
Bit values (in transcript order; the row and column layout of the
original figure is not recoverable):
1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 0 1 1 1 1 1 1
(Path, Value) pairs: 0,14 0,15 1,2 1,7 1,10 1,12 2,0
2,4 2,6 2,8 2,9 3,10 3,13 7,3 7,5
7,6
order1.xml order2.xml order3.xml
Bit values (in transcript order; layout not recoverable):
1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 1 0 0 0 0
0 0 0 0 1 0 1 1 1 0 0 0 0 0 1 0 0 1 1 1 0 1 0 0
order1.xml
61
root node
Tree-driven Bitmap Index
Vi(n1)

overflow node
leaf node
to Bitmap Index
62
Tree-driven Bitmap Index
Pi1 < Vj1
Vj2 < Pi1 < Vj3
Vj1 < Pi1 < Vj2
leaf node
Pi2.Pk1 < Vp1
Vp1 < Pi2.Pk1 < Vp2
doc1 doc2 . . dock
Bit values (in transcript order; layout not recoverable):
1 0 1 0 0 0 1 0 0 0 0 0 1 0 .. 0 0 0 1 0 1 0 1 0 0 0 0 0 1 0 ..
0 1 0 1 0 1 0 1 0 0 0 0 0 0 0 .. 1
63
Incremental
123456789
New documents:
Doc  Bitmap
9    11011101
10   11001111
11   11010000
12   01010001
13   01110111
centrobit for original:
11011101 for 1,5 ← 9
100111 for 4,6 ← 10
110000 for 2,7 ← 11
10101101 for 3
01110000 for 8 ← 12
011000 for 8,12, diameter (2/8 = 0.25) ??
01110111 for 13
64
Types of Incremental
  • Inserted into an existing cluster
  • Inserted into an existing cluster that can in
    turn be modified to a new cluster
  • Created a new cluster

65
Procedures
  • Consider database h, which consists of bits b and
    objects o.
  • Compute pop(b) → labeling bits.
  • Compute sim(o) → grouping objects.
  • Set groups g. Compute center(g) → verify groups.
  • If diameter(g) > diameter_threshold
  • If g < h
  • Then set database g (with corresponding b and o),
    and redo from 1.
  • If g ? h
  • Then relax labeling bits
  • By setting a to 1 at a time
  • Redo from 3.
  • Else, stop.

66
Another Relaxation
  • If members(g) > member_threshold and diameter(g)
    ? diameter_threshold, stop
  • Set database g (with corresponding b and o), and
    redo from 1.

67
Performance
  • Fast → grouping without parsing all objects
  • Fast → 1-pass computation of the center, not by
    picking or generating a center object
  • Flexible →
  • Incremental →
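A one-pass center can be sketched with running per-bit counts, so that no member object has to be picked or revisited. The details are assumptions: the center is taken as the bitwise majority, and the spread is measured as the share of bit positions on which members disagree, which is only loosely modeled on the diameter figures such as 2/8 that appear in the worked examples. The class name `RunningCentroid` is hypothetical.

```python
from typing import List


class RunningCentroid:
    """Per-bit 1-counts for one cluster, updated as rows arrive."""

    def __init__(self, width: int) -> None:
        self.counts: List[int] = [0] * width
        self.n = 0

    def add(self, bitmap: str) -> None:
        """Fold one row into the summary; cost is proportional to the row width."""
        self.n += 1
        for i, c in enumerate(bitmap):
            self.counts[i] += c == "1"

    def center(self) -> str:
        """Bitwise majority of the rows seen so far (ties give 1)."""
        return "".join("1" if 2 * c >= self.n else "0" for c in self.counts)

    def disagreement(self) -> float:
        """Share of positions on which the rows do not all agree."""
        return sum(0 < c < self.n for c in self.counts) / len(self.counts)


rc = RunningCentroid(8)
for row in ["01011100", "11011000", "01111111"]:
    rc.add(row)
print(rc.center(), rc.disagreement())
```

Because only the counts are kept, adding a document later (the incremental case) costs the same as adding it during the first pass.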

69
If both popular and mid-popular bits are
considered: 1,5, 3,6, 4, 2,7, 8
Distance < 1 (similar), 2 < x < 5 (mid-similar), > (dissimilar)
Doc  1568 247 3
1    1111 110 0
5    1111 110 0
3    1111 000 1
6    1111 000 0
4    1011 101 0
Bit groups: popular bits | mid-popular bits | unpopular bits
centerbit for original array:
111 for popular documents
1101 for mid-popular documents
01110010 for mid-popular documents
71
popularity threshold 0.8, similarity threshold 0.8
40 bits → (12 + 12 + 5 = 29) bits
72
45 bits → (12 + 6 + 4 + 5 = 27) bits
74
Query Plan
75
wrong
1 2 3 4 5 6 7 8
1 2 3 4 5 6 7 8
11000000 11000000 10110000 10001000 10010000 10001010
10000110 10000011
101000000 100010001 110000100 100100001 100001000
110000101 100010010 100100101 01010101
2 2 3 2 2 3 3 3
1 2 3 4 5 6 7 8
2 2 3 2 2 3 3 3
1 2 3 4 5 6 7 8
8 2 1 2 2 1 3 1
8 2 1 2 2 1 3 1
76
Doc  Bitmap    1s in row
1    11000000  2
2    11000000  2
3    10110000  3
4    10001000  2
5    10010000  2
6    10001010  3
7    10000110  3
8    10000011  3
1s per bit position (1 to 8): 8 2 1 2 2 1 3 1
78
What if both popular and mid-popular bits are
considered from the begining
79
popularity threshold 0.61, unpopularity threshold 0.3,
similarity threshold 0.8, dissimilarity threshold 0.2,
diameter threshold 0.3
Distance < 2 (similar), 3 < x < 5 (mid-similar), > 6 (dissimilar)
Doc  12568 47 3
1    11111 10 0
4    11011 01 0
5    11111 10 0
6    10111 01 0
2    10101 01 0
3    10111 00 1
7    11100 01 0
8    01000 10 1
centrobit:
111- for similar documents on pop 1,4,5,6, diameter (5/8 = 0.63)
- for similar documents on mid-pop 2,3,7,8, diameter (8/8 = 1)
80
popularity threshold 0.61, unpopularity threshold 0.3,
similarity threshold 0.8, dissimilarity threshold 0.15,
diameter threshold 0.3
0.61 × 4 docs = 2.44: if > 3, pop
0.331.2: if < 1, unpop
0.8 × 7 bits = 5.6: if 1s > 6, sim
0.15 × 7 = 1.05: if 1s < 1, unsim
0.8 × 6 bits = 4.8: if 1s > 5, sim
0.15 × 6 = 0.9: if 4 > 1s > 1, mid-sim
Distance and centrobit (bit order 12568 47 3):
1s > 6 (similar): 1111110 (diameter = 0)
2 < 1s < 5 (mid-sim): 11101 (diameter = (8-6)/8 = 0.25)
1s < 1 (dissimilar)
Doc  12568 47
1    11111 10
5    11111 10
4    11011 01
6    10111 01
Distance and centrobit (bit order 15 8273 46):
1s > 5 (similar)
1 < 1s < 4 (mid-sim): (diameter = (8-0)/8 = 1)
1s < 1 (dissimilar)
Doc  15 8273 46
2    11 1010 00
3    11 1001 01
7    11 0110 00
8    00 0101 10
Not recommendable!
81
popularity threshold 0.61, unpopularity threshold 0.3,
similarity threshold 0.8, dissimilarity threshold 0.15,
diameter threshold 0.3
0.61 × 4 docs = 2.44: if > 3, pop
0.331.2: if < 1, unpop
0.8 × 7 bits = 5.6: if 1s > 6, sim
0.15 × 7 = 1.05: if 1s < 1, unsim
0.8 × 2 bits = 1.6: if 1s > 2, sim
0.15 × 2 = 0.3: if 1s < 1, mid-sim
Distance and centrobit (bit order 15 8273 46):
1s > 2 (similar): 110 (diameter = 5/8 = 0.62)
1s < 1 (mid-sim): 00010110 (diameter = (8-8)/8 = 0)
1s < 1 (dissimilar)
Doc  15 8273 46
2    11 1010 00
3    11 1001 01
7    11 0110 00
8    00 0101 10
centrobit:
1111110 for 1,5
11101 for 4,6
110 for 2,3,7
00010110 for 8
centrobit for original:
11011101 for 1,5
100111 for 4,6
101 for 2,3,7
01110000 for 8
82
popularity threshold 0.61, unpopularity threshold 0.3,
similarity threshold 0.8, dissimilarity threshold 0.15,
diameter threshold 0.3
0.61 × 4 docs = 2.44: if > 3, pop
0.331.2: if < 1, unpop
0.8 × 7 bits = 5.6: if 1s > 6, sim
0.15 × 7 = 1.05: if 1s < 1, unsim
0.8 × 2 bits = 1.6: if 1s > 2, sim
0.15 × 2 = 0.3: if 1s < 1, mid-sim
Centrobit: 1110 (diameter = 4/8 = 0.5) → 11100 (diameter = 3/8 = 0.37);
11011000 (diameter = 0/8 = 0)
Doc  15 8273 46
2    11 1010 00
3    11 1001 01
7    11 0110 00
Centrobit: 111100101 (diameter = 0/8 = 0); 1110 (diameter = 4/8 = 0.5)
→ 11000 (diameter = 2/8 = 0.25)
Doc  15 8273 46
3    11 1001 01
2    11 1010 00
7    11 0110 00
centrobit:
1111110 for 1,5
11101 for 4,6
110 for 2,3,7 → centers 1110, 1110, 1110, 1110, 1101
00010110 for 8
centrobit for original:
11011101 for 1,5
100111 for 4,6
110000 for 2,7
10101101 for 3
01110000 for 8
85
p1 p2 p3 p4 p5 p6 p7 p8
d1 d2 d3
? 0 0 2 2 2 0 0 7
Trees of BitCube
Tree mask vector
p1 = e1.e2, p2 = e1.e3, p3 = e1.e3.e5,
p4 = e1.e3.e6, p5 = e1.e3.e7.e8, p6 = e1.e4.e9,
p7 = e1.e4.e10, p8 = e1.e4.e10.e11
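One possible reading of the tree mask vector is that entry i holds the number of the nearest listed path that is a proper prefix (ancestor) of path pi, or 0 when none is listed. This interpretation is an assumption, not something the slide states, but applied to the eight paths above it reproduces the vector 0 0 2 2 2 0 0 7. The function name is hypothetical.

```python
from typing import Dict, List

PATHS: Dict[int, str] = {
    1: "e1.e2", 2: "e1.e3", 3: "e1.e3.e5", 4: "e1.e3.e6",
    5: "e1.e3.e7.e8", 6: "e1.e4.e9", 7: "e1.e4.e10", 8: "e1.e4.e10.e11",
}


def tree_mask(paths: Dict[int, str]) -> List[int]:
    """For each path, the number of its nearest listed ancestor path (0 if none)."""
    index = {p: i for i, p in paths.items()}
    mask: List[int] = []
    for i in sorted(paths):
        steps = paths[i].split(".")
        parent = 0
        # Try the longest proper prefix first, then shorter ones.
        for cut in range(len(steps) - 1, 0, -1):
            prefix = ".".join(steps[:cut])
            if prefix in index:
                parent = index[prefix]
                break
        mask.append(parent)
    return mask


print(tree_mask(PATHS))
```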
87
? 0 0 2 2 2 0 0 7
Trees of d1, d2, d3
Tree mask vector
88
? 0 0 2 2 2 0 0 7
Trees of d1, d2, d3
Tree mask vector
e1
e2 e3 e3
e5 e5 e6 e6
(d4)