Efficient Processing of Updates in Dynamic XML Data

1
Efficient Processing of Updates in Dynamic XML
Data
  • Changqing Li, Tok Wang Ling, Min Hu

2
Outline
  • Background and related work
  • Our proposals
  • Lexicographical order
  • A compact dynamic binary string encoding (CDBS)
  • Applying CDBS to different labeling schemes for
    update processing
  • Experimental evaluation
  • Conclusion

3
Background and related work: Labeling schemes
  • Three main categories of labeling schemes are used to
    process XML queries:
  • (1) Containment labeling scheme (Zhang et al.,
    SIGMOD01, etc.)
  • (2) Prefix labeling scheme (Tatarinov et al.,
    SIGMOD02, etc.)
  • (3) Prime number labeling scheme (Wu et al.,
    ICDE04)
  • In this talk, we focus on labeling schemes that
    process updates efficiently

4
(1) Containment scheme
  • Each node is assigned three values: start, end,
    and level
  • The relationships between nodes are determined
    from start, end, and level
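
The relationship tests on this slide can be sketched as follows. This is a minimal illustration of the usual interval-containment test, not code from the talk; the `Label` type, the function names, and the sample labels are all hypothetical.

```python
from typing import NamedTuple


class Label(NamedTuple):
    """Containment label: start, end, and level (names are illustrative)."""
    start: int
    end: int
    level: int


def is_ancestor(a: Label, d: Label) -> bool:
    # a is an ancestor of d when a's interval strictly contains d's interval.
    return a.start < d.start and d.end < a.end


def is_parent(p: Label, c: Label) -> bool:
    # A parent is an ancestor exactly one level above the child.
    return is_ancestor(p, c) and c.level == p.level + 1


def precedes(x: Label, y: Label) -> bool:
    # Document order follows the start values.
    return x.start < y.start


# Hypothetical labels for a small tree.
root = Label(1, 10, 1)
child = Label(2, 5, 2)
grandchild = Label(3, 4, 3)
other_child = Label(6, 9, 2)
```

With these sample labels, `is_ancestor(root, grandchild)` holds, `is_parent(root, grandchild)` does not, and `child` precedes `other_child` in document order.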

5
Containment is bad at processing updates
  • An insertion requires re-labeling all the ancestor
    nodes and all the nodes after the inserted node in
    document order

7
Existing approaches to processing updates in the
containment scheme
  • Increase the interval size and leave some values
    unused for future insertions (Li et al.,
    VLDB01)
  • When the unused values are used up, nodes have to
    be re-labeled
  • Use floating-point values (Amagasa et al., ICDE03)
  • A floating-point value is represented in a computer
    with a fixed number of bits
  • Because of the limited floating-point precision,
    nodes eventually have to be re-labeled
  • Neither approach can avoid re-labeling
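
The precision limit can be seen with a small experiment. The sketch below is an illustration only, not the method of the cited paper: it assumes new values are always taken as the midpoint of two neighbors and that each insertion lands next to the same left neighbor. The function name is hypothetical.

```python
def insertions_until_precision_runs_out(lo: float, hi: float) -> int:
    """Count midpoint insertions between lo and hi before doubles run out.

    Each new value is inserted right after lo, so the gap halves every time.
    """
    count = 0
    while True:
        mid = (lo + hi) / 2
        if mid == lo or mid == hi:
            # No representable value is left between the neighbors:
            # at this point existing nodes would have to be re-labeled.
            return count
        hi = mid
        count += 1


n = insertions_until_precision_runs_out(1.0, 2.0)
print(n)
```

With 64-bit floating-point numbers the loop stops after a few dozen insertions, which is why this approach only postpones re-labeling.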

8
(2) Prefix scheme
  • Three main prefix schemes:
  • DeweyID (Tatarinov et al., SIGMOD02)
  • BinaryString (Cohen et al., PODS02)
  • OrdPath (O'Neil et al., SIGMOD04)

9
DeweyID (Cont.)
  • The relationships between nodes are determined
    from the prefix property
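
A minimal sketch of the prefix property, assuming DeweyID labels are written as dot-separated integers such as 1.2.3. The function names and the sample labels are hypothetical.

```python
from typing import Tuple


def parse(label: str) -> Tuple[int, ...]:
    # Split into components so that "1.2" is not mistaken for a prefix of "1.20".
    return tuple(int(part) for part in label.split("."))


def is_ancestor(a: str, d: str) -> bool:
    # a is an ancestor of d when a's components are a proper prefix of d's.
    pa, pd = parse(a), parse(d)
    return len(pa) < len(pd) and pd[:len(pa)] == pa


def is_parent(p: str, c: str) -> bool:
    # The parent label is the child label without its last component.
    pp, pc = parse(p), parse(c)
    return len(pc) == len(pp) + 1 and pc[:-1] == pp


def is_sibling(x: str, y: str) -> bool:
    # Siblings share the same parent label and differ in the last component.
    px, py = parse(x), parse(y)
    return px != py and len(px) == len(py) and px[:-1] == py[:-1]
```

Comparing parsed components rather than raw strings is a deliberate choice here: a plain string-prefix test would wrongly treat 1.2 as an ancestor of 1.20.1.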

10
DeweyID is bad at processing order-sensitive updates
  • Order-sensitive updates: the document order must
    be maintained when updates are performed
  • An insertion requires re-labeling all the sibling
    nodes after the inserted node and all the
    descendants of these siblings

12
Existing approaches to processing updates in the
prefix scheme: OrdPath
  • OrdPath (O'Neil et al., SIGMOD04)
  • Similar to DeweyID
  • But initially only odd numbers are used

13
Existing approaches to processing updates in the
prefix scheme: OrdPath
  • OrdPath
  • Label of node a: -1
  • Label of node b: 4.1
  • Label of node c: 4.3
  • Label of node d: 4.2.1
  • They are siblings, but their labels look very
    different

(Figure: tree with node labels 1, 3, 5, 7, 3.1, 3.3, 7.1, 7.3 and nodes a, b, c, d)

14
(3) Prime number scheme (Wu et al., ICDE04)
  • Prime re-calculates the SC (simultaneous
    congruence) values to maintain the document order
    instead of re-labeling.
  • But this re-calculation is much more expensive.

15
Our CDBS encoding
  • (1) Lexicographical order
  • (2) Encoding
  • (3) Applications and processing of updates
  • (4) Experimental results

16
(1) Lexicographical order of binary string
  • Given two binary strings 0011 and 01, 0011 < 01
    lexicographically because the comparison is
    from left to right, and the 2nd bit of 0011 is
    0, while the 2nd bit of 01 is 1.
  • 0011 < 01
  • Given two binary strings 01 and 0101, 01 < 0101
    lexicographically because 01 is a prefix
    of 0101.
  • 01 < 0101
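
The two rules on this slide can be written out as a small comparison function. This is an illustrative sketch; the name `lex_less` is hypothetical, and Python's built-in string comparison gives the same result for strings of 0s and 1s.

```python
def lex_less(a: str, b: str) -> bool:
    """Return True when binary string a is lexicographically smaller than b."""
    # Rule 1: compare bit by bit from left to right; the first difference decides.
    for x, y in zip(a, b):
        if x != y:
            return x == "0" and y == "1"
    # Rule 2: no difference in the common part, so a proper prefix is smaller.
    return len(a) < len(b)


print(lex_less("0011", "01"))   # decided by the 2nd bit
print(lex_less("01", "0101"))   # decided by the prefix rule
```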

17
Find a binary string between two binary strings
lexicographically
  • To insert a binary string between 0011 and 01:
  • the size of 0011 is 4, which is larger than the
    size of 01, which is 2; this is Case (a) (larger
    than or equal),
  • therefore we directly concatenate one more 1
    after 0011.
  • The inserted binary string is 00111, and
  • 0011 < 00111 < 01
    lexicographically.
  • To insert a binary string between 01 and 0101:
  • the size of 01 is 2, which is smaller than the
    size of 0101, which is 4; this is Case (b)
    (smaller than),
  • therefore we change the last bit 1 of 0101 to
    01, i.e. the inserted binary string is 01001, and
  • 01 < 01001 < 0101
    lexicographically.
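
Cases (a) and (b) can be sketched in a few lines. The function name `insert_between` is hypothetical, and the sketch assumes that both inputs are non-empty, end with 1, and satisfy left < right lexicographically, as in the two examples on this slide.

```python
def insert_between(left: str, right: str) -> str:
    """Return a binary string strictly between left and right.

    Assumes left < right lexicographically and that both end with '1'.
    """
    if len(left) >= len(right):
        # Case (a): concatenate one more 1 after the left string.
        return left + "1"
    # Case (b): change the last bit 1 of the right string to 01.
    return right[:-1] + "01"


print(insert_between("0011", "01"))   # 00111
print(insert_between("01", "0101"))   # 01001
```

Neither neighbor is modified, which is what makes the encoding dynamic: the new code is derived from one neighbor and still sorts between the two.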

18
(2) Compact encoding
  • Inserting between two binary strings achieves the
    dynamic objective.
  • Further, we need a compact encoding, so we propose
    a Compact Dynamic Binary String encoding, called
    CDBS.

19
Example illustration of CDBS
  • We show how to encode 18 numbers with our
    CDBS encoding
  • This is only an example; any other numbers can be
    encoded with our CDBS
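
The step-by-step slides for this example have no transcript, so the following is a reconstruction under stated assumptions rather than the authors' algorithm. It assumes the codes are assigned by repeatedly taking the middle position (rounded half up) of a range and giving it a string between the codes at the two range ends, with empty strings standing in for the positions before the first and after the last number. The names `encode` and `insert_between` are hypothetical. Under these assumptions the sketch reproduces the codes the talk gives for 4 numbers (001, 01, 1, 11) and for 2 numbers (01, 1).

```python
from typing import List


def insert_between(left: str, right: str) -> str:
    # Case (a): left is at least as long, so append 1 to left.
    if len(left) >= len(right):
        return left + "1"
    # Case (b): left is shorter, so change the last bit 1 of right to 01.
    return right[:-1] + "01"


def encode(n: int) -> List[str]:
    """Assign n binary strings in increasing lexicographical order."""
    # Positions 0 and n + 1 are empty boundary codes (an assumption).
    codes = [""] * (n + 2)

    def fill(lo: int, hi: int) -> None:
        if hi - lo < 2:
            return  # no position left between lo and hi
        mid = (lo + hi + 1) // 2  # middle position, rounded half up
        codes[mid] = insert_between(codes[lo], codes[hi])
        fill(lo, mid)
        fill(mid, hi)

    fill(0, n + 1)
    return codes[1:n + 1]


for number, code in enumerate(encode(18), start=1):
    print(number, code)
```

For 18 numbers, this sketch places 0011 and 01 at positions 4 and 5, the adjacent pair used in the insertion examples of the talk.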

20-27
(No Transcript)

28
(3) Applying CDBS to the containment scheme
  • Replace the start and end values 1 to 18 with
    our CDBS codes
  • Start and end values are compared in
    lexicographical order
  • The level value stays the same

29
Applying CDBS to the prefix scheme
  • The CDBS codes for 4 numbers are
  • 001, 01, 1 and 11.
  • The CDBS codes for 2 numbers are
  • 01 and 1.

30
Applying CDBS to the prime scheme
  • Store the document order with our CDBS codes.
  • The order of the nodes is determined from the
    lexicographical order of the codes.
  • The label size and the query performance of Prime
    are poor, so we do not show the details.

31
Processing updates based on CDBS for the containment
scheme
  • To insert two binary strings between 0011 and
    01, the two inserted binary strings will be
    00111 and 001111.
  • The complete label of the inserted node is
    (00111, 001111, 3), i.e. start, end, and level
  • There is no need to re-label the existing nodes,
    yet relationships such as ancestor-descendant
    can still be determined, and the order is
    kept.
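
The insertion on this slide can be traced in code. This is a sketch, not the authors' implementation: `insert_between` and `is_ancestor` are hypothetical names, and the parent label (001, 011, 2) is an invented example used only to show that the containment test still works on binary strings.

```python
from typing import Tuple

Label = Tuple[str, str, int]  # (start, end, level), start and end as binary strings


def insert_between(left: str, right: str) -> str:
    # Case (a): append 1 to left; Case (b): change the last 1 of right to 01.
    if len(left) >= len(right):
        return left + "1"
    return right[:-1] + "01"


def is_ancestor(a: Label, d: Label) -> bool:
    # Same interval test as with integers, but strings compare lexicographically.
    return a[0] < d[0] and d[1] < a[1]


# Two new values are needed between the existing values 0011 and 01.
start = insert_between("0011", "01")   # 00111
end = insert_between(start, "01")      # 001111
new_node: Label = (start, end, 3)

# Hypothetical existing ancestor; its label is left untouched by the insertion.
parent: Label = ("001", "011", 2)
print(new_node, is_ancestor(parent, new_node))
```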

32
Processing updates based on CDBS for the prefix
scheme
  • To insert a binary string before 01, the
    inserted binary string will be 001
  • The complete label of the inserted node is
    01.001
  • There is no need to re-label the existing nodes,
    yet relationships such as ancestor-descendant
    can still be determined, and the order is
    kept.
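
A sketch of this insertion, under the assumption that inserting before the first sibling is handled like Case (b) with an empty left neighbor, which yields the 001 shown on the slide. Labels are kept as tuples of components; the function names and the existing sibling label 01.01 are hypothetical.

```python
from typing import Tuple

PrefixLabel = Tuple[str, ...]  # one binary string per level, e.g. ("01", "001")


def insert_before_first(first_self_label: str) -> str:
    # Assumed rule: with no left neighbor, change the last bit 1 of the
    # first sibling's own code to 01.
    return first_self_label[:-1] + "01"


def is_ancestor(a: PrefixLabel, d: PrefixLabel) -> bool:
    # Prefix property: the ancestor's components are a proper prefix.
    return len(a) < len(d) and d[:len(a)] == a


def show(label: PrefixLabel) -> str:
    return ".".join(label)


parent: PrefixLabel = ("01",)
first_child: PrefixLabel = parent + ("01",)            # hypothetical existing node
new_child: PrefixLabel = parent + (insert_before_first("01"),)

print(show(new_child))              # 01.001
print(new_child < first_child)      # the new node sorts before the old first child
```

Tuple comparison in Python compares component by component, and each component is compared lexicographically, so document order among these labels follows directly from `<`.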

33
A problem with CDBS
  • V-CDBS and F-CDBS may run into an overflow problem
    with their size when many nodes are inserted.
  • To solve the overflow problem, we propose QED in
    (Li and Ling, CIKM05)
  • QED uses four quaternary symbols, i.e. 0, 1, 2,
    and 3, and each is stored with 2 bits
  • 0 is used as the separator or delimiter, so QED
    never encounters the overflow problem
  • QED is not as compact as CDBS, and its update cost
    is higher

34
(4) Experimental results
  • Experimental setup
  • Performance study on static XML
  • Performance study on updates

35
Experimental setup
  • All the schemes are implemented in Java and all
    the experiments are carried out on a 3.0 GHz
    Pentium 4 processor with 1 GB RAM running Windows
    XP Professional.

36
Experimental setup (cont.)
  • The following table shows the datasets we used.

37
Performance study on static XML
  • Our V-CDBS and F-CDBS are the most compact
    variable-length and fixed-length dynamic encodings,
    respectively

(Figure: Label sizes of different schemes)

38
The 5 cases of node updates in experiments
  • We select one XML file, Hamlet, in dataset D1 to
    test the update performance (the results are
    similar for other XML files).
  • Hamlet has 5 act elements. We test the following
    5 cases:
  • inserting an act element before act1,
  • inserting an act element before act2,
  • ...,
  • and inserting an act element before act5.

39
Number of nodes to re-label in updates
40
Total time for node updates
  • When several nodes are inserted, most of the time
    is I/O time; our approaches are the best at
    processing updates.
  • When considering processing time only, our
    approaches are much better, more than 300 times
    faster. They are more appropriate for updates with
    many nodes.

(Figure: Log2(Update time) of different schemes)

41
Conclusion
  • Our CDBS is dynamic
  • Our CDBS is the most compact
  • Its update cost is the cheapest: we only need to
    modify the last 1 bit of the neighbor label