XRTree: Indexing Data for Efficient Structural Joins

Provided by: jorge68
1
XR-Tree: Indexing Data for Efficient Structural
Joins
  • By Jiang, Lu, Wang, and Ooi

Presented by Jorge Mena
7 April 2005
2
Roadmap
  • Motivation
  • Introduction
  • Concepts and Definitions
  • XR-Tree Structure
  • Updating an XR-Tree
  • Structural Joins using XR-Trees
  • Performance Study and Evaluation
  • Conclusions

3
Motivation
  • Join operations are one of the most useful
    operations in databases
  • Primary means to combine data
  • In XML, Structural Joins consume a large portion
    of time to evaluate path expressions
  • Recent proposals improve on this by accessing the
    data at most once, but they still access some
    data unnecessarily

4
The Problem
  • We want to evaluate structural relationships
    between XML element sets efficiently.
  • For example, paragraph//section returns all
    sections that are contained in each paragraph
  • How do we do this?
  • Scan the XML document
  • Index the tags and find occurrences of
    relationships (Structural Joins)
  • Improve efficiency of Structural Joins

5
The Proposal
  • A dynamic external memory index structure
    specially designed for XML Data
  • Build an index on the region codes of the element
    nodes, which are of the form
  • (start, end)
  • Thus, XR-Tree = XML Region Tree
  • Fully exploit the strictly nested property of
    XML.
  • Join two element sets using XR-Trees to skip
    elements that do not participate in the join.

6
Introduction
  • XML Data are commonly modeled by a tree structure

7
Introduction
  • To determine structural relationships, use a
    numbering scheme
  • Note that for two elements u and v, either
  • They are nested: u.start < v.start < v.end < u.end, or
  • They do not intersect
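
The nesting test above can be sketched in a few lines. This is a minimal illustration, assuming regions are (start, end) integer pairs; the names Region, is_ancestor and are_disjoint are hypothetical and do not come from the slides.

```python
from typing import NamedTuple


class Region(NamedTuple):
    """Region code of an element: (start, end) positions in the document."""
    start: int
    end: int


def is_ancestor(u: Region, v: Region) -> bool:
    """True if v is strictly nested inside u."""
    return u.start < v.start and v.end < u.end


def are_disjoint(u: Region, v: Region) -> bool:
    """True if the two regions do not intersect at all."""
    return u.end < v.start or v.end < u.start


outer = Region(1, 20)
inner = Region(3, 8)
sibling = Region(21, 30)

print(is_ancestor(outer, inner))     # nested pair
print(are_disjoint(outer, sibling))  # non-intersecting pair
```

With a numbering scheme like this, an ancestor-descendant check needs only two integer comparisons and no tree traversal.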

8
Introduction
  • Structural Join is to find all occurrences of
    structural relationships between element sets

9
Concepts and Definitions
  • Definition 1
  • Given a key k and an element with region Ei = (si,
    ei), k stabs Ei, or Ei is stabbed by k, if
    si ≤ k ≤ ei.
  • Given a set of ordered keys kj (0 ≤ j < n), where
  • kx < ky if x < y, and an element Ei = (si, ei), kj
    primarily stabs Ei, or Ei is primarily stabbed by
    kj, if
  • (1) kj lies in the interval [si, ei], that is, kj stabs Ei
  • (2) kj is the smallest key that stabs Ei
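
Definition 1 can be restated as two small functions. This is only a sketch, assuming integer positions and keys given in ascending order; stabs and primary_stabber are hypothetical names chosen for illustration.

```python
from typing import Optional, Sequence, Tuple

Element = Tuple[int, int]  # region (s, e)


def stabs(k: int, elem: Element) -> bool:
    """k stabs the element if s <= k <= e."""
    s, e = elem
    return s <= k <= e


def primary_stabber(keys: Sequence[int], elem: Element) -> Optional[int]:
    """Index j of the smallest key that stabs elem, or None if no key does.

    Assumes keys are sorted in ascending order, as in the definition.
    """
    for j, k in enumerate(keys):
        if stabs(k, elem):
            return j
    return None


keys = [5, 12, 30]
print(primary_stabber(keys, (2, 40)))   # stabbed by all three keys
print(primary_stabber(keys, (10, 15)))  # stabbed only by the key 12
print(primary_stabber(keys, (6, 9)))    # stabbed by no key
```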

10
Concepts and Definitions
  • For example
  • k0 stabs elements (s0, e0), (s1, e1), (s2, e2)
  • k0 primarily stabs interval (s0, e0)
  • k2 primarily stabs interval (s4, e4)

11
Concepts and Definitions
  • Definition 2
  • Given a set of ordered keys kj (0 ≤ j < n), where
  • kx < ky if x < y, and a set of elements e =
    ∪i (si, ei), the stab list of a key kj is the list
    of elements in e that are stabbed by kj, denoted
    SLj or SL(kj).
  • The primary stab list, PSLj or PSL(kj), of a
    key kj is the list of elements that are primarily
    stabbed by kj.
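
To make the difference between SL and PSL concrete, here is a small sketch that computes both by brute force. It assumes ascending keys and (s, e) integer regions; stab_lists and primary_stab_lists are hypothetical helper names, and the sample data is invented for illustration.

```python
from typing import List, Sequence, Tuple

Element = Tuple[int, int]  # region (s, e)


def stab_lists(keys: Sequence[int], elements: Sequence[Element]) -> List[List[Element]]:
    """SL(kj) for every key: all elements stabbed by kj."""
    return [[(s, e) for (s, e) in elements if s <= k <= e] for k in keys]


def primary_stab_lists(keys: Sequence[int], elements: Sequence[Element]) -> List[List[Element]]:
    """PSL(kj) for every key: elements whose smallest stabbing key is kj."""
    psl: List[List[Element]] = [[] for _ in keys]
    for s, e in sorted(elements):          # keep each PSL ordered by start
        for j, k in enumerate(keys):       # keys are ascending
            if s <= k <= e:
                psl[j].append((s, e))
                break                      # only the smallest stabbing key counts
    return psl


keys = [5, 12, 30]
elements = [(1, 40), (3, 8), (10, 15), (20, 25)]
print(stab_lists(keys, elements))
print(primary_stab_lists(keys, elements))
```

An element can appear in several stab lists but in at most one primary stab list, and an element such as (20, 25) that no key stabs appears in neither.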

12
Concepts and Definitions
  • For example
  • SL1 = {(s0, e0), (s3, e3)}
  • PSL1 = {(s3, e3)}
  • Note: PSL3 = ∅
  • Note the strict ancestor-descendant relationships
    among the elements in a stab list

13
Concepts and Definitions
  • Definition 3
  • Given kj,
  • psj = start position of the first element of
    PSLj
  • pej = end position of the first element of
    PSLj
  • These values are nil if PSLj = ∅.
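
Definition 3 amounts to reading off the first region of a primary stab list. The sketch below assumes that "first" means the element with the smallest start position and uses None in place of nil; first_region is a hypothetical name.

```python
from typing import List, Optional, Tuple

Element = Tuple[int, int]  # region (s, e)


def first_region(psl: List[Element]) -> Tuple[Optional[int], Optional[int]]:
    """(ps, pe) for a primary stab list; (None, None) stands in for nil."""
    if not psl:
        return (None, None)
    # Assumption: the first element is the one with the smallest start.
    return min(psl, key=lambda el: el[0])


print(first_region([(3, 8), (1, 40)]))
print(first_region([]))
```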

14
Concepts and Definitions
  • For example
  • For k0 with PSL0 = {(s0, e0), (s1, e1), (s2,
    e2)}, (ps0, pe0) = (s0, e0)
  • Note: (ps3, pe3) = (nil, nil)

15
XR-Tree Structure
  • An XR-Tree has the following properties
  • It is a balanced tree
  • An internal node contains m entries in the form
    (ki, psi, pei), with k0 < k1 < ... < km-1 and
    d ≤ m ≤ 2d, where d is the degree of the tree.
  • An internal node with m keys contains m + 1 pointers
    pj pointing to the nodes in the next level of the
    tree.

16
XR-Tree Structure
  • More properties
  • Each internal node has a stab list SL(n) that
    holds all elements Ei such that Ei is stabbed by
    at least one key in n and not by any key of any
    ancestor of n. Each element is in the form
    (s,e,pointer).

17
XR-Tree Structure
  • More Properties
  • Leaf nodes contain element entries
    (s,e,InStabList,pointer).
  • Leaf nodes are linked from left to right
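
The node layout described on these slides might be modeled as follows. This is a sketch only: the class and field names are hypothetical, pointers are reduced to plain integers, and the occupancy check ignores the usual exception for the root node.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Union


@dataclass
class KeyEntry:
    """Internal-node entry (k, ps, pe); ps and pe are None when the PSL is empty."""
    k: int
    ps: Optional[int] = None
    pe: Optional[int] = None


@dataclass
class StabEntry:
    """Stab-list entry (s, e, pointer)."""
    s: int
    e: int
    pointer: int


@dataclass
class LeafEntry:
    """Leaf entry (s, e, InStabList, pointer)."""
    s: int
    e: int
    in_stab_list: bool
    pointer: int


@dataclass
class LeafNode:
    entries: List[LeafEntry] = field(default_factory=list)
    next_leaf: Optional["LeafNode"] = None  # leaves are linked left to right


@dataclass
class InternalNode:
    entries: List[KeyEntry]
    children: List[Union["InternalNode", LeafNode]]
    stab_list: List[StabEntry] = field(default_factory=list)

    def is_well_formed(self, d: int) -> bool:
        """Check d <= m <= 2d, strictly ascending keys, and m + 1 child pointers."""
        m = len(self.entries)
        keys = [entry.k for entry in self.entries]
        ascending = all(a < b for a, b in zip(keys, keys[1:]))
        return d <= m <= 2 * d and ascending and len(self.children) == m + 1


right = LeafNode([LeafEntry(20, 25, False, 2)])
middle = LeafNode([LeafEntry(10, 15, True, 1)], next_leaf=right)
left = LeafNode([LeafEntry(1, 40, True, 0)], next_leaf=middle)
node = InternalNode(
    entries=[KeyEntry(5, 1, 40), KeyEntry(12, 10, 15)],
    children=[left, middle, right],
    stab_list=[StabEntry(1, 40, 0), StabEntry(10, 15, 1)],
)
print(node.is_well_formed(d=1))
```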

18
XR-Tree Structure Recap
  • Essentially a B+ tree
  • Complex index key entries
  • A stab list associated with each internal node.
  • Keys should be chosen to minimize the size of the
    stab lists.
  • Use as the internal key a value that is
    smaller than every key in the right branch.

19
Stab Lists
  • Contain elements (regions, intervals)
  • Each element is found in at most one stab list
  • Total elements in all stab lists ≤ total elements
    indexed
  • Maximum number of pages for a stab list
  • Smax = (hd × BI × fmax) / (BS × fmin)
  • hd = maximum depth of nesting
  • BI = maximum number of entries per node
  • BS = maximum number of tuples a stab list page can
    hold
  • f = fill ratios (fmax, fmin)
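
As a quick arithmetic check of the bound, the sketch below evaluates it for made-up parameter values. It reads the slide's formula as Smax = (hd × BI × fmax) / (BS × fmin); rounding up to whole pages and the sample numbers are assumptions, and max_stab_list_pages is a hypothetical name.

```python
import math


def max_stab_list_pages(h_d: int, b_i: int, b_s: int, f_max: float, f_min: float) -> int:
    """Upper bound on stab-list pages, rounded up to whole pages (assumption)."""
    return math.ceil(h_d * b_i * f_max / (b_s * f_min))


# Illustrative values only: nesting depth 10, 100 entries per node,
# 200 tuples per stab-list page, fill ratios between 0.5 and 1.0.
print(max_stab_list_pages(h_d=10, b_i=100, b_s=200, f_max=1.0, f_min=0.5))
# Doubling the nesting depth doubles the bound.
print(max_stab_list_pages(h_d=20, b_i=100, b_s=200, f_max=1.0, f_min=0.5))
```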

20
Stab Lists with high nesting
  • Smax is directly proportional to the nesting of
    the relationship.
  • Use a ps directory page that maps each psj to
    the location of the PSLj

21
Updating: Insertion
  • Insertion is similar to that in B+ trees, except
    that the stab lists must also be maintained
  • 1. Find a leaf page to insert
  • 2. Insert into leaf
  • 3. Insert into internal node
  • 4. Grow the tree taller

22
Updating: Insertion with Overflow
  • In case of overflow, we need to split the stab
    list as well
  • However, the cost is constant because we only
    need to read the page where the split occurs
  • A new key k is promoted to the upper level with
    a stab list (StabSet) associated with it.
  • InStabList flags are updated accordingly.

23
Updating: Insertion Cost
  • Theorem 1
  • The amortized I/O cost for inserting is
    O(log_F N + CDP)
  • N = number of elements indexed
  • F = fanout of the XR-Tree
  • CDP = the cost of one displacement of an element
    (removing it from one stab list and inserting
    it into another).

24
Updating: Deletion
  • Similar to deletion in a B+ tree
  • 1. Find element, delete as you go
  • 2. Delete element from leaf page
  • 3. Delete entry from internal node
  • 4. Shorten the tree if necessary

25
Updating: Deletion with Underflow
  • If a deletion causes underflow, first we try to
    redistribute the keys in the internal nodes
  • Then redistribute the elements in the stab lists
    that are affected.
  • If a merge is needed, just link the stab lists of
    the two merging nodes.
  • The cost of deletion is (Theorem 3)
  • O(log_F N + 3 CDP)

26
Cost of manipulating the Stab List
  • Maximum number of pages for a stab list
  • Smax = (hd × BI × fmax) / (BS × fmin)
  • Due to high nesting
  • CDP = CSI + CSD
  • CSI = cost of inserting
  • CSD = cost of deleting
  • Thanks to the use of ps directory pages,
    insertion and deletion costs are just a couple of
    I/Os

27
Structural Joins with XR-Trees
  • In order to do efficient structural joins, the
    authors defined two basic operations
  • Search for Descendants
  • Search for Ancestors
  • These basic operations are used with the proposed
    stack-based structural join to perform the
    desired join.

28
Searching for Descendants
  • Given an element Ea = (sa, ea), find all the
    descendants Ei of Ea such that
  • sa < Ei.start < ea
  • No need to access stab lists
  • Cost
  • O(log_F N + R/B)
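
The descendant search is a range scan over entries ordered by start position. The sketch below imitates it on a plain sorted Python list, with bisect standing in for the descent through the tree and the loop standing in for the walk along linked leaf pages; find_descendants and the sample data are hypothetical.

```python
from bisect import bisect_right
from typing import List, Tuple

Element = Tuple[int, int]  # region (s, e)


def find_descendants(sorted_elems: List[Element], ancestor: Element) -> List[Element]:
    """All elements Ei with sa < Ei.start < ea, from a list sorted by start."""
    sa, ea = ancestor
    # Locate the first element whose start is greater than sa.
    i = bisect_right(sorted_elems, (sa, float("inf")))
    out = []
    # Scan forward while the start position is still inside the ancestor.
    while i < len(sorted_elems) and sorted_elems[i][0] < ea:
        out.append(sorted_elems[i])
        i += 1
    return out


elems = [(1, 40), (3, 8), (4, 5), (10, 15), (20, 25), (41, 50)]
print(find_descendants(elems, (1, 40)))
print(find_descendants(elems, (3, 8)))
```

Because regions are strictly nested, every element whose start falls inside the ancestor is a descendant, so no stab list has to be consulted.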

29
Searching for Ancestors
  • Not as trivial as searching for Descendants
  • We want elements Ei that are ancestors of Ed such
    that Ei.start < sd < Ei.end
  • That is, the elements stabbed by sd
  • But such elements are not stored in order; they
    are scattered in leaf pages to the left of the
    leaf page on the search path of sd.
  • That's why we have stab lists!
  • Traversing from the root to the leaf pages,
    search the stab lists and find the elements
    stabbed by sd. When at the leaves, output those
    elements stabbed by sd but not included in the
    stab lists of internal nodes.

30
Searching for Ancestors
  • To find elements stabbed by sd in internal nodes,
    just scan the PSLs of the particular node
  • Note that we do not waste time scanning a PSL
    that cannot contribute: a PSLc is scanned only
    if psc < sd < pec

31
Searching for Ancestors
  • Again, to find ancestors, just scan the PSL lists
    at each internal node and output the results
  • At the leaves, just output those elements that
    are not in the intermediate nodes (because those
    are output already).

Cost: O(log_F N + R) by Theorem 4
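
The ancestor search can be illustrated on a single flattened node. This sketch assumes each PSL is ordered by start position so that its first region is the outermost one, and it checks all non-stabbed leaf entries directly rather than only those on the search path; find_ancestors and its parameters are hypothetical names, not the paper's algorithm verbatim.

```python
from typing import List, Sequence, Tuple

Element = Tuple[int, int]  # region (s, e)


def find_ancestors(psls: Sequence[List[Element]],
                   leaf_only: Sequence[Element],
                   sd: int) -> List[Element]:
    """Elements (s, e) with s < sd < e.

    psls:      one primary stab list per key, each ordered by start
    leaf_only: leaf entries that are in no stab list (InStabList is false)
    """
    out = []
    for psl in psls:
        if not psl:
            continue                      # (ps, pe) is nil
        ps, pe = psl[0]
        if not (ps < sd < pe):
            continue                      # outermost region misses sd: skip the whole PSL
        out.extend((s, e) for (s, e) in psl if s < sd < e)
    # Ancestors that were never placed in a stab list are found at the leaves.
    out.extend((s, e) for (s, e) in leaf_only if s < sd < e)
    return sorted(out)


psls = [[(1, 40), (3, 8)], [(10, 15)], []]
leaf_only = [(20, 25), (21, 24)]
print(find_ancestors(psls, leaf_only, 22))
print(find_ancestors(psls, leaf_only, 4))
```

The (ps, pe) test is what lets whole primary stab lists be skipped without reading them.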
32
The Structural Join Algorithm using XR-Tree
Indexed data
  • Assume that input lists A and D (for AList and
    DList) are sorted by start position
  • Both sets are indexed by XR-Trees
  • Therefore leaf pages are sorted by start position
  • The algorithm proceeds just like Merge-Join but
    it effectively skips elements that do not
    participate in the join
  • Use either of the two search algorithms to
    retrieve ancestors or descendants

33
The Structural Join Algorithm using XR-Tree
Indexed data
  • Loop until one list is empty
  • Keep a stack with ancestors that are potentially
    joinable with CurD

34
Performance Study and Evaluation
  • The players
  • No-index algorithm
  • B+ algorithm
  • XR-stack algorithm
  • The metrics
  • Number of elements scanned
  • Elapsed time

35
Varying Join Selectivity on Ancestors
  • Metric: Elements scanned
36
Varying Join Selectivity on Descendants
  • Metric: Elements scanned

37
Varying Join Selectivity
  • Metric: CPU time

38
Conclusion
  • XR-Trees can support retrieval of all ancestors
    or descendants of an element E in an element set
    e indexed by an XR-Tree with an optimal
    worst-case I/O cost.
  • A stack-based algorithm, XR-stack, is proposed.
  • A major improvement over structural joins on
    non-indexed or B+-tree indexed datasets.