Title: XRTree: Indexing Data for Efficient Structural Joins
1XR-Tree: Indexing Data for Efficient Structural Joins
- By Jiang, Lu, Wang, and Ooi
Presented by Jorge Mena
7 / April / 2005
2Roadmap
- Motivation
- Introduction
- Concepts and Definitions
- XR-Tree Structure
- Updating an XR-Tree
- Structural Joins using XR-Trees
- Performance Study and Evaluation
- Conclusions
3Motivation
- Joins are among the most useful operations in databases
- Primary means to combine data
- In XML, structural joins consume a large portion of the time needed to evaluate path expressions
- Recent proposals improve on this by accessing the data at most once, but they still access some data unnecessarily
4The Problem
- We want to evaluate structural relationships between XML element sets efficiently.
- For example, paragraph // section returns all sections that are contained in each paragraph
- How do we do this?
- Scan the XML document
- Index the tags and find occurrences of relationships (structural joins)
- Improve the efficiency of structural joins
5The Proposal
- A dynamic external-memory index structure specially designed for XML data
- Build an index on the region codes of the element nodes, which are of the form (start, end)
- Hence the name: XR-Tree = XML Region Tree
- Fully exploit the strictly nested property of XML.
- Join two element sets using XR-Trees to skip elements that do not participate in the join.
6Introduction
- XML Data are commonly modeled by a tree structure
7Introduction
- To determine structural relationships, use a numbering scheme
- Note that for two elements u and v, either
- They are nested: u.start < v.start < v.end < u.end
- They do not intersect
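
The numbering scheme can be illustrated with a small sketch. It assumes a simplified scheme in which a counter is incremented on entering and on leaving each element; the function names `region_codes` and `is_ancestor` are hypothetical and not taken from the paper.

```python
import xml.etree.ElementTree as ET

def region_codes(root):
    """Assign a (tag, start, end) region code to every element.

    Assumption: one shared counter, incremented on entry and on exit,
    so that regions are either strictly nested or disjoint.
    """
    codes = []
    counter = 0

    def visit(node):
        nonlocal counter
        counter += 1
        start = counter
        for child in node:
            visit(child)
        counter += 1
        codes.append((node.tag, start, counter))

    visit(root)
    return sorted(codes, key=lambda c: c[1])  # document order

def is_ancestor(u, v):
    """True if region u = (start, end) strictly contains region v."""
    return u[0] < v[0] and v[1] < u[1]

doc = ET.fromstring("<a><b><c/></b><d/></a>")
codes = region_codes(doc)
# codes == [("a", 1, 8), ("b", 2, 5), ("c", 3, 4), ("d", 6, 7)]
```

With these codes, checking an ancestor-descendant relationship takes two comparisons and no tree traversal.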
8Introduction
- A structural join finds all occurrences of a structural relationship between two element sets
9Concepts and Definitions
- Definition 1
- Given a key k and an element with region Ei = (si, ei), k stabs Ei (or Ei is stabbed by k) if si ≤ k ≤ ei.
- Given a set of ordered keys kj (0 ≤ j < n), where kx < ky if x < y, and an element Ei = (si, ei), kj primarily stabs Ei (or Ei is primarily stabbed by kj) if
- (1) kj lies in the interval (si, ei)
- (2) kj is the smallest key that stabs Ei
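
Definition 1 can be written as two small predicates. This is a sketch only: the names `stabs` and `primary_stabber` are hypothetical, keys are assumed to be a sorted list of integers, and the sample values are made up for illustration.

```python
from bisect import bisect_left

def stabs(k, region):
    """k stabs region (s, e) if s <= k <= e."""
    s, e = region
    return s <= k <= e

def primary_stabber(keys, region):
    """Return the smallest key that stabs region, or None if no key does.

    keys must be sorted in ascending order. The smallest key >= s is the
    only candidate: if it exceeds e, no key lies inside the region.
    """
    s, e = region
    i = bisect_left(keys, s)
    if i < len(keys) and keys[i] <= e:
        return keys[i]
    return None

keys = [5, 12, 30]                       # hypothetical index keys
print(primary_stabber(keys, (1, 20)))    # 5: both 5 and 12 stab, 5 is smallest
print(primary_stabber(keys, (10, 19)))   # 12
print(primary_stabber(keys, (40, 50)))   # None
```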
10Concepts and Definitions
- For example
- k0 stabs elements (s0, e0), (s1, e1), (s2, e2)
- k0 primarily stabs interval (s0, e0)
- k2 primarily stabs interval (s4, e4)
11Concepts and Definitions
- Definition 2
- Given a set of ordered keys kj (0 ≤ j < n), where kx < ky if x < y, and a set of elements ε = ∪i (si, ei), the stab list of a key kj is the list of elements in ε that are stabbed by kj, denoted SLj or SL(kj).
- The primary stab list, PSLj or PSL(kj), of a key kj is the list of elements that are primarily stabbed by kj.
12Concepts and Definitions
- For example
- SL1 = {(s0, e0), (s3, e3)}
- PSL1 = {(s3, e3)}
- Note: PSL3 = ∅
- Note: elements stabbed by the same key are in strict ancestor-descendant relationships
13Concepts and Definitions
- Definition 3
- Given kj,
- psj = start position of the first element of PSLj
- pej = end position of the first element of PSLj
- These values are nil if PSLj = ∅.
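
Definitions 2 and 3 can be combined in a short sketch that builds every primary stab list and reads (psj, pej) off its first element. The function names and the sample keys and regions are assumptions made for illustration; `None` stands in for nil.

```python
from bisect import bisect_left

def primary_stab_lists(keys, elements):
    """Map each key to its PSL: the elements it primarily stabs.

    keys: sorted list of keys. elements: iterable of (start, end) regions.
    Each PSL is kept in ascending start order.
    """
    psl = {k: [] for k in keys}
    for s, e in sorted(elements):
        i = bisect_left(keys, s)          # smallest key >= s
        if i < len(keys) and keys[i] <= e:
            psl[keys[i]].append((s, e))
    return psl

def ps_pe(psl_of_key):
    """(ps, pe) of a key: region of the first PSL element, or (None, None)."""
    return psl_of_key[0] if psl_of_key else (None, None)

keys = [5, 12, 30]                                   # hypothetical keys
elements = [(1, 20), (2, 9), (4, 6), (10, 19), (11, 13), (40, 50)]
psl = primary_stab_lists(keys, elements)
# psl[5]  == [(1, 20), (2, 9), (4, 6)]
# psl[12] == [(10, 19), (11, 13)]
# psl[30] == []
```

Because the elements in one PSL all contain the same key, they are nested, so the first element (ps, pe) is the outermost one and bounds every other region in the list.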
14Concepts and Definitions
- For example
- For k0 with PSL0 = {(s0, e0), (s1, e1), (s2, e2)}, (ps0, pe0) = (s0, e0)
- Note: (ps3, pe3) = (nil, nil)
15XR-Tree Structure
- An XR-Tree has the following properties
- It is a balanced tree
- An internal node contains m entries of the form (ki, psi, pei), with k0 < k1 < ... < km-1 and d ≤ m ≤ 2d, where d is the degree of the tree.
- An internal node with m keys contains m+1 pointers pj pointing to the nodes in the next level of the tree.
16XR-Tree Structure
- More properties
- Each internal node n has a stab list SL(n) that holds all elements Ei such that Ei is stabbed by at least one key in n and not by any key of any ancestor of n. Each element is stored in the form (s, e, pointer).
17XR-Tree Structure
- More Properties
- Leaf nodes contain element entries (s, e, InStabList, pointer).
- Leaf nodes are linked from left to right
18XR-Tree Structure Recap
- Essentially a B+ tree
- Complex index key entries
- A stab list associated with each internal node.
- Keys should be chosen to minimize the size of the stab lists.
- Use the value (as the internal key) that is smaller than the keys at the right branch.
19Stab Lists
- Contain elements (regions, intervals)
- Each element is found in at most one stab list
- Total elements in all stab lists ≤ total elements indexed
- Maximum number of pages for a stab list:
- Smax = hd · BI · fmax / (BS · fmin)
- hd = maximum nesting depth
- BI = max number of entries per node
- BS = max number of tuples a stab list page can hold
- f = fill ratios
20Stab Lists with high nesting
- Smax is directly proportional to the nesting of the relationship.
- Use a ps directory page that maps each psj to the location of the PSLj
21Updating Insertion
- Insertion is similar to that in B+ trees, except that the stab lists must also be maintained
- 1. Find a leaf page to insert
- 2. Insert into leaf
- 3. Insert into internal node
- 4. Grow the tree taller
22Updating Insertion with Overflow
- In case of overflow, we need to split the stab list as well
- However, the cost is constant because we only need to read the page where the split occurs
- A new key k is proposed to the upper level with a stab list (StabSet) associated with it.
- InStabList flags are updated accordingly.
23Updating Insertion Cost
- Theorem 1
- The amortized I/O cost for inserting is O(log_F N + CDP)
- N = number of elements indexed
- F = fanout of the XR-Tree
- CDP = the cost of one displacement of an element (removing it from one stab list and inserting it into another).
24Updating Deletion
- Similar to deletion in a B+ tree
- 1. Find element, delete as you go
- 2. Delete element from leaf page
- 3. Delete entry from internal node
- 4. Shorten the tree if necessary
25Update Deletion Underflow
- If a deletion causes underflow, first we try to redistribute the keys in the internal nodes
- Then redistribute the elements in the stab lists that are affected.
- If a merge is needed, just link the stab lists of the two merging nodes.
- The cost of deletion is (Theorem 3)
- O(log_F N + 3·CDP)
26Cost of manipulating the Stab List
- Maximum number of pages for a stab list:
- Smax = hd · BI · fmax / (BS · fmin)
- Large when nesting is high
- CDP = CSI + CSD
- CSI = cost of inserting
- CSD = cost of deleting
- Thanks to the use of ps directory pages, insertion and deletion each cost just a couple of I/Os
27Structural Joins with XR-Trees
- In order to do efficient structural joins, the authors define two basic operations
- Search for Descendants
- Search for Ancestors
- These basic operations are used with the proposed stack-based structural join to perform the desired join.
28Searching for Descendants
- Given an element Ea = (sa, ea), find all descendants Ei of Ea, i.e. all elements such that sa < Ei.start < ea
- No need to access stab lists
- Cost
- O(log_F N + R/B)
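
The descendant search is a range query on start positions. The sketch below assumes an in-memory list sorted by start in place of the linked XR-Tree leaf pages, with binary search standing in for the root-to-leaf descent; `find_descendants` is a hypothetical name.

```python
from bisect import bisect_left, bisect_right

def find_descendants(elements, sa, ea):
    """Return all regions whose start lies strictly between sa and ea.

    elements: list of (start, end) regions sorted by start position.
    With strictly nested regions, sa < start < ea implies containment,
    so end positions never need to be checked.
    """
    starts = [s for s, _ in elements]
    lo = bisect_right(starts, sa)   # first start > sa
    hi = bisect_left(starts, ea)    # first start >= ea
    return elements[lo:hi]

elements = [(1, 20), (2, 9), (3, 4), (10, 19), (11, 12), (21, 22)]
print(find_descendants(elements, 2, 9))   # [(3, 4)]
```

The two binary searches mirror the logarithmic term of the cost, and the slice mirrors the sequential scan of the result in leaf order.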
29Searching for Ancestors
- Not as trivial as searching for Descendants
- We want elements Ei that are ancestors of Ed, i.e. such that Ei.start < sd < Ei.end
- These are the elements stabbed by sd
- But such elements are not stored in order; they are scattered in leaf pages to the left of the leaf page on the search path of sd.
- That is why we have stab lists!
- Traversing from the root to the leaf pages, search the stab lists for elements stabbed by sd. At the leaf, output those elements stabbed by sd that are not included in the stab list of any internal node.
30Searching for Ancestors
- To find the elements stabbed by sd in an internal node, just scan the relevant PSLs of that node
- Note that we will not waste time scanning a PSL, because it is scanned only if psc < sd < pec
31Searching for Ancestors
- Again, to find ancestors, just scan the PSLs at each internal node and output the results
- At the leaf, just output those elements that are not in the stab lists of internal nodes (because those have been output already).
- Cost: O(log_F N + R), by Theorem 4
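
The ancestor search can be sketched on a small in-memory tree. This is an illustration under stated assumptions, not the paper's structure: the tree is bulk-built and static, each internal key is the smallest start in its right subtree, each stab list is one flat list scanned in full (no PSLs, ps/pe entries or directory pages), and all names are hypothetical.

```python
from bisect import bisect_right
from dataclasses import dataclass, field

@dataclass
class Leaf:
    entries: list                      # [start, end, in_stab_list]

@dataclass
class Internal:
    keys: list
    children: list
    stab_list: list = field(default_factory=list)

def min_start(node):
    while isinstance(node, Internal):
        node = node.children[0]
    return node.entries[0][0]

def build(elements, fanout=2):
    """Bulk-build the tree, then place every element in at most one stab list."""
    elements = sorted(elements)
    level = [Leaf([[s, e, False] for s, e in elements[i:i + fanout]])
             for i in range(0, len(elements), fanout)]
    while len(level) > 1:
        groups = [level[i:i + fanout + 1]
                  for i in range(0, len(level), fanout + 1)]
        level = [Internal([min_start(c) for c in g[1:]], g) for g in groups]
    root = level[0]
    for s, e in elements:
        node, owner = root, None
        while isinstance(node, Internal):
            # owner = highest node on the path with a key stabbing (s, e)
            if owner is None and any(s <= k <= e for k in node.keys):
                owner = node
            node = node.children[bisect_right(node.keys, s)]
        if owner is not None:
            owner.stab_list.append((s, e))
            for entry in node.entries:
                if entry[0] == s:
                    entry[2] = True    # InStabList flag
    return root

def find_ancestors(root, sd):
    """Return all regions (s, e) with s < sd < e, in start order."""
    out, node = [], root
    while isinstance(node, Internal):
        out += [(s, e) for s, e in node.stab_list if s < sd < e]
        node = node.children[bisect_right(node.keys, sd)]
    # At the leaf, skip flagged entries: they were handled higher up.
    out += [(s, e) for s, e, flagged in node.entries
            if not flagged and s < sd < e]
    return sorted(out)

regions = [(1, 20), (2, 9), (3, 4), (5, 8), (6, 7),
           (10, 19), (11, 12), (13, 18), (14, 15), (16, 17)]
tree = build(regions)
print(find_ancestors(tree, 14))   # [(1, 20), (10, 19), (13, 18)]
```

Any ancestor stored in a leaf other than the one reached by sd is stabbed by a separator key on the search path, so it is found in a stab list along that path; the InStabList flag prevents reporting an element twice.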
32The Structural Join Algorithm using XR-Tree
Indexed data
- Assume that input lists A and D (for AList and DList) are sorted by start position
- Both sets are indexed by XR-Trees
- Therefore leaf pages are sorted by start position
- The algorithm proceeds just like a merge join, but it effectively skips elements that do not participate in the join
- Use either of the two search operations to retrieve ancestors or descendants
33The Structural Join Algorithm using XR-Tree
Indexed data
- Loop until one list is empty
- Keep a stack with ancestors that are potentially
joinable with CurD
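
The loop above can be sketched as a stack-based merge over the two sorted lists. This is a simplified stand-in rather than the XR-stack algorithm itself: skipping is emulated with binary search over arrays of start positions instead of XR-Tree probes, and the name `stack_join` is hypothetical.

```python
from bisect import bisect_right

def stack_join(alist, dlist):
    """Return all (ancestor, descendant) pairs.

    alist, dlist: lists of (start, end) regions sorted by start,
    drawn from one document so that regions are nested or disjoint.
    """
    a_starts = [s for s, _ in alist]
    d_starts = [s for s, _ in dlist]
    a = d = 0
    stack, out = [], []
    while d < len(dlist) and (a < len(alist) or stack):
        cur_d = dlist[d]
        # Drop stacked ancestors that end before the current descendant.
        while stack and stack[-1][1] < cur_d[0]:
            stack.pop()
        if a < len(alist) and alist[a][0] < cur_d[0]:
            cur_a = alist[a]
            if cur_a[1] > cur_d[0]:
                stack.append(cur_a)     # cur_a contains cur_d
                a += 1
            else:
                # cur_a ends before cur_d: skip it and everything nested in it.
                a = bisect_right(a_starts, cur_a[1], a + 1)
        elif stack:
            out.extend((anc, cur_d) for anc in stack)
            d += 1
        elif a < len(alist):
            # No ancestor for cur_d: skip descendants before the next ancestor.
            d = bisect_right(d_starts, alist[a][0], d + 1)
        else:
            break
    return out

A = [(1, 20), (2, 9), (10, 19), (13, 18), (30, 40)]
D = [(3, 4), (6, 7), (11, 12), (14, 15), (16, 17), (21, 22), (31, 32)]
pairs = stack_join(A, D)
```

The stack always holds a chain of nested ancestors of the current descendant, so every join result is emitted by pairing that descendant with the stack contents.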
34Performance Study and Evaluation
- The players
- No-index algorithm
- B+ algorithm
- XR-stack algorithm
- The metrics
- Number of elements scanned
- Elapsed time
35Varying Join Selectivity on Ancestors
- Metric: elements scanned
36Varying Join Selectivity on Descendants
37Varying Join Selectivity
38Conclusion
- XR-Trees support retrieval of all ancestors or descendants of an element E in an element set indexed by an XR-Tree, with optimal worst-case I/O cost.
- A stack-based algorithm, XR-stack, is proposed.
- A major improvement over structural joins on non-indexed or B+-tree-indexed datasets.