Title: Efficient Processing of XML Path Queries Using the Disk-based F
1Efficient Processing of XML Path Queries Using
the Disk-based FB Index
- Wei Wang
- University of New South Wales, Australia
With Hongzhi Wang (HIT), Hongjun Lu (HKUST),
Haifeng Jiang (IBM), Xuemin Lin (UNSW), Jianzhong
Li (HIT)
2XML Query Processing
- XML
- Modeled as a labeled tree
- Query by structural constraint
- Simple Path Queries, e.g., //Customer//Name
- Branching/Twig Queries, e.g., //Customer[//Zipcode]//Name
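To make the two query classes concrete, here is a minimal Python sketch using the standard library's `xml.etree.ElementTree`. The sample document and the helper names `simple_path` and `twig` are illustrative assumptions, and so is the reading of the branching example as "customers that have a Zipcode descendant".

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document; not taken from the slides.
DOC = """
<Shop>
  <Customer><Name>Ann</Name><Address><Zipcode>2052</Zipcode></Address></Customer>
  <Customer><Name>Bob</Name></Customer>
</Shop>
"""

def simple_path(root):
    """Simple path query //Customer//Name: every Name below some Customer."""
    return [n.text for c in root.iter("Customer") for n in c.iter("Name")]

def twig(root):
    """Branching query: Names below Customers that also have a Zipcode descendant."""
    return [
        n.text
        for c in root.iter("Customer")
        if c.find(".//Zipcode") is not None
        for n in c.iter("Name")
    ]

root = ET.fromstring(DOC)
path_result = simple_path(root)
twig_result = twig(root)
print(path_result, twig_result)
```

With nested Customer elements this naive version would report some names more than once; a real evaluator would deduplicate.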
3Index or Join?
Q1 /a/b
- Index-based approaches
- DataGuide, 1-index
- FB Index
- and a few approximate indexes
- Join-based approaches
- Structural join
- Twig join
[Figure: example tree with node labels a, b, b]
Join-based approaches appear to be more actively
researched!
4Outline
- Introduction
- Disk-based FB Index
- Experiment
- Conclusions
5XML Structural Indexes
- Exact Indexes
- 1-index
- Based on backward bisimilarity
- Covers all simple path queries
- FB Index
- Based on backward and forward bisimilarity
- Covers all branching queries (optimally)
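The two index definitions above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the data is a tree (so backward bisimilarity reduces to "same root-to-node label path"), the tree `TREE` is hypothetical, and `fb_index` uses a plain fixed-point refinement rather than any algorithm from the talk.

```python
from collections import defaultdict

# Hypothetical tree: node id -> (label, parent id, or None for the root).
TREE = {
    0: ("a", None),
    1: ("b", 0), 2: ("c", 1), 3: ("d", 1),
    4: ("b", 0), 5: ("d", 4),
    6: ("b", 0), 7: ("c", 6), 8: ("d", 6),
}

def one_index(nodes):
    """1-index classes on a tree: group nodes by root-to-node label path."""
    def label_path(n):
        label, parent = nodes[n]
        return (label_path(parent) if parent is not None else ()) + (label,)
    classes = defaultdict(list)
    for n in nodes:
        classes[label_path(n)].append(n)
    return dict(classes)

def fb_index(nodes):
    """F&B classes: refine the label partition by parent block and child blocks
    until the number of blocks stops growing."""
    children = defaultdict(list)
    for n, (_, parent) in nodes.items():
        if parent is not None:
            children[parent].append(n)
    block = {n: label for n, (label, _) in nodes.items()}
    while True:
        ids = {}
        refined = {}
        for n, (_, parent) in nodes.items():
            signature = (
                block[n],                                   # keeps the refinement monotone
                block[parent] if parent is not None else None,  # backward direction
                frozenset(block[c] for c in children[n]),   # forward direction
            )
            refined[n] = ids.setdefault(signature, len(ids))
        if len(ids) == len(set(block.values())):
            return refined
        block = refined

one = one_index(TREE)
fb = fb_index(TREE)
print(len(one), len(set(fb.values())))
```

In this toy tree the 1-index puts all three b-nodes in one class, while the F&B refinement separates the b-node that has no c child, which is what lets the FB index answer branching queries.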
6A Running Example
Q1: /a/b
Q2: /a/b[d]
Q3: /a/bcd
7Problems with FB Index?
- Lack of scalability
- Usually large in practice
- No immediate solution when it cannot be accommodated in memory
- Unbalanced, all-leaf-nodes tree
- Naïve solutions (e.g., B-tree, pre-order clustering in Lore, subtree clustering in Natix) do not work well
- Lack of efficiency
- Non-deterministic searching
- //-axis requires traversing the whole subtrees
- Much more costly when the index is not in memory
9Disk-based FB Index
- Overcome the memory limit by storing the FB index on disk
- Naïve method does not work well
Q1: /a/b
10Basic Idea
- Moral: clustering is important
- Cluster by tag → tape
- Cluster by parent → segment block
- Cluster by 1-index ID → chunk
- Benefits
- Optimized tree traversals
- Enable other intelligent algorithms
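One way to picture the three clustering levels is as nested grouping keys. The sketch below is an assumption-heavy illustration: the nesting order (tape, then chunk, then segment), the record layout in `INDEX_NODES`, and the function names are hypothetical, and the real on-disk format is not described on the slide.

```python
from collections import defaultdict

# Hypothetical FB-index nodes: (node id, tag, 1-index class id, parent node id).
INDEX_NODES = [
    ("n4", "d", 4, "n1"),
    ("n1", "b", 2, "n0"),
    ("n3", "c", 3, "n1"),
    ("n5", "d", 4, "n2"),
    ("n2", "b", 2, "n0"),
]

def cluster(index_nodes):
    """Group nodes as tapes[tag][1-index id][parent] -> list of node ids."""
    tapes = defaultdict(lambda: defaultdict(lambda: defaultdict(list)))
    for node_id, tag, one_index_id, parent in index_nodes:
        tapes[tag][one_index_id][parent].append(node_id)
    return tapes

def disk_order(tapes):
    """Flatten the nested clusters into one linear storage order."""
    return [
        node
        for tag in sorted(tapes)
        for chunk in sorted(tapes[tag])
        for parent in sorted(tapes[tag][chunk])
        for node in sorted(tapes[tag][chunk][parent])
    ]

tapes = cluster(INDEX_NODES)
order = disk_order(tapes)
print(order)
```

Nodes that a traversal is likely to touch together (same tag, same 1-index class, same parent) end up adjacent in `order`, which is the intuition behind "optimized tree traversals".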
11Q1 /a/b
12Q.P. by Tree Traversal
- Dim 1: DFS/BFS
- Dim 2: Path/Branching Path
- Dim 3: / or //
Q5: /a/b/c
Q2: /a/b[d]
Q4: /a//c
Problem: still have to traverse the entire subtrees to process //
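The cost difference between the two axes can be seen in a small step-at-a-time evaluator with depth-first descent. The index tree, the virtual root `"doc"`, and the `evaluate` function are hypothetical names introduced for this sketch; it handles simple paths only, not branching predicates.

```python
# Hypothetical index tree; "doc" is a virtual document root above the a-node.
CHILDREN = {
    "doc": ["a0"],
    "a0": ["b1", "b2"],
    "b1": ["c3", "d4"],
    "b2": ["c5"],
    "d4": ["e6"],
    "e6": ["f7"],
}
LABEL = {"a0": "a", "b1": "b", "b2": "b", "c3": "c",
         "d4": "d", "c5": "c", "e6": "e", "f7": "f"}

def evaluate(steps, children=CHILDREN, label=LABEL, doc_root="doc"):
    """steps is a list of (axis, tag) pairs, axis being "/" or "//".
    Returns (matching nodes, number of node visits)."""
    visited = 0

    def below(node, axis):
        # Yield children for "/", or all descendants (depth first) for "//".
        nonlocal visited
        for child in children.get(node, ()):
            visited += 1
            yield child
            if axis == "//":
                yield from below(child, axis)

    frontier = [doc_root]
    for axis, tag in steps:
        matched = []
        for node in frontier:
            for candidate in below(node, axis):
                if label[candidate] == tag and candidate not in matched:
                    matched.append(candidate)
        frontier = matched
    return frontier, visited

pc_nodes, pc_visits = evaluate([("/", "a"), ("/", "b"), ("/", "c")])  # /a/b/c
ad_nodes, ad_visits = evaluate([("/", "a"), ("//", "c")])             # /a//c
print(pc_nodes, pc_visits, ad_nodes, ad_visits)
```

Both queries return the same two c-nodes here, but the // version also walks the d, e and f nodes, because it cannot know in advance which subtrees contain a c.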
13Q.P. by RangeFetch
(chunkID, tagName)
Q4: /a//c
Restriction: can only answer /p//q, where p is a simple path.
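The slide gives only the lookup key (chunkID, tagName), so the following is a speculative sketch of how a range fetch over such keys could work, not the authors' algorithm. It assumes that chunk IDs are 1-index class IDs numbered in pre-order, so that all classes below the class of /p form one contiguous ID range; `ONE_INDEX`, `CHUNKS`, and `range_fetch` are hypothetical.

```python
from bisect import bisect_right

# Hypothetical 1-index: class id -> (label path, largest class id in its subtree).
# Class ids are assumed to be assigned in pre-order.
ONE_INDEX = {
    1: (("a",), 5),
    2: (("a", "b"), 4),
    3: (("a", "b", "c"), 3),
    4: (("a", "b", "d"), 4),
    5: (("a", "c"), 5),
}

# Hypothetical chunk directory sorted by (tag, chunk id); values are FB-index nodes.
CHUNKS = sorted([
    (("a", 1), ["A1"]),
    (("b", 2), ["B1", "B2"]),
    (("c", 3), ["C1"]),
    (("c", 5), ["C2"]),
    (("d", 4), ["D1", "D2"]),
])
KEYS = [key for key, _ in CHUNKS]

def range_fetch(p, q):
    """Answer /p//q: fetch every chunk with tag q whose id lies below p's class."""
    start = next(cid for cid, (path, _) in ONE_INDEX.items() if path == tuple(p))
    last = ONE_INDEX[start][1]
    lo = bisect_right(KEYS, (q, start))  # strictly below p's own class
    hi = bisect_right(KEYS, (q, last))
    return [node for _, nodes in CHUNKS[lo:hi] for node in nodes]

under_a = range_fetch(["a"], "c")
under_ab = range_fetch(["a", "b"], "c")
print(under_a, under_ab)
```

Under these assumptions the restriction on the slide is natural: a simple path p identifies exactly one 1-index class, and therefore exactly one ID range to scan.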
14More Data Structures
- 3 more tapes
- Add region code for each d-node in the extents → Extents Tape
- Use physical (start, end) codes
- Sort d-nodes according to (start, end)
- Add Doc Tape
- Add Value Tape
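The (start, end) region codes mentioned above let an ancestor test be done by interval containment. The sketch below assigns codes from a counter during one depth-first pass, which is a logical stand-in for the physical codes on the slide; the tree and both function names are hypothetical.

```python
# Hypothetical data tree: node -> list of children.
DATA_TREE = {"a": ["b1", "b2"], "b1": ["c"]}

def assign_regions(children, root):
    """Return {node: (start, end)} using one shared counter in a DFS pass."""
    regions = {}
    counter = 0

    def visit(node):
        nonlocal counter
        counter += 1
        start = counter
        for child in children.get(node, ()):
            visit(child)
        counter += 1
        regions[node] = (start, counter)

    visit(root)
    return regions

def is_ancestor(regions, anc, desc):
    """anc is a proper ancestor of desc iff its region strictly contains desc's."""
    a_start, a_end = regions[anc]
    d_start, d_end = regions[desc]
    return a_start < d_start and d_end < a_end

regions = assign_regions(DATA_TREE, "a")
print(regions)
```

Sorting d-nodes by these codes puts them in document order, which is what a later merge-style structural join relies on.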
15Example
16SegSJ
- Key observation
- Structural relationship between two segments can be inferred from the relationship between the first d-nodes in their extents.
- SegSJ(/p//q)
- R(s, e) ← A: /p
- S(s, e) ← D: //q
- Structural join R and S
- Using partition-based or sorting-based SJ algorithm
b1 → (10,78), (210, 297),
d1 → (19,25), (54, 66),
Take the (s, e) of the first d-node in each segment
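The join step can be sketched as follows, with each segment represented by the (s, e) code of its first d-node. This uses a stack-based merge over inputs sorted by start, which is one common sorting-based structural join and not necessarily the variant used in the talk. The entries b1 and d1 reuse the codes shown above; b2, d2 and d3 are hypothetical additions.

```python
def structural_join(ancestors, descendants):
    """Inputs are lists of (name, start, end) with properly nested regions.
    Returns (ancestor, descendant) name pairs where the ancestor's region
    strictly contains the descendant's."""
    anc = sorted(ancestors, key=lambda r: r[1])
    desc = sorted(descendants, key=lambda r: r[1])
    pairs = []
    stack = []  # open ancestors, outermost at the bottom
    i = 0
    for d_name, d_start, d_end in desc:
        # Push every ancestor candidate that starts before this descendant.
        while i < len(anc) and anc[i][1] < d_start:
            while stack and stack[-1][2] < anc[i][1]:
                stack.pop()  # already closed before the new candidate starts
            stack.append(anc[i])
            i += 1
        # Drop candidates that closed before this descendant starts.
        while stack and stack[-1][2] < d_start:
            stack.pop()
        pairs.extend((a[0], d_name) for a in stack if a[2] > d_end)
    return pairs

R = [("b1", 10, 78), ("b2", 300, 340)]                   # first d-nodes for /p
S = [("d1", 19, 25), ("d2", 310, 320), ("d3", 400, 410)]  # first d-nodes for //q
joined = structural_join(R, S)
print(joined)
```

Each input is scanned once after sorting, so no subtree of the index has to be traversed to answer the // step.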
18Experiments
- Setup
- DBLP/XMark/TreeBank
- 8 representative queries
- Dim 1: PC/AD
- Dim 2: Path/Twig
- Dim 3: Large/Small
- DFS, BFS, RangeFetch, SegSJ
- NoK, TwigStack, Kaushik's algorithm in SIGMOD 04
- Metric: time/PIO/LIO
19Varying Buffer Size (PC-Path)
20Varying Buffer Size (PC-Twig)
21Varying Buffer Size (AD-Path)
22Varying Buffer Size (AD-Twig)
23Buffer Hit Ratio
24Scalability
25Comparing with Other Systems
27Conclusions
- Disk-based FB Index
- Store and cluster the index on the disk
- More efficient and intelligent query processing algorithms
- Demonstrated good scalability and query efficiency
- Expecting new query processing algorithms based on index probing (in addition to join-based approaches)
28Q&A
29Related Work
- Indexes
- Exact: DataGuide, 1-index, FB Index
- Approx: Approx. DataGuide, A(k)-index, D(k)-index, M(k)-index
- Join-based approaches
- Hybrid approach: mixed-mode in VLDB 03
- Niagara
- VLDB 03: combines tree traversals and joins
- SIGMOD 04: uses 1-index to accelerate joins
- Clustering
- Lore: pre-order
- Natix: subtree