Efficient Processing of XML Path Queries Using the Disk-based F&B Index

Description:

University of New South Wales, Australia ... Niagara [VLDB 03] combines tree traversals joins [SIGMOD 04] use 1-index to accelerate joins ... –


1
Efficient Processing of XML Path Queries Using the Disk-based F&B Index

Wei Wang
University of New South Wales, Australia

With Hongzhi Wang (HIT), Hongjun Lu (HKUST), Haifeng Jiang (IBM), Xuemin Lin (UNSW), Jianzhong Li (HIT)
2
XML Query Processing

XML
Modeled as a labeled tree
Query by structural constraint
Simple Path Queries, e.g., //Customer//Name
Branching/Twig Queries, e.g., //Customer[//Zipcode]//Name

3
Index or Join?
Q1: /a/b

Index-based approaches
DataGuide, 1-index
F&B Index
and a few approximate indexes
Join-based approaches
Structural join
Twig join

[Figure: example tree with nodes a, b, b]
Join-based approaches appear to be more actively researched!
4
Outline

Introduction
Disk-based F&B Index
Experiment
Conclusions

5
XML Structural Indexes

Exact Indexes
1-index
Based on backward bisimilarity
Covers all simple path queries
F&B Index
Based on backward and forward bisimilarity
Covers all branching queries (optimally)
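
The two partitions above can be sketched with a naive refinement loop. This is a minimal illustration for tree-shaped data only, not the construction used in the talk: the function name `bisimulation_classes`, the node ids and the sample tree are all assumptions. Each round splits a class when its nodes differ in their parent's class (backward) and, optionally, in the set of their children's classes (forward), and stops when no class splits any further.

```python
from collections import defaultdict

def bisimulation_classes(labels, parent, forward):
    """Partition tree nodes by backward (and optionally forward) bisimilarity.

    labels:  dict node -> tag name
    parent:  dict node -> parent node (None for the root)
    forward: False gives the 1-index partition, True adds the forward
             condition as well.
    Returns a dict node -> class id.
    """
    children = defaultdict(list)
    for node, par in parent.items():
        if par is not None:
            children[par].append(node)

    cls = dict(labels)  # initial partition: one class per tag
    while True:
        ids = {}
        new = {}
        for node in labels:
            up = None if parent[node] is None else cls[parent[node]]
            down = frozenset(cls[c] for c in children[node]) if forward else None
            new[node] = ids.setdefault((cls[node], up, down), len(ids))
        # Each round only splits classes, so an unchanged count means stable.
        if len(ids) == len(set(cls.values())):
            return new
        cls = new

# Hypothetical document: a(b(c, d), b(d), b(c, d))
labels = {1: "a", 2: "b", 3: "b", 4: "b", 5: "c", 6: "d", 7: "d", 8: "c", 9: "d"}
parent = {1: None, 2: 1, 3: 1, 4: 1, 5: 2, 6: 2, 7: 3, 8: 4, 9: 4}

one_index = bisimulation_classes(labels, parent, forward=False)
fb_index = bisimulation_classes(labels, parent, forward=True)
print(len(set(one_index.values())))  # 4 classes: /a, /a/b, /a/b/c, /a/b/d
print(len(set(fb_index.values())))   # 6 classes: the b without a c child splits off
```

On a tree, the backward-only partition simply groups nodes by their root-to-node label path, which is why it covers simple path queries. Adding the forward condition separates the b node that has no c child, which is what a branching query needs to distinguish.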

6
A Running Example
Q1: /a/b
Q2: /a/b[d]
Q3: /a/b[c][d]
7
Problems with F&B Index?

Lack of scalability
Usually large in practice
No immediate solution when it cannot be accommodated in memory
Unbalanced, all-leaf-nodes tree
Naïve solutions (e.g., B-tree, pre-order clustering in Lore, subtree clustering in Natix) do not work well
Lack of efficiency
Non-deterministic searching
//-axis requires traversing whole subtrees
Much more costly when the index is not in memory

8
Outline

Introduction
Disk-based F&B Index
Experiment
Conclusions

9
Disk-based F&B Index

Overcome the memory limit by storing the F&B index on disk
Naïve method does not work well

Q1: /a/b
10
Basic Idea

Moral: Clustering is important
Cluster by tag → tape
Cluster by parent → segment block
Cluster by 1-index ID → chunk
Benefits
Optimized tree traversals
Enable other intelligent algorithms
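
One way to picture the three clustering criteria is as a sort order over index nodes. The sketch below is an assumption-heavy illustration: the tuple layout, the sample nodes and the nesting order (a tape holds chunks, a chunk holds segments) are not stated on the slide, which only names the three criteria.

```python
from itertools import groupby

# Hypothetical index nodes: (node_id, tag, parent_id, one_index_id)
index_nodes = [
    (5, "d", 2, 3),
    (1, "a", None, 0),
    (3, "b", 1, 1),
    (6, "d", 3, 3),
    (2, "b", 1, 1),
    (4, "c", 2, 2),
]

def layout(nodes):
    """Group nodes into tapes (by tag), chunks (by 1-index ID), segments (by parent).

    Returns {tag: [(one_index_id, [segment, ...]), ...]} where each segment is
    the list of node ids that share a parent.
    """
    def sort_key(n):
        node_id, tag, parent_id, one_index_id = n
        # -1 stands in for "no parent" so the key stays comparable.
        return (tag, one_index_id, -1 if parent_id is None else parent_id, node_id)

    ordered = sorted(nodes, key=sort_key)
    tapes = {}
    for tag, tape_nodes in groupby(ordered, key=lambda n: n[1]):
        chunks = []
        for one_index_id, chunk_nodes in groupby(tape_nodes, key=lambda n: n[3]):
            segments = [[n[0] for n in seg]
                        for _, seg in groupby(chunk_nodes, key=lambda n: n[2])]
            chunks.append((one_index_id, segments))
        tapes[tag] = chunks
    return tapes

tapes = layout(index_nodes)
print(tapes["d"])  # [(3, [[5], [6]])]
```

With this ordering, all children of one index node that carry the same tag sit next to each other on disk, so following a child step reads one contiguous segment rather than scattered pages.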

11
Q1: /a/b
12
Q.P. by Tree Traversal

Dim 1: DFS/BFS
Dim 2: Path/Branching Path
Dim 3: / or //

Q5: /a/b/c
Q2: /a/b[d]
Q4: /a//c
Problem: Still have to traverse entire subtrees to process //
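
A minimal sketch of traversal-based evaluation, assuming an in-memory index tree and simple (non-branching) path queries only. The tree, the `evaluate` function and the `touched` counter are illustrative assumptions. A `/` step looks only at children, while a `//` step walks every descendant of each context node.

```python
import re

# Hypothetical index tree: node -> (tag, [children])
tree = {
    1: ("a", [2, 3]),
    2: ("b", [4, 5]),
    3: ("b", [6]),
    4: ("c", []),
    5: ("d", []),
    6: ("d", []),
}

def parse(path):
    """'/a//c' -> [('/', 'a'), ('//', 'c')]"""
    return re.findall(r"(//|/)([^/]+)", path)

def descendants(tree, node):
    """Yield all proper descendants of node in depth-first pre-order."""
    stack = list(reversed(tree[node][1]))
    while stack:
        n = stack.pop()
        yield n
        stack.extend(reversed(tree[n][1]))

def evaluate(tree, root, path):
    """Return (matching nodes, number of nodes touched) for a simple path."""
    frontier = [None]  # None stands for the virtual document node above root
    touched = 0
    for axis, tag in parse(path):
        matched, seen = [], set()
        for ctx in frontier:
            if ctx is None:
                cands = [root] if axis == "/" else [root, *descendants(tree, root)]
            elif axis == "/":
                cands = tree[ctx][1]
            else:
                cands = descendants(tree, ctx)
            for n in cands:
                touched += 1
                if tree[n][0] == tag and n not in seen:
                    seen.add(n)
                    matched.append(n)
        frontier = matched
    return frontier, touched

print(evaluate(tree, 1, "/a/b"))   # ([2, 3], 3)
print(evaluate(tree, 1, "/a//d"))  # ([5, 6], 6)
```

The second call touches every node below a even though only two of them match, which is the cost the slide points at.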
13
Q.P. by RangeFetch

H(1, c) 3, 6

(chunkID, tagName)
Q4: /a//c
Restriction: Can only answer /p//q, where p is a simple path.
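
The lookup on this slide can be pictured as one hash probe followed by a sequential read. The sketch below rests on assumptions the slide does not confirm: that the entry 3, 6 is an inclusive range of positions on the tape for tag c, and that a tape can be addressed by position. Only the entry H(1, c) with values 3, 6 comes from the slide. The tape contents and the function `range_fetch` are hypothetical.

```python
# Table keyed by (chunkID, tagName); only this one entry appears on the slide.
H = {(1, "c"): (3, 6)}

# Hypothetical tape for tag "c": position -> index node name
tapes = {"c": {pos: f"c{pos}" for pos in range(1, 9)}}

def range_fetch(H, tapes, chunk_id, tag):
    """Answer /p//q given the chunk id reached by the simple path p and the tag q.

    Assumes the table entry is an inclusive (low, high) range on the tag's tape.
    """
    entry = H.get((chunk_id, tag))
    if entry is None:
        return []  # no node with this tag below the chunk
    low, high = entry
    tape = tapes[tag]
    return [tape[pos] for pos in range(low, high + 1)]

print(range_fetch(H, tapes, 1, "c"))  # ['c3', 'c4', 'c5', 'c6']
```

Under these assumptions the `//` step costs one table probe plus a contiguous read, with no subtree traversal. The restriction to /p//q follows from the key: the chunk id is only determined when p is a simple path.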
14
More Data Structures

3 more tapes
Add region code for each d-node in the extents → Extents Tape
Use physical (start, end) codes
Sort d-nodes according to (start, end)
Add Doc Tape
Add Value Tape
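
Region codes can be illustrated with a small numbering pass. This sketch uses a running counter for (start, end). The slide says the codes are physical, which suggests positions in the stored document, so the counter is a stand-in assumption, as are the sample tree and function names.

```python
def assign_regions(tree, root):
    """Assign (start, end) to every node of a tree given as node -> [children]."""
    region = {}
    counter = 0

    def visit(node):
        nonlocal counter
        counter += 1
        start = counter
        for child in tree.get(node, []):
            visit(child)
        counter += 1
        region[node] = (start, counter)

    visit(root)
    return region

def is_ancestor(anc, desc):
    """True when region anc strictly contains region desc."""
    return anc[0] < desc[0] and desc[1] < anc[1]

# Hypothetical data tree
tree = {"a": ["b1", "b2"], "b1": ["c"]}
region = assign_regions(tree, "a")
print(region["a"], region["b1"], region["c"], region["b2"])
# (1, 8) (2, 5) (3, 4) (6, 7)

# Extent of an index node, sorted by (start, end) as the slide suggests
extent_b = sorted([region["b2"], region["b1"]])
print(extent_b)  # [(2, 5), (6, 7)]
```

With these codes an ancestor-descendant test is two integer comparisons, and a sorted extent can be merged against another sorted extent in a single pass.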

15
Example
16
SegSJ

Key observation
The structural relationship between two segments can be inferred from the relationship between the first d-nodes in their extents.
SegSJ(/p//q)
R(s, e) ← A /p
S(s, e) ← D //q
Structural join R and S
Using partition-based or sorting-based SJ algorithm

b1 → (10,78), (210, 297),
d1 → (19,25), (54, 66),
Take the (s, e) of the first d-node in each segment
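
The steps above can be sketched as a sort-based structural join over one (s, e) pair per segment. This is an illustration, not the authors' SegSJ code: the stack-based merge is a standard sorting-based structural join swapped in for the unspecified "SJ algorithm", and the segments b2 and d2 are hypothetical additions. Only the b1 and d1 values come from the slide.

```python
# Extents per segment: b1 and d1 as shown on the slide, b2 and d2 hypothetical.
ancestor_extents = {"b1": [(10, 78), (210, 297)], "b2": [(400, 480)]}
descendant_extents = {"d1": [(19, 25), (54, 66)], "d2": [(410, 420)]}

def first_regions(extents):
    """Take the (s, e) of the first d-node in each segment."""
    return [(regions[0], seg_id) for seg_id, regions in extents.items()]

def structural_join(R, S):
    """Return (ancestor_id, descendant_id) pairs for properly nested regions.

    R and S are lists of ((start, end), seg_id). Both are sorted by start,
    then merged while a stack keeps the ancestors that are still open.
    """
    R, S = sorted(R), sorted(S)
    out, stack, i = [], [], 0
    for (d_start, _d_end), d_id in S:
        # Push every candidate ancestor that starts before this descendant.
        while i < len(R) and R[i][0][0] < d_start:
            while stack and stack[-1][0][1] < R[i][0][0]:
                stack.pop()  # closed before the new ancestor opens
            stack.append(R[i])
            i += 1
        # Drop ancestors that ended before this descendant starts.
        while stack and stack[-1][0][1] < d_start:
            stack.pop()
        out.extend((a_id, d_id) for _region, a_id in stack)
    return out

R = first_regions(ancestor_extents)
S = first_regions(descendant_extents)
pairs = structural_join(R, S)
print(pairs)  # [('b1', 'd1'), ('b2', 'd2')]
```

Because each segment contributes a single region, the join input is proportional to the number of segments rather than the number of d-nodes, which is where the key observation pays off.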
17
Outline

Introduction
Disk-based F&B Index
Experiment
Conclusions

18
Experiments

Setup
DBLP/XMark/TreeBank
8 representative queries
Dim 1: PC/AD
Dim 2: Path/Twig
Dim 3: Large/Small
DFS, BFS, RangeFetch, SegSJ
NoK, TwigStack, Kaushik's algorithm in SIGMOD 04
Metric: time/PIO/LIO

19
Varying Buffer Size (PC-Path)
20
Varying Buffer Size (PC-Twig)
21
Varying Buffer Size (AD-Path)
22
Varying Buffer Size (AD-Twig)
23
Buffer Hit Ratio
24
Scalability
25
Comparing with Other Systems
26
Outline

Introduction
Disk-based F&B Index
Experiment
Conclusions

27
Conclusions

Disk-based F&B Index
Store and cluster the index on the disk
More efficient and intelligent query processing algorithms
Demonstrated good scalability and query efficiency
Expecting new query processing algorithms based on index probing (in addition to join-based approaches)

28
Q&A