Efficient Processing of XML Path Queries Using the Disk-based F&B Index

1
Efficient Processing of XML Path Queries Using
the Disk-based F&B Index
  • Wei Wang
  • University of New South Wales, Australia

With Hongzhi Wang (HIT), Hongjun Lu (HKUST),
Haifeng Jiang (IBM), Xuemin Lin (UNSW), Jianzhong
Li (HIT)
2
XML Query Processing
  • XML
  • Modeled as a labeled tree
  • Query by structural constraint
  • Simple Path Queries, e.g., //Customer//Name
  • Branching/Twig Queries, e.g.,
    //Customer[//Zipcode]//Name
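The two query classes can be illustrated on a tiny document. This is a minimal sketch using Python's standard xml.etree.ElementTree; the sample document and the function names are made up for illustration, and the branching query is assumed to mean "Name descendants of Customer elements that also have a Zipcode descendant".

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document (not from the slides).
DOC = """
<Customers>
  <Customer><Name>Ann</Name><Address><Zipcode>2052</Zipcode></Address></Customer>
  <Customer><Name>Bob</Name></Customer>
</Customers>
"""

def simple_path(root):
    """Simple path query: every Name element below some Customer element."""
    return [n.text for c in root.iter("Customer") for n in c.iter("Name")]

def twig(root):
    """Branching query: as above, but only for Customers that also
    have at least one Zipcode descendant."""
    return [
        n.text
        for c in root.iter("Customer")
        if next(c.iter("Zipcode"), None) is not None
        for n in c.iter("Name")
    ]

root = ET.fromstring(DOC)
print(simple_path(root))
print(twig(root))
```

The sketch assumes Customer elements are not nested inside each other; with nesting, a Name could be reported more than once.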

3
Index or Join?
Q1: /a/b
  • Index-based approaches
  • DataGuide, 1-index
  • F&B Index
  • and a few approximate indexes
  • Join-based approaches
  • Structural join
  • Twig join

[Figure: example tree with nodes labeled a, b, b]
Join-based approaches appear to be more actively
researched!
4
Outline
  • Introduction
  • Disk-based F&B Index
  • Experiment
  • Conclusions

5
XML Structural Indexes
  • Exact Indexes
  • 1-index
  • Based on backward bisimilarity
  • Covers all simple path queries
  • F&B Index
  • Based on backward and forward bisimilarity
  • Covers all branching queries (optimally)
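A minimal sketch of how the two partitions could be computed on a small, made-up data tree. It assumes tree-shaped data, where backward bisimilarity reduces to grouping nodes by their root-to-node label path; the forward-and-backward partition is obtained here by plain iterative refinement, which is not necessarily how the authors build it. Node ids, labels, and function names are illustrative.

```python
from collections import defaultdict

# Hypothetical data tree: node id -> (label, parent id or None).
TREE = {
    0: ("a", None),
    1: ("b", 0), 2: ("b", 0), 3: ("b", 0),
    4: ("c", 1), 5: ("d", 1),
    6: ("d", 2),
    7: ("c", 3), 8: ("d", 3),
}

def one_index(tree):
    """Backward bisimilarity on a tree: same root-to-node label path."""
    def path(n):
        label, parent = tree[n]
        return (path(parent) if parent is not None else ()) + (label,)
    return {n: path(n) for n in tree}

def fb_index(tree):
    """Refine by (own class, parent class, set of child classes)
    until the number of classes stops growing."""
    kids = defaultdict(list)
    for n, (_, p) in tree.items():
        if p is not None:
            kids[p].append(n)
    cls = {n: label for n, (label, _) in tree.items()}
    while True:
        ids = {}
        new = {}
        for n, (_, p) in tree.items():
            sig = (cls[n],
                   cls[p] if p is not None else None,
                   frozenset(cls[c] for c in kids[n]))
            new[n] = ids.setdefault(sig, len(ids))
        if len(ids) == len(set(cls.values())):
            return new  # no class was split in this round
        cls = new

print(len(set(one_index(TREE).values())), "classes in the 1-index")
print(len(set(fb_index(TREE).values())), "classes in the F&B index")
```

In this example the three b nodes share one 1-index class, but the forward condition separates the b node that has no c child, which in turn separates its d child from the other d nodes.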

6
A Running Example
Q1: /a/b
Q2: /a/b[d]
Q3: /a/b[c][d]
7
Problems with F&B Index?
  • Lack of scalability
  • Usually large in practice
  • No immediate solution when it cannot be
    accommodated in memory
  • Unbalanced, all-leaf-nodes tree
  • Naïve solutions (e.g., B-tree, pre-order
    clustering in Lore, subtree clustering in Natix)
    do not work well
  • Lack of efficiency
  • Non-deterministic searching
  • //-axis requires traversing the whole subtrees
  • Much more costly when the index is not in
    memory

8
Outline
  • Introduction
  • Disk-based F&B Index
  • Experiment
  • Conclusions

9
Disk-based F&B Index
  • Overcome the memory limit by storing the F&B
    index on disk
  • Naïve method does not work well

Q1: /a/b
10
Basic Idea
  • Moral: Clustering is important
  • Cluster by tag → tape
  • Cluster by parent → segment block
  • Cluster by 1-index ID → chunk
  • Benefits
  • Optimized tree traversals
  • Enable other intelligent algorithms
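The three clustering criteria can be pictured as nested grouping. The sketch below assumes one particular nesting (tape, then chunk, then segment) purely for illustration; the slide does not state the nesting order, and the IndexNode fields and sample values are hypothetical.

```python
from itertools import groupby
from typing import NamedTuple

class IndexNode(NamedTuple):
    node_id: int
    tag: str
    one_index_id: int
    parent_id: int  # -1 for the root (illustrative convention)

def cluster(nodes):
    """Group index nodes as tag -> 1-index id -> parent id -> [node ids]."""
    ordered = sorted(nodes, key=lambda n: (n.tag, n.one_index_id, n.parent_id))
    layout = {}
    for tag, tape in groupby(ordered, key=lambda n: n.tag):
        chunks = {}
        for oid, chunk in groupby(tape, key=lambda n: n.one_index_id):
            chunks[oid] = {
                pid: [n.node_id for n in segment]
                for pid, segment in groupby(chunk, key=lambda n: n.parent_id)
            }
        layout[tag] = chunks
    return layout

# Hypothetical index nodes.
NODES = [
    IndexNode(0, "a", 0, -1),
    IndexNode(1, "b", 1, 0),
    IndexNode(2, "b", 1, 0),
    IndexNode(3, "c", 2, 1),
    IndexNode(4, "d", 3, 1),
    IndexNode(5, "d", 3, 2),
]
layout = cluster(NODES)
print(layout["d"])
```

Because sibling index nodes with the same tag end up adjacent, a child step from one parent only needs to touch one contiguous group under this assumed layout.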

11
Q1: /a/b
12
Q.P. by Tree Traversal
  • Dim 1: DFS/BFS
  • Dim 2: Path/Branching Path
  • Dim 3: / or //

Q5: /a/b/c
Q2: /a/b[d]
Q4: /a//c
Problem: Still have to traverse the entire
subtrees to process //
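A minimal sketch of path evaluation by traversal over an in-memory index tree, showing why the // axis is expensive: a child step looks only at direct children, while a descendant step walks the whole subtree. The tree, node ids, and function names are hypothetical, and only simple (non-branching) paths are handled.

```python
# Hypothetical index tree: node id -> (tag, list of child ids).
INDEX = {
    0: ("a", [1, 2]),
    1: ("b", [3]),
    2: ("b", [4]),
    3: ("c", []),
    4: ("d", [5]),
    5: ("c", []),
}
ROOT = 0

def descendants(n):
    """Yield all proper descendants of n in depth-first order."""
    stack = list(reversed(INDEX[n][1]))
    while stack:
        m = stack.pop()
        yield m
        stack.extend(reversed(INDEX[m][1]))

def evaluate(steps):
    """steps is a list of (axis, tag) pairs with axis "/" or "//"."""
    frontier = None  # None stands for the virtual document root
    for axis, tag in steps:
        if frontier is None:
            cands = [ROOT] if axis == "/" else [ROOT, *descendants(ROOT)]
        elif axis == "/":
            cands = [c for n in frontier for c in INDEX[n][1]]
        else:
            cands = [d for n in frontier for d in descendants(n)]
        # Keep matching tags; drop duplicates while preserving order.
        frontier = list(dict.fromkeys(c for c in cands if INDEX[c][0] == tag))
    return frontier

print(evaluate([("/", "a"), ("/", "b"), ("/", "c")]))  # /a/b/c
print(evaluate([("/", "a"), ("//", "c")]))             # /a//c
```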
13
Q.P. by RangeFetch
  • H(1, c) → 3, 6
  • Keys have the form (chunkID, tagName)
Q4: /a//c
Restriction: Can only answer /p//q, where p is a
simple path.
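A rough sketch of the lookup idea, under loudly stated assumptions: the table H is modeled as a dictionary keyed by (chunkID, tagName), and the pair 3, 6 is assumed to be an inclusive range of positions in the tape for that tag. The slide does not define what the pair denotes, so the tape contents and this interpretation are hypothetical.

```python
# Hypothetical tape for tag "c": position -> index-node name.
TAPES = {"c": ["c0", "c1", "c2", "c3", "c4", "c5", "c6", "c7"]}

# Assumed meaning: (chunk id, tag name) -> inclusive position range.
H = {(1, "c"): (3, 6)}

def range_fetch(chunk_id, tag):
    """Return the contiguous run of index nodes recorded for the key,
    or an empty list when the key is absent."""
    rng = H.get((chunk_id, tag))
    if rng is None:
        return []
    lo, hi = rng
    return TAPES[tag][lo:hi + 1]

# For /a//c: suppose the simple path /a resolves to chunk 1.
print(range_fetch(1, "c"))
```

Under this reading, a // step costs one table probe plus one sequential read instead of a subtree traversal, which is consistent with the restriction that the prefix must be a simple path.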
14
More Data Structures
  • 3 more tapes
  • Add region code for each d-node in the extents →
    Extents Tape
  • Use physical (start, end) codes
  • Sort d-nodes according to (start, end)
  • Add Doc Tape
  • Add Value Tape
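A minimal sketch of region coding, assuming a simple counter that is incremented on entering and leaving each node. The slides mention physical (start, end) codes; the counter here is only a stand-in, and the tree and names are hypothetical.

```python
def assign_regions(tree, root):
    """tree maps a node to its list of children.
    Returns node -> (start, end) with nested intervals."""
    codes = {}
    counter = 0

    def visit(n):
        nonlocal counter
        counter += 1
        start = counter
        for child in tree.get(n, []):
            visit(child)
        counter += 1
        codes[n] = (start, counter)

    visit(root)
    return codes

def is_ancestor(a, d):
    """a and d are (start, end) codes; True if a strictly contains d."""
    return a[0] < d[0] and d[1] < a[1]

# Hypothetical data tree.
TREE = {"a": ["b", "c"], "b": ["d"]}
codes = assign_regions(TREE, "a")
by_region = sorted(codes, key=lambda n: codes[n])  # sort by (start, end)
print(codes)
print(by_region)
```

With such codes, an ancestor-descendant test becomes an interval containment check, which is what structural joins rely on.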

15
Example
16
SegSJ
  • Key observation
  • Structural relationship between two segments can
    be inferred from the relationship between the
    first d-nodes in their extents.
  • SegSJ(/p//q)
  • R(s, e) ← A: /p
  • S(s, e) ← D: //q
  • Structural join R and S
  • Using partition-based or sorting-based SJ
    algorithm

b1 → (10, 78), (210, 297), ...
d1 → (19, 25), (54, 66), ...
Take the (s, e) of the first d-node in each
segment
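A minimal sketch of a sorting-based structural join over (start, end) codes, using the four codes shown above as input. It is a generic stack-based merge, not necessarily the algorithm used in SegSJ, and it assumes both inputs are sorted by start and that regions are properly nested. Function and variable names are illustrative.

```python
def structural_join(ancestors, descendants):
    """Return (a, d) pairs where region a contains region d.
    Both inputs are lists of (start, end) sorted by start."""
    out, stack, i = [], [], 0
    for d in descendants:
        # Push every ancestor candidate that starts before d.
        while i < len(ancestors) and ancestors[i][0] < d[0]:
            # Drop candidates that end before the new one starts.
            while stack and stack[-1][1] < ancestors[i][0]:
                stack.pop()
            stack.append(ancestors[i])
            i += 1
        # Drop candidates that end before d starts.
        while stack and stack[-1][1] < d[0]:
            stack.pop()
        # Everything left on the stack contains d.
        out.extend((a, d) for a in stack)
    return out

R = [(10, 78), (210, 297)]  # first d-node codes of the b1 segments
S = [(19, 25), (54, 66)]    # first d-node codes of the d1 segments
pairs = structural_join(R, S)
print(pairs)
```

Each input list is scanned once, so the join works on sequentially read segment codes without traversing the index tree.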
17
Outline
  • Introduction
  • Disk-based F&B Index
  • Experiment
  • Conclusions

18
Experiments
  • Setup
  • DBLP/XMark/TreeBank
  • 8 representative queries
  • Dim 1: PC/AD
  • Dim 2: Path/Twig
  • Dim 3: Large/Small
  • DFS, BFS, RangeFetch, SegSJ
  • NoK, TwigStack, Kaushik's algorithm in
    [SIGMOD 04]
  • Metric: time/PIO/LIO

19
Varying Buffer Size (PC-Path)
20
Varying Buffer Size (PC-Twig)
21
Varying Buffer Size (AD-Path)
22
Varying Buffer Size (AD-Twig)
23
Buffer Hit Ratio
24
Scalability
25
Comparing with Other Systems
26
Outline
  • Introduction
  • Disk-based F&B Index
  • Experiment
  • Conclusions

27
Conclusions
  • Disk-based F&B Index
  • Store and cluster the index on the disk
  • More efficient and intelligent query processing
    algorithms
  • Demonstrated good scalability and query
    efficiency
  • Expecting new query processing algorithms based
    on index probing (in addition to join-based
    approaches)

28
Q&A
  • Thank You!

29
Related Work
  • Indexes
  • Exact: DataGuide, 1-index, F&B Index
  • Approximate: Approx. DataGuide, A(k)-index,
    D(k)-index, M(k)-index
  • Join-based approaches
  • Hybrid approach: mixed-mode in [VLDB 03]
  • Niagara
  • [VLDB 03] combines tree traversals and joins
  • [SIGMOD 04] uses 1-index to accelerate joins
  • Clustering
  • Lore: pre-order
  • Natix: subtree