Twig2Stack: Bottom-up Processing of Generalized-Tree-Pattern Queries over XML Documents

Slides: 17

Provided by: neccorp

Transcript and Presenter's Notes

1
Twig2Stack: Bottom-up Processing of Generalized-Tree-Pattern Queries over XML Documents

Songting Chen, Hua-Gang Li, Junichi Tatemura, Wang-Pin Hsiung, Divyakant Agrawal and K. Selcuk Candan
NEC Laboratories America
University of California, Santa Barbara

2
Background

XML
  Hierarchical (tree) structured data
  Provides flexibility to model semi-structured data
  Widely accepted as a universal data exchange format
Query over XML
  XPath, XQuery (W3C)
  Extensively used by many applications
  Adopted by a number of commercial systems

3
State-of-the-art XML Query Processing
Algebraic Approach
  Binary structural joins [Timber]: large intermediate results
  Optimize multiple path expressions of XQuery [Chen et al.]: expensive post-processing
Holistic Approach
  PathStack [Bruno et al.]
  TwigStack [Bruno et al.]
  Twig2Stack
4
Processing Generalized Tree Pattern (GTP) Queries
Structural Joins
Structural Outer Joins

Grouping
Duplicate Elimination
Sort
Our goal: avoid ALL of these!

XQuery: FOR $b in //A[E]/B, $d in $b/D LET $c := $b/C RETURN $b, $c, $d

[Figure: GTP with query nodes A, B, C, D; example matches for //A//B and //A/B over elements a1, a2, b1, b2]
5
Motivation: PathStack [Bruno et al.]

Query: //A//B
Key observation: minimize intermediate results through a compact representation of path matches, by recording
  Inter-node: the ancestor-descendant (AD) relationship between elements in different query nodes, e.g., b1 → a2, b2 → a2
  Intra-node: the AD relationship between elements within the same query node, e.g., b1, b2
TwigStack [Bruno et al.] minimizes intermediate results by outputting only those path matches that are in the final twig results
  However, such optimality cannot be guaranteed [Choi et al.]
  Not helpful for processing GTP queries
Question: can we minimize intermediate results for twig queries through a compact result encoding (similar to PathStack)?
  Useful for processing GTP queries as well?
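
The compact encoding described on this slide can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' code: the `Elem` tuple, the function name `path_stack_a_b`, and the (left, right) positions are hypothetical, chosen so that the elements nest as a1 > a2 > b1 > b2.

```python
from collections import namedtuple

# Hypothetical element record: name, tag, and (left, right) document positions.
Elem = namedtuple("Elem", "name tag left right")

def path_stack_a_b(elements):
    """Encode matches of //A//B with two stacks linked by pointers."""
    stacks = {"A": [], "B": []}
    pointers = {}   # inter-node edge: B element -> top of SA when it was pushed
    matches = []    # expanded (a, b) pairs, listed only to show what is encoded
    for e in sorted(elements, key=lambda x: x.left):  # document order
        # Pop every stacked element that ends before e starts.
        for stack in stacks.values():
            while stack and stack[-1].right < e.left:
                stack.pop()
        if e.tag == "A":
            stacks["A"].append(e)
        elif e.tag == "B" and stacks["A"]:
            top = len(stacks["A"]) - 1
            pointers[e.name] = stacks["A"][top].name
            stacks["B"].append(e)
            # Every A at or below the pointer is an ancestor of e.
            matches.extend((a.name, e.name) for a in stacks["A"][: top + 1])
    return pointers, matches

# Assumed data: a1 > a2 > b1 > b2 (positions are illustrative only).
DATA = [Elem("a1", "A", 1, 8), Elem("a2", "A", 2, 7),
        Elem("b1", "B", 3, 6), Elem("b2", "B", 4, 5)]
POINTERS, MATCHES = path_stack_a_b(DATA)
```

Under these assumptions, two pointers (b1 → a2, b2 → a2) plus the stack order stand for four path matches, which is the sense in which the representation is compact.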

[Figure: example data elements a1, a2, b1, b2 and the stacks SA and SB for the query //A//B]
6
Hierarchical Stack Encoding

Inter-node (//A//B)
  Can still use explicit edges
Intra-node (A)
  Matching elements form a tree structure as well
  Associate each query node with a hierarchical stack
Push element e into hierarchical stack HS[E] iff e satisfies the sub-twig query rooted at E
  Matching can be determined once the entire sub-tree of e has been seen
  Requires post-order document traversal
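
The push rule above can be illustrated with a small post-order matcher. This is a simplified sketch, not the Twig2Stack implementation: it keeps one flat list per query node instead of merged stack trees, it assumes each tag occurs once in the query, and the twig //A/B[.//C][.//D] and the document shape are my reading of the running example on the next slide. The names `QUERY`, `node`, and `visit` are hypothetical.

```python
# Assumed twig //A/B[.//C][.//D]: query tag -> list of (axis, child query tag).
QUERY = {"A": [("child", "B")], "B": [("desc", "C"), ("desc", "D")],
         "C": [], "D": []}

def node(name, *children):
    """Document element; its tag is the upper-cased first letter of its name."""
    return (name, name[0].upper(), children)

# Assumed document shape, taken from the running example.
DOC = node("a1",
           node("a2",
                node("b1",
                     node("d1", node("b2", node("d2"), node("c1"))),
                     node("c2"))),
           node("b3", node("d3")))

def visit(elem, hs):
    """Post-order visit; returns (tags elem itself satisfies, tags satisfied below it)."""
    name, tag, children = elem
    child_tags, desc_tags = set(), set()
    for child in children:
        own, below = visit(child, hs)
        child_tags |= own
        desc_tags |= own | below
    own = set()
    # Push e into HS[E] iff every query edge below E is satisfied in e's sub-tree.
    if tag in QUERY and all(
            (q in child_tags) if axis == "child" else (q in desc_tags)
            for axis, q in QUERY[tag]):
        hs[tag].append(name)
        own.add(tag)
    return own, desc_tags

HS = {tag: [] for tag in QUERY}
visit(DOC, HS)
```

With these assumptions a1 and b3 are never pushed: b3 has no C descendant, so a1 has no matching B child. The decision for each element is made only after its whole sub-tree has been visited, which is why the traversal has to be post-order.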

[Figure: matching elements a1, a2, a3, a4 and the corresponding hierarchical stack HS[A]]
7
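Twig2Stack: Running Example

[Figure: query twig with nodes A, B, C, D, and a document tree whose elements carry (left, right, level) region encodings: a1 (1,20,1), a2 (2,15,2), b3 (16,19,2), b1 (3,14,3), d3 (17,18,3), d1 (4,11,4), c2 (12,13,4), b2 (5,10,5), c1 (8,9,6), d2 (6,7,6); hierarchical stacks HS[A] (a2) and HS[B] (b1, b2)]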
Merging Stacks
TwigStack needs to enumerate 3 matches for //A/B//D and 2 for //A/B//C, then join them together.
Twig2Stack requires neither path joins nor path enumeration!
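
The match counts quoted here can be checked by brute force from the region encodings. The sketch below is neither TwigStack nor Twig2Stack; it only tests containment of (left, right, level) triples. The pairing of labels with triples is my reading of the flattened figure, the twig //A/B[.//C][.//D] is inferred from the two paths named on the slide, and all function names are hypothetical.

```python
from itertools import product

# (left, right, level) per element; the label-to-triple pairing is an assumption.
REGION = {
    "a1": (1, 20, 1), "a2": (2, 15, 2), "b3": (16, 19, 2),
    "b1": (3, 14, 3), "d3": (17, 18, 3), "d1": (4, 11, 4),
    "c2": (12, 13, 4), "b2": (5, 10, 5), "c1": (8, 9, 6), "d2": (6, 7, 6),
}

def by_tag(tag):
    """Element names whose first letter matches the query tag, in document order."""
    names = [n for n in REGION if n[0].upper() == tag]
    return sorted(names, key=lambda n: REGION[n][0])

def is_ancestor(a, d):
    """a is an ancestor of d iff a's interval strictly contains d's."""
    return REGION[a][0] < REGION[d][0] and REGION[d][1] < REGION[a][1]

def is_parent(p, c):
    """Parent-child is ancestor-descendant with a level difference of one."""
    return is_ancestor(p, c) and REGION[c][2] == REGION[p][2] + 1

def path_matches(leaf_tag):
    """All matches of //A/B//<leaf_tag>."""
    return [(a, b, x)
            for a, b, x in product(by_tag("A"), by_tag("B"), by_tag(leaf_tag))
            if is_parent(a, b) and is_ancestor(b, x)]

def twig_matches():
    """All matches of the assumed twig //A/B[.//C][.//D]."""
    return [(a, b, c, d)
            for a, b, c, d in product(by_tag("A"), by_tag("B"),
                                      by_tag("C"), by_tag("D"))
            if is_parent(a, b) and is_ancestor(b, c) and is_ancestor(b, d)]
```

Under these assumptions `path_matches("D")` yields three tuples and `path_matches("C")` two, in line with the slide, while only the four combinations under (a2, b1) survive as twig matches. The path match through b3 and d3 is the kind of intermediate result that a path-based approach enumerates and later discards in the join.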
[Figure: hierarchical stacks HS[C] (c1, c2) and HS[D] (d1, d2, d3)]
8
GTP Result Enumeration

Bottom-up computation vs. top-down enumeration
  Visit only those elements that are in the twig matches
Handling grouping results
  Automatic grouping through inter-node edges
Handling duplicates and out-of-order results
  Problems come from non-return nodes
  If D is a return node while B is not:
  b1 → d1, d2, d3 and b2 → d2, d3 (duplicates)
  Observation: the intra-node hierarchy provides hints
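
One way to read the observation above, sketched in Python. This is my interpretation, not something the slide spells out: if b2 is nested inside b1 and the B-D edge is ancestor-descendant, every D reachable from b2 is also reachable from b1, so enumerating only from the outermost B elements avoids the duplicates. The nesting of b2 in b1 and the names `B_TO_D`, `B_PARENT`, `naive`, and `via_roots` are assumptions.

```python
# D elements reachable from each B element, as listed on the slide.
B_TO_D = {"b1": ["d1", "d2", "d3"], "b2": ["d2", "d3"]}

# Assumed intra-node hierarchy: b2 is nested inside b1, b1 is outermost.
B_PARENT = {"b1": None, "b2": "b1"}

def naive(b_to_d):
    """Enumerate D for every B element; nested B elements repeat results."""
    return [d for b in b_to_d for d in b_to_d[b]]

def via_roots(b_to_d, parent):
    """Enumerate D only from B elements that have no B ancestor."""
    return [d for b in b_to_d if parent[b] is None for d in b_to_d[b]]
```

Here `naive(B_TO_D)` returns five entries with d2 and d3 repeated, while `via_roots(B_TO_D, B_PARENT)` returns d1, d2, d3 once each.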

[Figure: hierarchical stacks with elements a4, b1, b2, c1, c2, d1, d2, d3]
9
Experiment Setup

Implementation
  Twig2Stack: Java 1.4.2
  TwigStack, TJFast: Java 1.4.2, kindly provided by Jiaheng Lu from the National University of Singapore (NUS)
Datasets
  XMark, DBLP, TreeBank
Metrics
  Query processing time
  I/O time

10
Processing Full Twig Queries
Optimization of query processing: TwigStack, Twig2Stack
Optimization of I/O: TJFast
11
Not yet done: Memory Usage

Hierarchical stack encoding could hold the entire document in memory in the worst case
Unlike the DOM approach, only matches need to be stored
  Tag match
  (Partial) twig match
  Predicate evaluation
Early result enumeration dramatically reduces the memory usage
  Enumerate query results before the end of the document and release the buffer
  Main idea: hybrid of top-down (PathStack) and bottom-up (Twig2Stack) approaches

12
Early Result Enumeration (ERM)

Enumerate results and release the buffer when elements in the top-branch node are popped from PathStack
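
The memory effect of flushing at the top-branch node can be illustrated with a toy event stream. This sketch is a strong simplification: it buffers every element under the chosen tag rather than only twig matches, and it stands in for the PathStack pop with a plain counter of open elements. The function names, the event format, and the tiny DBLP-like document are all hypothetical.

```python
def element(name, tag, *children):
    """Flatten one element and its children into start/end events."""
    events = [("start", name, tag)]
    for child in children:
        events.extend(child)
    events.append(("end", name, tag))
    return events

def early_enumeration(events, top_tag):
    """Flush the buffer whenever the outermost open `top_tag` element closes."""
    open_top = 0   # number of currently open top_tag elements
    buffer = []    # elements buffered since the last flush
    batches = []   # one list per flush
    peak = 0       # largest buffer size observed
    for kind, name, tag in events:
        if kind == "start":
            if tag == top_tag:
                open_top += 1
            if open_top:
                buffer.append(name)
                peak = max(peak, len(buffer))
        elif tag == top_tag:
            open_top -= 1
            if open_top == 0:
                batches.append(buffer)
                buffer = []
    return batches, peak

# Assumed miniature document with two articles.
DOC = element("dblp", "dblp",
              element("article1", "article",
                      element("title1", "title"), element("year1", "year")),
              element("article2", "article",
                      element("title2", "title"), element("year2", "year")))

SMALL = early_enumeration(DOC, "article")  # flush per article
LARGE = early_enumeration(DOC, "dblp")     # flush only at the document root
```

With `article` as the flush point the buffer never holds more than three elements, whereas flushing at `dblp` keeps all seven until the end of the document. How much early enumeration helps therefore depends on how large the sub-trees under the top-branch node are.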

[Figure: running example repeated: query nodes A, B, C, D and the region-encoded document tree with elements a1, a2, b1, b2, b3, c1, c2, d1, d2, d3]
13
Memory Usage
[Figure: two query trees. DBLP (dblp, article, title, year): small sub-tree. XMark (site, open_auctions, bid, reserve, bidder, increase): huge sub-tree.]
14
Conclusions and Future Work