Title: Twig2Stack: Bottom-up Processing of Generalized-Tree-Pattern Queries over XML Documents
1 Twig2Stack: Bottom-up Processing of Generalized-Tree-Pattern Queries over XML Documents
- Songting Chen, Hua-Gang Li, Junichi Tatemura, Wang-Pin Hsiung, Divyakant Agrawal, and K. Selçuk Candan
- NEC Laboratories America
- University of California, Santa Barbara
2 Background
- XML
- Hierarchical (tree) structured data
- Provide flexibility to model semi-structured data
- Widely accepted as universal data exchange format
- Query over XML
- XPath, XQuery (W3C)
- Extensively used by many applications
- Adopted by a number of commercial systems
3 State-of-the-art XML Query Processing
- Algebraic approach
- Binary structural joins (Timber): large intermediate results
- Optimizing multiple path expressions of XQuery (Chen et al.): expensive post-processing
- Holistic approach
- PathStack (Bruno et al.)
- TwigStack (Bruno et al.)
- Twig2Stack
4 Processing Generalized Tree Pattern (GTP) Queries
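- Operations involved in processing a GTP query: structural joins, structural outer joins, grouping, duplicate elimination, sort
- Our goal: avoid ALL of these!
- XQuery: FOR $b in //AE/B, $d in $b/D LET $c := $b/C RETURN $b, $c, $d
[Figure: GTP with query nodes A, B, C, D, and sample results of //A//B and //A/B over elements a1, a2, b1, b2]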
5 Motivation: PathStack (Bruno et al.)
- Query: //A//B; data shown in the figure
- Key observation: minimize intermediate results through a compact representation of path matches, by
- Inter-node: record the AD (ancestor-descendant) relationship between elements in different query nodes, e.g., b1 → a2, b2 → a2
- Intra-node: record the AD relationship between elements within the same query node, e.g., b1, b2
- TwigStack (Bruno et al.) minimizes intermediate results by
- Outputting only those path matches that are in the final twig results
- However, such optimality cannot be guaranteed (Choi et al.)
- Not helpful for processing GTP queries
- Question: can we minimize intermediate results for twig queries through compact result encoding (similar to PathStack)?
- Useful for processing GTP queries as well?
[Figure: document elements a1, a2, b1, b2 and the PathStack stacks SA and SB]
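The compact encoding on this slide can be illustrated with a minimal sketch for the single query //A//B. It is not the PathStack algorithm as published, only a simplified two-stack version; the function name `path_stack_ad` and the (start, end) positions in `doc` are assumptions made for illustration, chosen so that a1 contains a2, which contains b1, which contains b2.

```python
def path_stack_ad(elements):
    """Simplified PathStack-style pass for the query //A//B.

    elements: (name, tag, start, end) tuples in document order (sorted by start).
    Returns the inter-node pointers and the expanded (a, b) matches.
    """
    stack_a, stack_b = [], []      # SA and SB: one stack per query node
    pointers, matches = {}, []
    for name, tag, start, end in elements:
        # Pop every element that ended before the current one starts.
        for stack in (stack_a, stack_b):
            while stack and stack[-1][2] < start:
                stack.pop()
        if tag == "A":
            stack_a.append((name, start, end))
        elif tag == "B" and stack_a:
            stack_b.append((name, start, end))
            # Inter-node edge: one pointer to the top of SA encodes all
            # ancestors, because everything below the top is an ancestor too.
            pointers[name] = stack_a[-1][0]
            matches.extend((a[0], name) for a in stack_a)
    return pointers, matches


# Hypothetical positions (not taken from the slides).
doc = [("a1", "A", 1, 10), ("a2", "A", 2, 9), ("b1", "B", 3, 6), ("b2", "B", 4, 5)]
pointers, matches = path_stack_ad(doc)
print(pointers)
print(matches)
```

Two pointers (b1 → a2, b2 → a2) stand in for four path matches; the saving grows with the nesting depth of the A elements.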
6 Hierarchical Stack Encoding
- Inter-node: //A//B
- Can still use explicit edges
- Intra-node: A
- Matching elements form a tree structure as well
- Associate each query node with a hierarchical stack
- Push element e into hierarchical stack HSE iff e satisfies the sub-twig query rooted at E
- Matching can be determined once the entire sub-tree of e has been seen
- Requires post-order document traversal
[Figure: document elements a1, a2, a3, a4 and their tree arrangement in hierarchical stack HSA]
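The push rule above can be sketched as follows. This is a simplified illustration, not the authors' implementation: it supports only ancestor-descendant (//) query edges, omits the inter-node edges, and the names `Elem`, `StackTree`, `build_hierarchical_stacks`, and the sample positions are all hypothetical.

```python
from collections import namedtuple

Elem = namedtuple("Elem", "name tag start end")


class StackTree:
    """One entry of a hierarchical stack: an element plus its nested entries."""

    def __init__(self, elem, children):
        self.elem = elem
        self.children = children


def build_hierarchical_stacks(elements, query_children):
    """query_children maps a query tag to its child tags (AD edges only)."""
    hs = {tag: [] for tag in query_children}
    # Sorting by end position visits the document in post-order.
    for e in sorted(elements, key=lambda x: x.end):
        if e.tag not in hs:
            continue
        # e satisfies its sub-twig iff every child query node already holds
        # a match inside e. In post-order, "inside e" means start > e.start.
        satisfied = all(
            any(t.elem.start > e.start for t in hs[m])
            for m in query_children[e.tag]
        )
        if not satisfied:
            continue
        roots = hs[e.tag]
        inside = [t for t in roots if t.elem.start > e.start]
        outside = [t for t in roots if t.elem.start < e.start]
        # Merge: existing entries nested in e become its children.
        hs[e.tag] = outside + [StackTree(e, inside)]
    return hs


def as_nested(trees):
    return [(t.elem.name, as_nested(t.children)) for t in trees]


# Hypothetical document: a1 contains a2 (which contains b1) and a3.
doc = [Elem("a1", "A", 1, 8), Elem("a2", "A", 2, 5),
       Elem("b1", "B", 3, 4), Elem("a3", "A", 6, 7)]
hs = build_hierarchical_stacks(doc, {"A": ["B"], "B": []})
print(as_nested(hs["A"]), as_nested(hs["B"]))
```

In this sample, a3 has no B descendant and is never pushed, while a2 ends up nested under a1 inside the stack for A, which is the tree structure the intra-node bullet refers to.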
7 Twig2Stack Running Example
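[Figure: twig query with nodes A, B, C, D; document tree with elements a1, a2, b1, b2, b3, c1, c2, d1, d2, d3 labeled with the region encodings (1,20,1), (2,15,2), (16,19,2), (17,18,3), (3,14,3), (12,13,4), (4,11,4), (5,10,5), (8,9,6), (6,7,6); hierarchical stacks HSA and HSB]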
- Merging stacks
- TwigStack needs to enumerate 3 matches for //A/B//D and 2 for //A/B//C, then join them together
- Twig2Stack requires neither path joins nor path enumeration!
[Figure: hierarchical stacks HSC (c1, c2) and HSD (d1, d2, d3)]
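The three-number labels in the running example appear to be (start, end, level) region encodings, which is an assumption here since the slide does not name the fields. Under that reading, the structural tests behind the // and / edges reduce to interval containment plus a level check. The helper names below are hypothetical, and the element-to-label mapping is as read from the figure.

```python
def is_ancestor(a, d):
    """True if region a strictly contains region d (AD relationship)."""
    return a[0] < d[0] and d[1] < a[1]


def is_parent(p, c):
    """True if p contains c and c is exactly one level deeper (PC relationship)."""
    return is_ancestor(p, c) and c[2] == p[2] + 1


# Labels as read from the running example (assumed mapping).
a1, a2 = (1, 20, 1), (2, 15, 2)
b1, b3 = (3, 14, 3), (16, 19, 2)

print(is_ancestor(a1, b1), is_parent(a1, b1))  # a1//b1 holds, a1/b1 does not
print(is_parent(a2, b1), is_parent(a1, b3))    # a2/b1 and a1/b3 both hold
print(is_ancestor(a2, b3))                     # b3 lies outside a2
```

Both tests are constant-time comparisons, so deciding whether an element may be pushed onto a hierarchical stack needs no tree navigation.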
8 GTP Result Enumeration
- Bottom-up computation vs. top-down enumeration
- Visit only those elements that are in the twig matches
- Handling grouping results
- Automatic grouping through inter-node edges
- Handling duplicates and out-of-order results
- Problems come from non-return nodes
- If D is a return node while B is not
- b1 → d1, d2, d3 and b2 → d2, d3 (duplicates)
- Observation: the intra-node hierarchy provides hints
[Figure: elements b1, b2, c1, c2, d1, d2, d3 in their hierarchical stacks]
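One way to read the hint in the last bullet, offered as an assumption since the slide does not spell out the rule: when B is a non-return node connected to D by a // edge and b2 is nested under b1 in the stack for B, every D reachable from b2 is also reachable from b1, so enumeration can start from the top-level B entries only. The sketch below contrasts this with a naive walk; the data layout and function names are hypothetical.

```python
def naive_enumerate(b_trees, d_of):
    """Visit every B entry and emit its D matches; nested entries repeat results."""
    out = []

    def walk(tree):
        name, children = tree
        out.extend(d_of[name])
        for child in children:
            walk(child)

    for tree in b_trees:
        walk(tree)
    return out


def hierarchy_aware_enumerate(b_trees, d_of):
    """Emit D matches from top-level B entries only (assumes a // edge B//D)."""
    out = []
    for name, _children in b_trees:
        out.extend(d_of[name])
    return out


# Stack for B: b2 nested under b1; D matches per B element as on the slide.
b_trees = [("b1", [("b2", [])])]
d_of = {"b1": ["d1", "d2", "d3"], "b2": ["d2", "d3"]}

print(naive_enumerate(b_trees, d_of))
print(hierarchy_aware_enumerate(b_trees, d_of))
```

The naive walk emits d2 and d3 twice, while the hierarchy-aware version emits each D element once without a separate duplicate-elimination or sort step. This shortcut would not hold for a parent-child edge, where nested B elements can have children of their own.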
9 Experiment Setup
- Implementation
- Twig2Stack: Java 1.4.2
- TwigStack, TJFast: Java 1.4.2
- Kindly provided by Jiaheng Lu from the National University of Singapore (NUS)
- Datasets
- XMark, DBLP, TreeBank
- Metrics
- Query processing time
- IO time
10 Processing Full Twig Queries
[Figure: performance comparison of TwigStack, Twig2Stack, and TJFast; annotations: optimization of query processing, optimization of IO]
11 Not Yet Done: Memory Usage
- Hierarchical stack encoding could hold the entire document in memory in the worst case
- Unlike the DOM approach, only matches need to be stored
- Tag match
- (Partial) twig match
- Predicate evaluation
- Early result enumeration dramatically reduces memory usage
- Enumerate query results before the end of the document and release the buffer
- Main idea: a hybrid of top-down (PathStack) and bottom-up (Twig2Stack) approaches
12 Early Result Enumeration (ERM)
- Enumerate results and release buffer when
elements in top-branch node are popped from
PathStack
[Figure: query nodes A, B, C, D over the running-example document (elements a1, a2, b1, b2, b3, c1, c2, d1, d2, d3 with their region labels)]
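The idea of releasing the buffer before the end of the document can be sketched for the query //A//B. This is a deliberately simplified illustration under stated assumptions: results are flushed when the outermost open A element closes, the event tuples and the function name `early_enumeration` are hypothetical, and the real scheme operates on hierarchical stacks rather than a flat list of pairs.

```python
def early_enumeration(events):
    """Emit buffered (a, b) results each time the outermost open A closes.

    events: ("start" | "end", tag, name) tuples in document order.
    Returns the list of result batches in the order they were released.
    """
    open_a = []     # currently open A elements (the top-down, PathStack-like part)
    buffered = []   # results held until it is safe to output them
    batches = []
    for kind, tag, name in events:
        if kind == "start" and tag == "A":
            open_a.append(name)
        elif kind == "end" and tag == "B":
            # b is fully seen; pair it with every A that is still open.
            buffered.extend((a, name) for a in open_a)
        elif kind == "end" and tag == "A":
            open_a.pop()
            if not open_a and buffered:
                batches.append(buffered)   # enumerate results ...
                buffered = []              # ... and release the buffer
    return batches


# Hypothetical fragment: <a1><b1/></a1><a2><a3><b2/></a3></a2>
events = [
    ("start", "A", "a1"), ("start", "B", "b1"), ("end", "B", "b1"), ("end", "A", "a1"),
    ("start", "A", "a2"), ("start", "A", "a3"), ("start", "B", "b2"),
    ("end", "B", "b2"), ("end", "A", "a3"), ("end", "A", "a2"),
]
batches = early_enumeration(events)
print(batches)
```

The buffer never holds more than the matches under one outermost A element, which is why the size of the sub-tree under the top branch node drives memory usage.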
13 Memory Usage
[Figure: memory usage for two query trees. Query rooted at dblp (article, title, year): small sub-tree. Query rooted at site (open_auctions, bid, reserve, bidder, increase): huge sub-tree]
14 Conclusions and Future Work
- Proposed a bottom-up GTP processing solution
- A twig encoding scheme
- A GTP enumeration algorithm that avoids any post-processing operations
- A hybrid scheme to reduce memory usage
- Future directions
- Handling worst case memory issues
- Optimizing IO cost by exploiting indexes
- Handling other axes, full XQuery, graph input
- Handling XML streams
16 Processing GTP
- Optimization of non-return nodes
- Automatic grouping