Title: Complex Group-By Queries for XML
1Complex Group-By Queries for XML
- C. Gokhale, N. Gupta, P. Kumar, L.V.S.
Lakshmanan, R. Ng, and B.A. Prakash - University of British Columbia, Vancouver
- Indian Institute of Technology, Bombay
- Cornell University, Ithaca
2Why this paper?
- Of late, there is considerable interest in
analytical processing of XML data. - Even for data integration, there is a compelling
need for performing various group-by style
aggregate operations. - A core operator needed for analytics is the
group-by operator, which is widely used in
relational as well as OLAP database applications.
3Why this paper?
- XQuery requires group-by operations to be
simulated using nesting. - State of current research
- Provide support for grouping at the logical or
physical level and recognize grouping operations
from nested queries and rewrite them with
grouping operations - Extend XQuery FLWOR expressions with explicit
constructs similar to the group-by, order-by and
having clauses in SQL - Direct algorithmic support for a group-by
operator is not explored.
4Overview
- Nested Aggregation
- Expressing complex aggregation queries
- Having clause
- Moving Windows
- Related Work
- Algorithm MERGE-GB
- Extensions for disk-based
- Dealing with nesting
- Dealing with having clause
- Dealing with moving windows
- Experimental results
5Syntax
Median of Sold
Outer grouping attribute
Average of price
Inner grouping attribute
6Syntax
7Syntax
- group //Book
- by //Name return (
- //Name, avg(/Price), count()
- then by /Year return (
- /Year, median(/Sold)
- )
- )
8General Framework
- a is an absolute XPath expression, while ßs and
?s are relative to a. - aggouts and aggins are aggregation operations
such as min(), count(), avg(), median(). - Nested aggregations medianMax(), minCount()
- Cons, AggConsout and AggConsin are sets of
conditions. - mwouts and mwins denote moving window
specifications. - mwi (width, step, winType, domType)
9Nested Aggregation Features
- Having Clause
- aggi(?i) ?i ci, where ?i in , ?, gt,lt,, , and
ci is a constant. - Moving Windows
- (width, step, winType, domType)
- group //Book
- by //Name
- having count() 100 return (
- //Name, avg(/Price), count()
- then by /Year return (
- /Year(10,5), median(/Sold)
- )
- )
10Related Work
- 2, 3 give syntactic extensions to FLOWR
expressions - Discuss related analytics (moving window
aggregations and cube) - Do NOT discuss algorithms for directly computing
group-bys - Support for group-by at logical/physical level
6 - 4, 5, 9, 12 detect group-bys from nested
queries - Challenging and queries are hard to express and
understand. - Group-by and cube queries for relational data 8,
10. - Algorithms not directly applicable to
hierarchical data - Group-by elements involve combination of forward
and backward axes - Aggregations on values nested and occur at
multiple levels
11Merge-GB
- Overview
- No support for nesting, having, moving windows
- Three phase process
- Initialization
- Prune nodes other than a, ß and ? result in a
Canonical Tree - Counter Initialization Associate counter with
edge based on aggregation operator - Count() counter is a scalar with value 1
- Spread() counter is a range of type min, max
- Median() counter is a frequency table
- Merge Phase
- Answer Extraction
- Running query
- group //Book by //Year
- return Year, Median(Sold), MedianCount(Author),
Spread(price), count()
12Merge-GB Initialization
Catalogue
Subject
Subject
Music CD
Book
Name
Book
Book
Name
Book
PubInfo
Title Mechanics
Author Smith
Year 1999
Sold 5600
Price 60
Author Newton
PubInfo
Title Sound Waves
Year 2000
Sold 900
Price 96
Author McManus
PubInfo
Title Carbon Comps
Year 1999
Sold 8000
Price 25
Author Johnson
Name Kaufman
City NY
Name Wesley
City LA
Name Kaufman
City NY
13Merge-GB Initialization
1
1
1
1
Book
Book
Book
Book
60,60
96,96
25,25
1
1
1
(5600,1)
1
(900,1)
(8000,1)
Author
Year 1999
Sold
Price
Author
Year 2000
Sold
Price
Author
Year 1999
Sold
Price
Author
14Merge-GB Initialization
1
1
1
1
Book (2000)
Book (1999)
Book (1999)
Book
60,60
96,96
25,25
1
1
1
(5600,1)
1
(900,1)
(8000,1)
Author
Sold
Price
Author
Sold
Price
Author
Sold
Price
Author
15Merge-GB Merge Phase
- Repeated merging of nodes
- Based on equality of associated values (eff
group-by labels) - Counters are updated
- Child nodes are merged analogously
16Merge-GB Merge Phase
1
1
1
1
Book (2000)
Book (1999)
Book (1999)
Book
60,60
96,96
25,25
1
1
1
lt5600,1gt
1
(900,1)
lt8000,1gt
Author
Sold
Price
Author
Sold
Price
Author
Sold
Price
Author
17Merge-GB Merge Phase
112
1
1
1
Book (2000)
Book (1999)
Book
Book (1999)
60,60
96,96
25,25
1
1
1
lt5600,1gt
1
(900,1)
lt8000,1gt
Author
Sold
Price
Author
Sold
Price
Author
Sold
Price
Author
1
25,25
lt8000,1gt
Sold
Price
Author
Merge ( 25,25, 60,60 ) 25,60
18Merge-GB Merge Phase
2
1
1
Book (2000)
Book (1999)
Book
25,60
96,96
1
1
lt5600,1gt
1
(900,1)
Author
Sold
Price
Author
Sold
Price
Author
1
lt8000,1gt
Sold
Author
Merge ( lt5600,1gt, lt8000,1gt ) (lt5600,1gtlt8000,1gt)
19Merge-GB Merge Phase
2
1
1
Book (2000)
Book (1999)
Book
25,60
96,96
1
1
lt5600,1gt lt8000,1gt
1
(900,1)
Author
Sold
Price
Author
Sold
Price
Author
1
Author
What about Author? Can these nodes be merged
analogously?
20Merge-GB Merge Phase
- Repeated merging of nodes
- Based on equality of associated values (eff
group-by labels) - Counters are updated
- Child nodes are merged analogously
- Different types of merges
- Siblings vs Non-siblings
- Primarily to handle nested aggregations
21Merge-GB Merge Phase
2
1
1
Book (2000)
Book (1999)
Book
25,60
96,96
1
1
lt5600,1gt lt8000,1gt
1
(900,1)
Author
Sold
Price
Author
Sold
Price
Author
1
Author
MedianCount Median of Count Count requires
simple addition
22Merge-GB Merge Phase
2
1
1
Book (2000)
Book (1999)
Book
25,60
96,96
2
1
lt5600,1gt lt8000,1gt
(900,1)
Sold
Price
Author
Sold
Price
Author
1
Author
MedianCount Median of Count Median requires a
frequency count
23Merge-GB Merge Phase
2
1
1
Book (2000)
Book (1999)
Book
25,60
96,96
lt1,1gt lt2,1gt
1
lt5600,1gt lt8000,1gt
(900,1)
Sold
Price
Author
Sold
Price
Author
24Merge-GB Nesting
- Processing ß nodes
- Create group corresponding to each new value.
- Inner ß is processed analogously create a group
with label ßout, ßin. - For group inversion, create a dummy a node and
update later. - Processing ? nodes/ updating counters
- Holistic collect all ? values corresponding to
that ß. - In memory if possible otherwise dump to disk in
?-file - Otherwise compute aggregation on-the-fly
- Siblings or non-siblings? Native XML structure
allows decidability using exactly one integer per
answer node!
25Merge-GB Having clause
- Anti-monotonic early pruning
- Constraint remains false once violated
- Helps in early pruning, for unnested queries
- Nested Queries
- Associate a flag with each ß group
- ? calculations can be avoided for groups with
negative flags - Processing of inner-nests can also be avoided
- Inner block not processed until the having
clauses in all the outer blocks have been
processed - Separate ?files for each holistic calculation
26Merge-GB Moving Windows
- mw (width, step, winType, domType)
- Repeated-aggregation strategy
- Enumerate each group apriori
- Aggregate for each group treating them
independently - Extend the notion of hash(ß) to update all
appropriate counters - Example
- Year (WS 5, Step 2, Standard Domain, Incremental)
- 1990-1994
- 1992-1996
- 1994-1998
Val20
Val50
1993, Price30
1990, Price20
Val30
27Merge-GB Moving Windows
- mw (width, step, winType, domType)
- Rolling-over strategy
- Construct a window with step1, width1
- Results in an intermediate answer tree
- Compute moving window aggregate after computing
aggregate for first window, remove starting
elements and add elements for the subsequent
window
1990 1991 1992 1993 1994 1995 1996 1997 1998 1999
Val20
Val30
1990, Price20
1993, Price30
28Merge-GB Additional Details
- Multiple moving windows
- Result in hyper-rectangular windows
- Essentially performs a cartesian-product over all
involved attributes - Moving windows and Having clause
- Repeated aggregation requires no change
- Rolling-over with holistic aggregation requires
delay in processing of ?files. - Properties
- Lemma 1 With the aforementioned setup, a value
in an inner block that does not appear in the
answer is not read after the value was written
into the appropriate gamma file. - Lemma 2 For each value of , the rolling-over
strategy guarantees that aggregation is done at
most once.
29Performance Evaluation
- Setup
- 2GHz CPU, 1GB RAM machine.
- Datasets
- XMARK (50-500MB)
- DBLP (250-400MB)
- Protein (13MB)
- Comparison with Galax and Qizx
- Evaluation Parameters
- Size and number of groups
- Fully vs. Partially specified paths
- Increasing levels of nesting
- Scalability
30Performance Evaluation
Q1 Small number of large groups Q2 Large number
of small groups N-GB is not affected while Qizx
cannot handle a large number of groups
31Performance Evaluation
Q5 Query with fully specified paths Q6 Query
with partially specified paths N-GB is not
affected while Qizx cannot handle partially
specified paths
32Performance Evaluation
N-GB is linear as expected, whereas Qizx becomes
exponential with increasing levels of nesting
33Performance Evaluation
Early pruning gives manifold improvement for
queries with having clause(s)
34Performance Evaluation
Increase in window width/stop size results in
more overlapping computations and therefore gives
more gain
35Performance Evaluation
Early pruning provides the maximum benefit,
while repeated aggregations degrades performance
36Conclusions
- Features to express complex aggregation queries
on XML data - Efficient disk-based algorithm to compute
group-by queries - Three-phase process
- Stable, scalable, efficient
- Natural support for various optimizations
- Experimental results to affirm the above
37Thank You ?