Complex Group-By Queries for XML - PowerPoint PPT Presentation

About This Presentation
Title:

Complex Group-By Queries for XML

Description:

Complex Group-By Queries for XML C. Gokhale+, N. Gupta#, P. Kumar+, L.V.S. Lakshmanan*, R. Ng*, and B.A. Prakash+ * University of British Columbia, Vancouver – PowerPoint PPT presentation

Number of Views:79
Avg rating:3.0/5.0
Slides: 38
Provided by: Nitin59
Category:
Tags: xml | complex | group | queries | scalar | waves

less

Transcript and Presenter's Notes

Title: Complex Group-By Queries for XML


1
Complex Group-By Queries for XML
  • C. Gokhale, N. Gupta, P. Kumar, L.V.S.
    Lakshmanan, R. Ng, and B.A. Prakash
  • University of British Columbia, Vancouver
  • Indian Institute of Technology, Bombay
  • Cornell University, Ithaca

2
Why this paper?
  • Of late, there is considerable interest in
    analytical processing of XML data.
  • Even for data integration, there is a compelling
    need for performing various group-by style
    aggregate operations.
  • A core operator needed for analytics is the
    group-by operator, which is widely used in
    relational as well as OLAP database applications.

3
Why this paper?
  • XQuery requires group-by operations to be
    simulated using nesting.
  • State of current research
  • Provide support for grouping at the logical or
    physical level and recognize grouping operations
    from nested queries and rewrite them with
    grouping operations
  • Extend XQuery FLWOR expressions with explicit
    constructs similar to the group-by, order-by and
    having clauses in SQL
  • Direct algorithmic support for a group-by
    operator is not explored.

4
Overview
  • Nested Aggregation
  • Expressing complex aggregation queries
  • Having clause
  • Moving Windows
  • Related Work
  • Algorithm MERGE-GB
  • Extensions for disk-based
  • Dealing with nesting
  • Dealing with having clause
  • Dealing with moving windows
  • Experimental results

5
Syntax
Median of Sold
Outer grouping attribute
Average of price
Inner grouping attribute
6
Syntax
7
Syntax
  • group //Book
  • by //Name return (
  • //Name, avg(/Price), count()
  • then by /Year return (
  • /Year, median(/Sold)
  • )
  • )

8
General Framework
  • a is an absolute XPath expression, while ßs and
    ?s are relative to a.
  • aggouts and aggins are aggregation operations
    such as min(), count(), avg(), median().
  • Nested aggregations medianMax(), minCount()
  • Cons, AggConsout and AggConsin are sets of
    conditions.
  • mwouts and mwins denote moving window
    specifications.
  • mwi (width, step, winType, domType)

9
Nested Aggregation Features
  • Having Clause
  • aggi(?i) ?i ci, where ?i in , ?, gt,lt,, , and
    ci is a constant.
  • Moving Windows
  • (width, step, winType, domType)
  • group //Book
  • by //Name
  • having count() 100 return (
  • //Name, avg(/Price), count()
  • then by /Year return (
  • /Year(10,5), median(/Sold)
  • )
  • )

10
Related Work
  • 2, 3 give syntactic extensions to FLOWR
    expressions
  • Discuss related analytics (moving window
    aggregations and cube)
  • Do NOT discuss algorithms for directly computing
    group-bys
  • Support for group-by at logical/physical level
    6
  • 4, 5, 9, 12 detect group-bys from nested
    queries
  • Challenging and queries are hard to express and
    understand.
  • Group-by and cube queries for relational data 8,
    10.
  • Algorithms not directly applicable to
    hierarchical data
  • Group-by elements involve combination of forward
    and backward axes
  • Aggregations on values nested and occur at
    multiple levels

11
Merge-GB
  • Overview
  • No support for nesting, having, moving windows
  • Three phase process
  • Initialization
  • Prune nodes other than a, ß and ? result in a
    Canonical Tree
  • Counter Initialization Associate counter with
    edge based on aggregation operator
  • Count() counter is a scalar with value 1
  • Spread() counter is a range of type min, max
  • Median() counter is a frequency table
  • Merge Phase
  • Answer Extraction
  • Running query
  • group //Book by //Year
  • return Year, Median(Sold), MedianCount(Author),
    Spread(price), count()

12
Merge-GB Initialization
Catalogue
Subject
Subject
Music CD
Book
Name
Book
Book
Name
Book
PubInfo
Title Mechanics
Author Smith
Year 1999
Sold 5600
Price 60
Author Newton
PubInfo
Title Sound Waves
Year 2000
Sold 900
Price 96
Author McManus
PubInfo
Title Carbon Comps
Year 1999
Sold 8000
Price 25
Author Johnson
Name Kaufman
City NY
Name Wesley
City LA
Name Kaufman
City NY
13
Merge-GB Initialization
1
1
1
1
Book
Book
Book
Book
60,60
96,96
25,25
1
1
1
(5600,1)
1
(900,1)
(8000,1)
Author
Year 1999
Sold
Price
Author
Year 2000
Sold
Price
Author
Year 1999
Sold
Price
Author
14
Merge-GB Initialization
1
1
1
1
Book (2000)
Book (1999)
Book (1999)
Book
60,60
96,96
25,25
1
1
1
(5600,1)
1
(900,1)
(8000,1)
Author
Sold
Price
Author
Sold
Price
Author
Sold
Price
Author
15
Merge-GB Merge Phase
  • Repeated merging of nodes
  • Based on equality of associated values (eff
    group-by labels)
  • Counters are updated
  • Child nodes are merged analogously

16
Merge-GB Merge Phase
1
1
1
1
Book (2000)
Book (1999)
Book (1999)
Book
60,60
96,96
25,25
1
1
1
lt5600,1gt
1
(900,1)
lt8000,1gt
Author
Sold
Price
Author
Sold
Price
Author
Sold
Price
Author
17
Merge-GB Merge Phase
112
1
1
1
Book (2000)
Book (1999)
Book
Book (1999)
60,60
96,96
25,25
1
1
1
lt5600,1gt
1
(900,1)
lt8000,1gt
Author
Sold
Price
Author
Sold
Price
Author
Sold
Price
Author
1
25,25
lt8000,1gt
Sold
Price
Author
Merge ( 25,25, 60,60 ) 25,60
18
Merge-GB Merge Phase
2
1
1
Book (2000)
Book (1999)
Book
25,60
96,96
1
1
lt5600,1gt
1
(900,1)
Author
Sold
Price
Author
Sold
Price
Author
1
lt8000,1gt
Sold
Author
Merge ( lt5600,1gt, lt8000,1gt ) (lt5600,1gtlt8000,1gt)
19
Merge-GB Merge Phase
2
1
1
Book (2000)
Book (1999)
Book
25,60
96,96
1
1
lt5600,1gt lt8000,1gt
1
(900,1)
Author
Sold
Price
Author
Sold
Price
Author
1
Author
What about Author? Can these nodes be merged
analogously?
20
Merge-GB Merge Phase
  • Repeated merging of nodes
  • Based on equality of associated values (eff
    group-by labels)
  • Counters are updated
  • Child nodes are merged analogously
  • Different types of merges
  • Siblings vs Non-siblings
  • Primarily to handle nested aggregations

21
Merge-GB Merge Phase
2
1
1
Book (2000)
Book (1999)
Book
25,60
96,96
1
1
lt5600,1gt lt8000,1gt
1
(900,1)
Author
Sold
Price
Author
Sold
Price
Author
1
Author
MedianCount Median of Count Count requires
simple addition
22
Merge-GB Merge Phase
2
1
1
Book (2000)
Book (1999)
Book
25,60
96,96
2
1
lt5600,1gt lt8000,1gt
(900,1)
Sold
Price
Author
Sold
Price
Author
1
Author
MedianCount Median of Count Median requires a
frequency count
23
Merge-GB Merge Phase
2
1
1
Book (2000)
Book (1999)
Book
25,60
96,96
lt1,1gt lt2,1gt
1
lt5600,1gt lt8000,1gt
(900,1)
Sold
Price
Author
Sold
Price
Author
24
Merge-GB Nesting
  • Processing ß nodes
  • Create group corresponding to each new value.
  • Inner ß is processed analogously create a group
    with label ßout, ßin.
  • For group inversion, create a dummy a node and
    update later.
  • Processing ? nodes/ updating counters
  • Holistic collect all ? values corresponding to
    that ß.
  • In memory if possible otherwise dump to disk in
    ?-file
  • Otherwise compute aggregation on-the-fly
  • Siblings or non-siblings? Native XML structure
    allows decidability using exactly one integer per
    answer node!

25
Merge-GB Having clause
  • Anti-monotonic early pruning
  • Constraint remains false once violated
  • Helps in early pruning, for unnested queries
  • Nested Queries
  • Associate a flag with each ß group
  • ? calculations can be avoided for groups with
    negative flags
  • Processing of inner-nests can also be avoided
  • Inner block not processed until the having
    clauses in all the outer blocks have been
    processed
  • Separate ?files for each holistic calculation

26
Merge-GB Moving Windows
  • mw (width, step, winType, domType)
  • Repeated-aggregation strategy
  • Enumerate each group apriori
  • Aggregate for each group treating them
    independently
  • Extend the notion of hash(ß) to update all
    appropriate counters
  • Example
  • Year (WS 5, Step 2, Standard Domain, Incremental)
  • 1990-1994
  • 1992-1996
  • 1994-1998

Val20
Val50
1993, Price30
1990, Price20
Val30
27
Merge-GB Moving Windows
  • mw (width, step, winType, domType)
  • Rolling-over strategy
  • Construct a window with step1, width1
  • Results in an intermediate answer tree
  • Compute moving window aggregate after computing
    aggregate for first window, remove starting
    elements and add elements for the subsequent
    window

1990 1991 1992 1993 1994 1995 1996 1997 1998 1999
Val20
Val30
1990, Price20
1993, Price30
28
Merge-GB Additional Details
  • Multiple moving windows
  • Result in hyper-rectangular windows
  • Essentially performs a cartesian-product over all
    involved attributes
  • Moving windows and Having clause
  • Repeated aggregation requires no change
  • Rolling-over with holistic aggregation requires
    delay in processing of ?files.
  • Properties
  • Lemma 1 With the aforementioned setup, a value
    in an inner block that does not appear in the
    answer is not read after the value was written
    into the appropriate gamma file.
  • Lemma 2 For each value of , the rolling-over
    strategy guarantees that aggregation is done at
    most once.

29
Performance Evaluation
  • Setup
  • 2GHz CPU, 1GB RAM machine.
  • Datasets
  • XMARK (50-500MB)
  • DBLP (250-400MB)
  • Protein (13MB)
  • Comparison with Galax and Qizx
  • Evaluation Parameters
  • Size and number of groups
  • Fully vs. Partially specified paths
  • Increasing levels of nesting
  • Scalability

30
Performance Evaluation
Q1 Small number of large groups Q2 Large number
of small groups N-GB is not affected while Qizx
cannot handle a large number of groups
31
Performance Evaluation
Q5 Query with fully specified paths Q6 Query
with partially specified paths N-GB is not
affected while Qizx cannot handle partially
specified paths
32
Performance Evaluation
N-GB is linear as expected, whereas Qizx becomes
exponential with increasing levels of nesting
33
Performance Evaluation
Early pruning gives manifold improvement for
queries with having clause(s)
34
Performance Evaluation
Increase in window width/stop size results in
more overlapping computations and therefore gives
more gain
35
Performance Evaluation
Early pruning provides the maximum benefit,
while repeated aggregations degrades performance
36
Conclusions
  • Features to express complex aggregation queries
    on XML data
  • Efficient disk-based algorithm to compute
    group-by queries
  • Three-phase process
  • Stable, scalable, efficient
  • Natural support for various optimizations
  • Experimental results to affirm the above

37
Thank You ?
Write a Comment
User Comments (0)
About PowerShow.com