On Relational Support for XML Publishing - PowerPoint PPT Presentation

Slides: 38
Provided by: connpadrai
Learn more at: http://web.cs.wpi.edu
Transcript and Presenter's Notes

1
On Relational Support for XML Publishing
  • Beyond Sorting and Tagging
  • Surajit Chaudhuri
  • Raghav Kaushik
  • Jeffrey F. Naughton
  • Presented by
  • Conn Doherty

2
Outline
  • Motivation and Observations
  • XML
  • Topic of Paper
  • GApply Operator Approach
  • Transformation Rules
  • Experiments and Results
  • Related Work
  • Conclusions
  • Future Problems

3
Motivation
  • Does the need for efficient XML publishing bring
    any new requirements for relational query
    engines, or is sorting query results in the
    relational engine and tagging them in middleware
    sufficient?

4
Observations
  • The mismatch between the XML data model and the
    relational model requires relational engines to
    be enhanced for efficiency
  • Need support for relation-valued variables

5
XML
  • Extensible Markup Language (strictly speaking,
    a metalanguage or metametalanguage)
  • Rapidly emerging as a standard for exchanging
    business data
  • Substantial interest in publishing existing
    relational data as XML

6
Current XML Publishing
  • Most focus has been on issues external to the
    RDBMS
  • Determining the class of XML views that can be
    defined
  • Languages used to specify the conversion from
    relational data to XML
  • Methods of composing XML queries with XML views
  • Data warehousing has caused focus on similar
    issues internal to RDBMS

7
Primary Topic of Paper
  • Focus closely on the class of SQL queries that
    are typically generated by XML publishing
    applications
  • Ask if anything needs to be changed within the
    relational engine to efficiently evaluate these
    queries?

8
YES!
  • Differences between the XML and relational data
    models cause awkward and inefficient translations
    of XML queries into SQL queries
  • Main issue: XML's hierarchical model makes it very
    convenient and natural to apply operators to
    subtrees

9
Part Supplier Example
  • Part and Supplier Data Set
  • supplier(s_suppkey, s_name)
  • partsupp(ps_suppkey, ps_partkey)
  • part(p_partkey, p_name, p_retailprice)

10
Part Supplier Example
  • Query Q1: For each supplier element, return the
    names and retail prices of all parts supplied by
    that supplier, and also the overall average
    retail price of all parts supplied

Example XML Document
<suppliers>
  <supplier>
    <sname>S1</sname>
    <parts>
      <part>
        <pname>P1</pname>
        <retailprice>10</retailprice>
      </part>
      <part>
        <pname>P2</pname>
        <retailprice>10</retailprice>
      </part>
    </parts>
  </supplier>
  <supplier>
    <sname>S2</sname>
    <parts>
      <part>
        <pname>P21</pname>
        <retailprice>12</retailprice>
      </part>
      <part>
        <pname>P22</pname>
        <retailprice>13</retailprice>
      </part>
    </parts>
  </supplier>
</suppliers>

11
Example Queries
  • XQuery
    for $s in /doc(tpch.xml)/suppliers/supplier
    return <ret> $s/s_suppkey
             <parts>
               for $p in $s/part
               return <part>
                        $p/p_name
                        $p/p_retailprice
                      </part>
             </parts>
             avg($s/part/p_retailprice)
           </ret>
  • SQL
    (select ps_suppkey, p_name, p_retailprice, null
     from partsupp, part
     where ps_partkey = p_partkey
     union all
     select ps_suppkey, null, null, avg(p_retailprice)
     from partsupp, part
     where ps_partkey = p_partkey
     group by ps_suppkey)
    order by ps_suppkey
  • The SQL version (relational data model) is hard to
    express and inefficient to evaluate
  • SQL is unable to bind a variable to sets of tuples
    and execute subqueries on these sets
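
The sorted outer union query can be tried on the four parts from the example XML document. This is a minimal sketch using Python's built-in sqlite3 module rather than SQL Server. The integer supplier and part keys, the `avg_price` column alias, and the function name are assumptions made for illustration; the outer parentheses from the slide are dropped because SQLite does not accept them around a compound select.

```python
import sqlite3

# Sample rows matching the example XML document; the integer keys are assumed.
PARTSUPP = [(1, 101), (1, 102), (2, 201), (2, 202)]  # (ps_suppkey, ps_partkey)
PART = [(101, "P1", 10), (102, "P2", 10), (201, "P21", 12), (202, "P22", 13)]

SORTED_OUTER_UNION = """
    select ps_suppkey, p_name, p_retailprice, null as avg_price
    from partsupp, part
    where ps_partkey = p_partkey
    union all
    select ps_suppkey, null, null, avg(p_retailprice)
    from partsupp, part
    where ps_partkey = p_partkey
    group by ps_suppkey
    order by ps_suppkey
"""

def run_sorted_outer_union():
    """Load the sample rows into an in-memory database and run the query."""
    con = sqlite3.connect(":memory:")
    con.execute("create table partsupp (ps_suppkey integer, ps_partkey integer)")
    con.execute("create table part (p_partkey integer, p_name text, p_retailprice real)")
    con.executemany("insert into partsupp values (?, ?)", PARTSUPP)
    con.executemany("insert into part values (?, ?, ?)", PART)
    rows = con.execute(SORTED_OUTER_UNION).fetchall()
    con.close()
    return rows

if __name__ == "__main__":
    for row in run_sorted_outer_union():
        print(row)
```

Each supplier contributes one row per part plus one extra row carrying only the average, which is why the join appears twice in the query and the middleware has to untangle the null-padded rows afterwards.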

12
Three-Pronged Approach
  • 1) New operator, GApply
  • Binds a variable to sets of tuples
  • Allows subqueries to be executed over the set of
    tuples (tmp relation) bound to a variable
  • 2) Propose transformation rules to modify query
    plan trees containing the GApply operator
  • 3) Expose the GApply operator in SQL syntax

13
GApply Operator
  • Syntax: GApply(GCols, PGQ)
  • GCols: grouping/partitioning columns
  • PGQ: per-group query
  • Input tuple stream is partitioned on GCols
  • PGQ applied to each group
  • Output is the union of all above results taken
    over all groups
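
The semantics on this slide can be sketched in a few lines of Python. This is an illustration under assumptions, not the paper's implementation: rows are modeled as dictionaries, the per-group query is an ordinary function, and the names `gapply`, `count_parts`, and `ROWS` are hypothetical.

```python
from collections import defaultdict

def gapply(rows, gcols, pgq):
    """Partition rows on gcols, run pgq on each group, union all the results."""
    groups = defaultdict(list)
    for row in rows:
        groups[tuple(row[c] for c in gcols)].append(row)
    output = []
    for group in groups.values():
        output.extend(pgq(group))  # pgq sees only the tuples of its own group
    return output

def count_parts(group):
    """Example per-group query: one output row with the group's part count."""
    return [{"ps_suppkey": group[0]["ps_suppkey"], "n_parts": len(group)}]

ROWS = [
    {"ps_suppkey": 1, "p_name": "P1"},
    {"ps_suppkey": 1, "p_name": "P2"},
    {"ps_suppkey": 2, "p_name": "P21"},
    {"ps_suppkey": 2, "p_name": "P22"},
]

if __name__ == "__main__":
    print(gapply(ROWS, ["ps_suppkey"], count_parts))
```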

14
Terminology
  • Outer tuple stream: the input tuple stream
  • Inner query: the per-group query
  • Outer child of GApply: root of the outer query
  • Inner child of GApply: root of the inner query

15
PGQ Restrictions
  • The PGQ may only operate on the temporary relation
    associated with the group of tuples
  • This style of operator is also known as groupwise
    processing
  • Operators allowed in a PGQ: scan, select, project,
    distinct, apply, exists, union(all), groupby,
    aggregate, and orderby

16
Physical Implementation
  • Two Phases
  • Partitioning Phase
  • Implemented using sorting or hashing
  • Execution Phase
  • Performed in nested loop fashion
  • PGQ is evaluated on each group of tuples
  • Each group is a temporary relation bound to a
    relation-valued parameter group
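
The two phases can be sketched as follows, with either partitioning strategy feeding the same nested-loop execution step. This is a simplified in-memory illustration; rows are tuples, grouping columns are given as tuple positions, and all function names are assumptions.

```python
from itertools import groupby
from operator import itemgetter

def partition_by_sorting(rows, gcols):
    """Partitioning phase, sort-based: sort on the grouping columns, then cut runs."""
    key = itemgetter(*gcols)
    for _, run in groupby(sorted(rows, key=key), key=key):
        yield list(run)

def partition_by_hashing(rows, gcols):
    """Partitioning phase, hash-based: bucket rows by their grouping-column values."""
    key = itemgetter(*gcols)
    buckets = {}
    for row in rows:
        buckets.setdefault(key(row), []).append(row)
    yield from buckets.values()

def execute(partitions, pgq):
    """Execution phase: nested loop that evaluates the PGQ once per group."""
    output = []
    for group in partitions:       # each group plays the role of the tmp relation
        output.extend(pgq(group))  # PGQ evaluated with the group bound as its input
    return output

if __name__ == "__main__":
    rows = [(2, "P21", 12), (1, "P1", 10), (2, "P22", 13), (1, "P2", 10)]
    pgq = lambda group: [(group[0][0], len(group))]
    print(execute(partition_by_sorting(rows, [0]), pgq))
    print(execute(partition_by_hashing(rows, [0]), pgq))
```

Sorting delivers groups in key order, while hashing delivers them in first-seen order; the per-group results are the same either way.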

17
Implementation Diagram
[Diagram: GApply as a nested loop (NL). The outer child (outer query) feeds the partition phase; each partition is a tmp relation bound to group, over which the inner child (inner query) runs in the execution phase.]

18
Expose GApply in Syntax
  • It is difficult for the parser and optimizer to
    determine when GApply applies
  • Tests were run on Microsoft SQL Server 2000, where
    the GApply operator is not exposed in the syntax
  • The optimizer only sometimes identifies the need
    for GApply
  • In each case where it is used, GApply considerably
    speeds up the query

19
Proposed Syntax
  • Proposed extension to SQL syntax
  • SQL query performing groupwise processing
    select gapply(PGQ(x)) as <column list>
    from <relation list>
    where <conditions>
    group by <grouping columns> : x
  • x is a relation-valued variable

20
Example Query in Syntax
  • Query Q1
    select gapply(PGQ1(tmpSupp))
    from partsupp, part
    where ps_partkey = p_partkey
    group by ps_suppkey : tmpSupp
  • PGQ1(tmpSupp):
    select p_name, p_retailprice, null
    from tmpSupp
    union all
    select null, null, avg(p_retailprice)
    from tmpSupp
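
To make the per-group query concrete, here is a hedged Python rendering of Q1 over the joined partsupp and part rows from the example document. The function names, the dictionary row layout, and the integer supplier keys are assumptions; the proposed SQL syntax itself is not executable in any engine used here.

```python
def pgq1(tmp_supp):
    """Per-group query: the group's part rows, then one row with the average price."""
    parts = [(r["p_name"], r["p_retailprice"], None) for r in tmp_supp]
    average = sum(r["p_retailprice"] for r in tmp_supp) / len(tmp_supp)
    return parts + [(None, None, average)]

def q1(joined_rows):
    """Group the joined rows by ps_suppkey and apply pgq1 to each group."""
    groups = {}
    for row in joined_rows:
        groups.setdefault(row["ps_suppkey"], []).append(row)
    result = []
    for tmp_supp in groups.values():
        result.extend(pgq1(tmp_supp))
    return result

# Joined rows for the example document; supplier keys 1 and 2 are assumed.
JOINED = [
    {"ps_suppkey": 1, "p_name": "P1", "p_retailprice": 10},
    {"ps_suppkey": 1, "p_name": "P2", "p_retailprice": 10},
    {"ps_suppkey": 2, "p_name": "P21", "p_retailprice": 12},
    {"ps_suppkey": 2, "p_name": "P22", "p_retailprice": 13},
]

if __name__ == "__main__":
    for row in q1(JOINED):
        print(row)
```

Unlike the sorted outer union, the join result is scanned once and the average is computed over the group that is already in hand.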

21
Transformation Rules
  • Precise semantics of the operators
  • Three categories
  • 1) Pushing Computation into the Outer Query
  • Placing Projections Before GApply
  • Placing Selections Before GApply
  • Converting GApply to groupby
  • 2) Group Selection
  • 3) Pushing GApply Below Joins
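
As an illustration of the "Converting GApply to groupby" idea, consider a per-group query that only computes an aggregate. The sketch below is an assumption about the intuition, not the paper's formal rule: it shows that materializing each group and then averaging gives the same answer as an ordinary groupby that keeps a running sum and count per key. All names are hypothetical.

```python
def gapply_avg(rows):
    """GApply form: materialize every group, then run an aggregate-only PGQ on it."""
    groups = {}
    for suppkey, _name, price in rows:
        groups.setdefault(suppkey, []).append(price)
    return {key: sum(prices) / len(prices) for key, prices in groups.items()}

def groupby_avg(rows):
    """groupby form: keep only a running (sum, count) per key, no tmp relation."""
    acc = {}
    for suppkey, _name, price in rows:
        total, count = acc.get(suppkey, (0, 0))
        acc[suppkey] = (total + price, count + 1)
    return {key: total / count for key, (total, count) in acc.items()}

ROWS_BY_SUPPLIER = [(1, "P1", 10), (1, "P2", 10), (2, "P21", 12), (2, "P22", 13)]

if __name__ == "__main__":
    print(gapply_avg(ROWS_BY_SUPPLIER))
    print(groupby_avg(ROWS_BY_SUPPLIER))
```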

22
Rule 2
  • Group Selection
  • Consider a PGQ that returns either the whole group
    (subtree) or nothing, based on a predicate
  • Two methods to evaluate
  • Join suppliers and parts, group by suppkey, check
    the selection predicate on each group, and return
    the group if it holds
  • Apply the selection first to get the qualifying
    suppkeys, then join to return the matching groups
  • The second method will win if the predicate is
    highly selective
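
The two evaluation methods can be compared on a small in-memory example. This is a sketch under assumptions: the predicate is taken to be "some part of the supplier has a retail price above a threshold", rows are (suppkey, part name, price) tuples for the already joined data, and the function names are hypothetical.

```python
def group_then_filter(joined, threshold):
    """Method 1: form every group, test the predicate per group, keep whole groups."""
    groups = {}
    for row in joined:
        groups.setdefault(row[0], []).append(row)
    output = []
    for group in groups.values():
        if any(price > threshold for _key, _name, price in group):
            output.extend(group)
    return output

def select_then_join(joined, threshold):
    """Method 2: select qualifying suppkeys first, then fetch only their rows."""
    keys = {key for key, _name, price in joined if price > threshold}
    return [row for row in joined if row[0] in keys]

JOINED_ROWS = [(1, "P1", 10), (1, "P2", 10), (2, "P21", 12), (2, "P22", 13)]

if __name__ == "__main__":
    print(group_then_filter(JOINED_ROWS, 11))
    print(select_then_join(JOINED_ROWS, 11))
```

Both functions return the same groups. In this toy version both scan the same list, so the cost difference is not visible; in an engine, the second method avoids building groups that a highly selective predicate would discard.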

23
Rule 2 cont.
  • Example
    for $s in /doc(tpch.xml)/suppliers/supplier[part/p_retailprice > 1000]
    return $s

24
Integrating Rules in Optimizer
  • None of the rules above loop, so the optimizer
    terminates
  • Optimizer must estimate the cost of the GApply
    operation

25
Preliminary Experiments
  • Performance study
  • Measure the efficacy of the GApply operator in
    speeding up queries
  • Understand impact of each proposed transformation
    rule
  • Microsoft SQL Server 2000
  • Supports GApply without syntax exposure
  • Control over GApply invocation is needed
  • Simulate operation of GApply on the client side

26
Client Side Simulation of GApply
  • Partition
  • Sorting
  • Hashing (simulation)
  • Execute
  • Store result of outer query in temporary table
  • For each distinct tmp group relation, evaluate
    PGQ on that relation, then union all results
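
The client-side procedure can be sketched against SQLite as a stand-in for SQL Server. This is an illustration only: the table names `tmp_outer` and `tmp_group`, the helper names, and the integer keys in the sample data are assumptions, and the strings passed in are interpolated into SQL, so they must come from trusted code.

```python
import sqlite3

OUTER_SQL = """
    select ps_suppkey, p_name, p_retailprice
    from partsupp, part
    where ps_partkey = p_partkey
"""
PGQ1_SQL = """
    select p_name, p_retailprice, null from tmp_group
    union all
    select null, null, avg(p_retailprice) from tmp_group
"""

def simulate_gapply(con, outer_sql, gcol, pgq_sql):
    """Client-side GApply: temp table for the outer query, then one PGQ per group."""
    con.execute("drop table if exists tmp_outer")
    con.execute(f"create temp table tmp_outer as {outer_sql}")
    # Empty temp table with the same columns; it is refilled once per group.
    con.execute("drop table if exists tmp_group")
    con.execute("create temp table tmp_group as select * from tmp_outer where 0")
    keys = [k for (k,) in con.execute(
        f"select distinct {gcol} from tmp_outer order by {gcol}")]
    results = []
    for k in keys:
        con.execute("delete from tmp_group")
        con.execute(
            f"insert into tmp_group select * from tmp_outer where {gcol} = ?", (k,))
        results.extend(con.execute(pgq_sql).fetchall())  # union all of PGQ results
    return results

def example_db():
    """In-memory database holding the parts of the example XML document."""
    con = sqlite3.connect(":memory:")
    con.execute("create table partsupp (ps_suppkey integer, ps_partkey integer)")
    con.execute("create table part (p_partkey integer, p_name text, p_retailprice integer)")
    con.executemany("insert into partsupp values (?, ?)",
                    [(1, 101), (1, 102), (2, 201), (2, 202)])
    con.executemany("insert into part values (?, ?, ?)",
                    [(101, "P1", 10), (102, "P2", 10), (201, "P21", 12), (202, "P22", 13)])
    return con

if __name__ == "__main__":
    for row in simulate_gapply(example_db(), OUTER_SQL, "ps_suppkey", PGQ1_SQL):
        print(row)
```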

27
Estimate Running Time
  • Measure both elapsed time and CPU time
  • Use operator trees in which GApply is the topmost
    operator
  • Expect real elapsed time to be lower in a full
    server-side implementation

28
Setup
  • Experimental Setup
  • TPC-H benchmark data
  • 5GB database
  • Server
  • 1 GHz processor
  • 784 MB main memory
  • 512 MB buffer pool
  • Each query was run several times and the average
    was taken

29
Results
  • Effectiveness of GApply
  • Results are comparable whether partitioning is
    performed by sorting or by hashing
  • Tested 4 queries representing a wide range of
    queries

30
GApply Effectiveness Results
  • Main conclusions
  • GApply is a useful operator even for simple
    XQuery queries
  • Yields speedups of up to a factor of 2
  • The queries are representative of a wide class of
    queries
  • Q4 took 20% longer with the client-side
    implementation
  • For Q1, Q2, and Q3, further performance improvements
    are expected with a server-side implementation

(hash-based partitioning)
31
Results cont.
  • Effectiveness of Optimization Rules
  • Tested the improvement obtained by firing each
    rule
  • Performance metric is elapsed time
  • Method
  • Choose relevant parameterized query
  • Vary parameter and find performance benefit for
    each value
  • Benefit: ratio of elapsed time without the rule to
    elapsed time with the rule fired

32
Rule Effectiveness Example
  • Query
    for $s in /doc(tpch.xml)/suppliers/supplier[part/p_retailprice > x]
    return $s
  • The parameter x determines the selectivity of the
    selection

33
Results cont.
  • Effectiveness of Optimization Rules
  • Main conclusions
  • Proposed rules can have significant impact on
    elapsed time of a query involving GApply
  • Some rules always lowered the cost of the query,
    while others sometimes lowered and sometimes
    increased the cost
  • Benefit of converting GApply to groupby is
    comparatively lower

34
Related Work
  • Xperanto Project
  • Concluded that pushing as much computation as
    possible into the relational engine is best
  • SilkRoute Project
  • Language to specify the conversion between
    relational data and XML
  • ROLEX Project
  • To avoid inefficient parsing in applications, the
    relational engine returns a navigable result tree
  • Difference
  • Question whether whole process of XML publishing
    has any impact on the core relational operators
    (YES)

35
Conclusions
  • The relational engine must provide support for
    binding variables to sets of tuples
  • Required support can be enabled through the
    GApply operator with seamless integration into
    existing relational engines
  • Operator should be exposed in the syntax
  • Optimization rules are needed

36
Future Problems
  • How should modified syntax be exploited by
    algorithms to translate XML queries over XML
    views of relational data?
  • Any other changes needed to meet the requirements
    of XML publishing?
  • What changes are needed in the optimizer if the
    relational database returns navigable results?

37
Other Papers
  • D. Chatziantoniou and K. A. Ross. Querying
    multiple features of groups in relational
    databases. In VLDB, 1996.
  • Extension to SQL syntax with relational algebra
    implementation
  • D. Chatziantoniou and K. A. Ross. Groupwise
    processing of relational queries. In VLDB, 1997.
  • Methods to identify group query components
  • C. A. Galindo-Legaria and M. M. Joshi. Orthogonal
    optimization of subqueries and aggregation. In
    SIGMOD, 2001.
  • Introduction of segmentApply operator and many
    transformation rules