Title: On Relational Support for XML Publishing
On Relational Support for XML Publishing
- Beyond Sorting and Tagging
- Surajit Chaudhuri
- Raghav Kaushik
- Jeffrey F. Naughton
- Presented by
- Conn Doherty
Outline
- Motivation and Observations
- XML
- Topic of Paper
- GApply Operator Approach
- Transformation Rules
- Experiments and Results
- Related Work
- Conclusions
- Future Problems
Motivation
- Does the need for efficient XML publishing bring
any new requirements for relational query
engines, or is sorting query results in the
relational engine and tagging them in middleware
sufficient?
Observations
- The mismatch between the XML data model and the
relational model requires relational engines to
be enhanced for efficiency
- Need support for relation-valued variables
XML
- Extensible Markup Language (or rather,
a metalanguage or metametalanguage)
- Rapidly emerging as a standard for exchanging
business data
- Substantial interest in publishing existing
relational data as XML
Current XML Publishing
- Most focus has been on issues external to the
RDBMS
- Determining the class of XML views that can be
defined
- Languages used to specify the conversion from
relational data to XML
- Methods of composing XML queries with XML views
- Data warehousing has prompted a focus on similar
issues internal to the RDBMS
Primary Topic of Paper
- Focus closely on the class of SQL queries that
are typically generated by XML publishing
applications
- Ask whether anything needs to change within the
relational engine to evaluate these queries
efficiently
YES!
- Differences between the XML and relational data
models cause awkward and inefficient translations
of XML queries to relational SQL queries
- Main issue
- XML's hierarchical model makes it very convenient
and natural to apply operators to subtrees
Part Supplier Example
- Part and Supplier Data Set
- supplier(s_key, s_name)
- partsupp(ps_suppkey, ps_partkey)
- part(p_partkey, p_name, p_retailprice)
Part Supplier Example
- Query Q1: For each supplier element, return the
names and retail prices of all parts supplied by
that supplier, and also the overall average
retail price of all parts supplied
Example XML Document
<suppliers>
  <supplier>
    <sname>S1</sname>
    <parts>
      <part> <pname>P1</pname> <retailprice>10</retailprice> </part>
      <part> <pname>P2</pname> <retailprice>10</retailprice> </part>
    </parts>
  </supplier>
  <supplier>
    <sname>S2</sname>
    <parts>
      <part> <pname>P21</pname> <retailprice>12</retailprice> </part>
      <part> <pname>P22</pname> <retailprice>13</retailprice> </part>
    </parts>
  </supplier>
</suppliers>
Example Queries
- XQuery
- for $s in /doc(tpch.xml)/suppliers/supplier
- return <ret> $s/s_suppkey
- <parts>
- for $p in $s/part
- return <part>
- $p/p_name
- $p/p_retailprice
- </part>
- </parts>
- avg($s/part/p_retailprice)
- </ret>
- SQL
- (select ps_suppkey, p_name, p_retailprice, null
- from partsupp, part
- where ps_partkey = p_partkey
- union all
- select ps_suppkey, null, null, avg(p_retailprice)
- from partsupp, part
- where ps_partkey = p_partkey
- group by ps_suppkey)
- order by ps_suppkey
- In SQL (relational data model) the query is hard
to express and inefficient
- Unable to bind a variable to sets of tuples and
execute subqueries on these sets
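The sorted outer-union query above can be tried on a toy data set. The following is a minimal sketch using Python's built-in sqlite3 module, not the setup used in the paper. The table contents, the `avg_price` column alias, and the helper names `make_toy_db` and `sorted_outer_union` are assumptions made for illustration. The outer parentheses are dropped because SQLite does not accept a parenthesized compound select.

```python
import sqlite3


def make_toy_db():
    """Build an in-memory database holding the four parts of the example document."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        create table part(p_partkey integer, p_name text, p_retailprice real);
        create table partsupp(ps_suppkey integer, ps_partkey integer);
        insert into part values (1, 'P1', 10), (2, 'P2', 10),
                                (3, 'P21', 12), (4, 'P22', 13);
        insert into partsupp values (1, 1), (1, 2), (2, 3), (2, 4);
        """
    )
    return conn


def sorted_outer_union(conn):
    """Run the sorted outer-union form of Q1 and return all rows."""
    return conn.execute(
        """
        select ps_suppkey, p_name, p_retailprice, null as avg_price
        from partsupp, part
        where ps_partkey = p_partkey
        union all
        select ps_suppkey, null, null, avg(p_retailprice)
        from partsupp, part
        where ps_partkey = p_partkey
        group by ps_suppkey
        order by ps_suppkey
        """
    ).fetchall()


if __name__ == "__main__":
    for row in sorted_outer_union(make_toy_db()):
        print(row)
```

Note that the join of partsupp and part appears twice in the query text, once per branch of the union. This is the redundancy that binding a variable to each group is meant to avoid.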
3 Angle Approach
- 1) New operator, GApply
- Binds a variable to sets of tuples
- Allows subqueries to be executed over the set of
tuples (tmp relation) bound to a variable
- 2) Propose transformation rules to modify query
plan trees containing the GApply operator
- 3) Expose the GApply operator in SQL syntax
GApply Operator
- Syntax: GApply(GCols, PGQ)
- GCols: grouping/partitioning columns
- PGQ: per-group query
- Input tuple stream is partitioned on GCols
- PGQ applied to each group
- Output is the union of all above results taken
over all groups
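The semantics listed above can be sketched in a few lines of Python. This is an illustrative model only, not the operator as implemented in any engine. Rows are assumed to be dictionaries, the per-group query is assumed to be a Python function from a list of rows to a list of output tuples, and the names `gapply` and `pgq1` are hypothetical.

```python
from collections import defaultdict


def gapply(rows, gcols, pgq):
    """Partition rows on gcols, apply pgq to each group, and union the results."""
    groups = defaultdict(list)
    for row in rows:
        key = tuple(row[c] for c in gcols)
        groups[key].append(row)

    output = []
    for group in groups.values():
        # Each group plays the role of the temporary relation seen by the PGQ.
        output.extend(pgq(group))
    return output


def pgq1(group):
    """Per-group query for Q1: every part of the group, then the group's average price."""
    parts = [(r["p_name"], r["p_retailprice"], None) for r in group]
    average = sum(r["p_retailprice"] for r in group) / len(group)
    return parts + [(None, None, average)]


if __name__ == "__main__":
    joined = [
        {"ps_suppkey": 1, "p_name": "P1", "p_retailprice": 10},
        {"ps_suppkey": 1, "p_name": "P2", "p_retailprice": 10},
        {"ps_suppkey": 2, "p_name": "P21", "p_retailprice": 12},
        {"ps_suppkey": 2, "p_name": "P22", "p_retailprice": 13},
    ]
    for row in gapply(joined, ["ps_suppkey"], pgq1):
        print(row)
```

In this sketch the join result is scanned once and each group is handed to the per-group query as a whole, which is the behavior the outer-union SQL formulation cannot express directly.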
Terminology
- Outer tuple stream: input tuple stream
- Inner query: per-group query
- Outer child of GApply: root of outer query
- Inner child of GApply: root of inner query
PGQ Restrictions
- Only operates on the temporary relation associated
with the group of tuples
- Operator type also known as groupwise processing
- Operators allowed in PGQ: scan, select, project,
distinct, apply, exists, union (all), groupby,
aggregate, and orderby
Physical Implementation
- Two Phases
- Partitioning Phase
- Implemented using sorting or hashing
- Execution Phase
- Performed in nested loop fashion
- PGQ is evaluated on each group of tuples
- Each group is a temporary relation bound to a
relation-valued parameter, $group
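The two phases can be sketched as follows. This is a rough illustration under stated assumptions, not the server implementation: rows are dictionaries, the per-group query is a Python function, and the names `partition_by_sort`, `partition_by_hash`, and `execute` are hypothetical.

```python
from itertools import groupby
from operator import itemgetter


def partition_by_sort(rows, gcols):
    """Partitioning phase, sort-based: sort on gcols and cut the stream at key changes."""
    key = itemgetter(*gcols)
    for _, group in groupby(sorted(rows, key=key), key=key):
        yield list(group)


def partition_by_hash(rows, gcols):
    """Partitioning phase, hash-based: collect rows into one bucket per key."""
    key = itemgetter(*gcols)
    buckets = {}
    for row in rows:
        buckets.setdefault(key(row), []).append(row)
    yield from buckets.values()


def execute(groups, pgq):
    """Execution phase: nested loop that evaluates the PGQ once per group."""
    for group in groups:
        # The group acts as the temporary relation bound to the PGQ's parameter.
        yield from pgq(group)


if __name__ == "__main__":
    rows = [{"k": 2, "v": 1}, {"k": 1, "v": 2}, {"k": 2, "v": 3}, {"k": 1, "v": 4}]
    count_per_group = lambda group: [(group[0]["k"], len(group))]
    print(list(execute(partition_by_sort(rows, ["k"]), count_per_group)))
    print(list(execute(partition_by_hash(rows, ["k"]), count_per_group)))
```

Both partitioners produce the same groups. The sort-based one emits them in key order, while the hash-based one emits them in the order in which keys are first seen.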
Implementation Diagram
[Figure: GApply evaluated as a nested loop (NL). The outer child (outer query) feeds the partition phase; each group is passed as a tmp relation, group, to the inner child (inner query) in the execution phase.]
Expose GApply in Syntax
- Difficult for the parser and optimizer to
determine when GApply applies
- Tests on Microsoft SQL Server 2000 with the GApply
operator not exposed in syntax
- The need for GApply is only sometimes identified
by the optimizer
- In each case where GApply is used, it speeds up
performance considerably
Proposed Syntax
- Proposed extension to SQL syntax
- SQL query performing groupwise processing
- select gapply(PGQ(x)) as <column list>
- from <relation list>
- where <conditions>
- group by <grouping columns> : x
- x is a relation-valued variable
Example Query in Syntax
- Query Q1
- select gapply(PGQ1(tmpSupp))
- from partsupp, part
- where ps_partkey = p_partkey
- group by ps_suppkey : tmpSupp
- PGQ1(tmpSupp)
- select p_name, p_retailprice, null
- from tmpSupp
- union all
- select null, null, avg(p_retailprice)
- from tmpSupp
Transformation Rules
- Precise semantics of the operators
- Three categories
- 1) Pushing Computation into the Outer Query
- Placing Projections Before GApply
- Placing Selections Before GApply
- Converting GApply to groupby
- 2) Group Selection
- 3) Pushing GApply Below Joins
Rule 2
- Group Selection
- Consider PGQs that either return the whole group
(subtree) or nothing, based on a predicate
- Two methods to evaluate
- 1) Join suppliers and parts, group by suppkey, check
the selection predicate on each group, and return the
group if it holds
- 2) Apply the selection first to get the qualifying
suppkeys, then join back to return those groups
- Second method wins if the predicate is highly
selective
Rule 2 cont.
- Example
- for $s in /doc(tpch.xml)/suppliers
- /supplier/part/p_retailprice > 1000
- return $s
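The two evaluation methods for group selection can be compared on a small in-memory example. The sketch below is illustrative only. It assumes the predicate is existential (a supplier qualifies if at least one of its parts exceeds the threshold), that rows are dictionaries over the joined partsupp and part data, and the function names are hypothetical.

```python
def select_groups_after_grouping(rows, threshold):
    """Method 1: group every row, test the predicate per group, keep qualifying groups."""
    groups = {}
    for row in rows:
        groups.setdefault(row["ps_suppkey"], []).append(row)

    output = []
    for group in groups.values():
        if any(r["p_retailprice"] > threshold for r in group):
            output.extend(group)
    return output


def select_keys_then_join(rows, threshold):
    """Method 2: find qualifying supplier keys first, then join back to fetch their rows."""
    keys = {r["ps_suppkey"] for r in rows if r["p_retailprice"] > threshold}
    return [r for r in rows if r["ps_suppkey"] in keys]


if __name__ == "__main__":
    rows = [
        {"ps_suppkey": 1, "p_name": "P1", "p_retailprice": 10},
        {"ps_suppkey": 1, "p_name": "P2", "p_retailprice": 1500},
        {"ps_suppkey": 2, "p_name": "P21", "p_retailprice": 12},
        {"ps_suppkey": 3, "p_name": "P31", "p_retailprice": 2000},
    ]
    print(select_groups_after_grouping(rows, 1000))
    print(select_keys_then_join(rows, 1000))
```

Both functions return the same rows. In this toy version the second one never materializes groups that fail the predicate, which loosely mirrors why the second method is expected to win when few groups qualify.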
Integrating Rules in Optimizer
- None of the rules above loop -> optimizer
terminates
- Optimizer must estimate the cost of the GApply
operation
Preliminary Experiments
- Performance study
- Measure the efficacy of the GApply operator in
speeding up queries
- Understand the impact of each proposed
transformation rule
- Microsoft SQL Server 2000
- Supports GApply without syntax exposure
- Control over GApply invocation is needed
- Simulate operation of GApply on the client side
Client Side Simulation of GApply
- Partition
- Sorting
- Hashing (simulation)
- Execute
- Store result of outer query in temporary table
- For each distinct tmp group relation, evaluate
PGQ on that relation, then union all results
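A client-side simulation along these lines can be sketched with Python's sqlite3 module. This is not the harness used in the experiments, which ran against Microsoft SQL Server 2000. The table names `outer_result` and `tmp_group`, the toy data, and the helper names are assumptions made for illustration.

```python
import sqlite3


def make_toy_db():
    """Build an in-memory database with a tiny part/partsupp data set."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        create table part(p_partkey integer, p_name text, p_retailprice real);
        create table partsupp(ps_suppkey integer, ps_partkey integer);
        insert into part values (1, 'P1', 10), (2, 'P2', 10),
                                (3, 'P21', 12), (4, 'P22', 13);
        insert into partsupp values (1, 1), (1, 2), (2, 3), (2, 4);
        """
    )
    return conn


def simulate_gapply(conn, outer_sql, gcol, pgq_sql):
    """Store the outer result, then run pgq_sql once per distinct value of gcol."""
    conn.execute("drop table if exists temp.outer_result")
    conn.execute("drop table if exists temp.tmp_group")
    # Store the result of the outer query in a temporary table.
    conn.execute(f"create temp table outer_result as {outer_sql}")
    # Empty table with the same columns; refilled with one group at a time.
    conn.execute("create temp table tmp_group as select * from outer_result where 0")

    keys = [k for (k,) in conn.execute(
        f"select distinct {gcol} from outer_result order by {gcol}")]
    results = []
    for k in keys:
        conn.execute("delete from tmp_group")
        conn.execute(
            f"insert into tmp_group select * from outer_result where {gcol} = ?", (k,))
        # Evaluate the per-group query on this group and union the rows.
        results.extend(conn.execute(pgq_sql).fetchall())
    return results


OUTER_Q1 = """select ps_suppkey, p_name, p_retailprice
              from partsupp, part where ps_partkey = p_partkey"""
PGQ1 = """select p_name, p_retailprice, null from tmp_group
          union all
          select null, null, avg(p_retailprice) from tmp_group"""

if __name__ == "__main__":
    for row in simulate_gapply(make_toy_db(), OUTER_Q1, "ps_suppkey", PGQ1):
        print(row)
```

Each group costs at least three statements issued from the client here, so elapsed time measured this way includes overhead that a server-side operator would not pay.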
Estimate Running Time
- Measure both elapsed time and CPU time
- Operator trees have GApply as the topmost
operator
- Expect real elapsed time to be lower in a full
server-side implementation
Setup
- Experimental Setup
- TPCH benchmark data
- 5GB database
- Server
- 1 GHz processor
- 784 MB main memory
- 512 MB buffer pool
- Each query was run several times and the average
was taken
Results
- Effectiveness of GApply
- Comparable whether partitioning is performed using
sorting or hashing
- Tested 4 queries representing a wide range of
queries
GApply Effectiveness Results
- Main conclusions
- GApply is a useful operator even for simple
XQuery queries
- Yields speedups of up to a factor of 2
- Queries are representative of a wide class of queries
- Q4 took 20 longer with the client-side
implementation
- Q1, Q2, Q3: expect performance improvements with a
server-side implementation
(hash-based partitioning)
Results cont.
- Effectiveness of Optimization Rules
- Tested the improvement obtained by firing each
rule
- Performance metric is elapsed time
- Method
- Choose a relevant parameterized query
- Vary the parameter and find the performance benefit
for each value
- Benefit: ratio of elapsed time without the rule to
elapsed time with the rule fired
Rule Effectiveness Example
- Query
- for $s in /doc(tpch.xml)/suppliers
- /supplier/part/p_retailprice > x
- return $s
- The parameter x determines the selectivity of the
selection
Results cont.
- Effectiveness of Optimization Rules
- Main conclusions
- Proposed rules can have a significant impact on the
elapsed time of a query involving GApply
- Some rules always lowered the cost of the query,
while others sometimes lowered and sometimes
increased it
- Benefit of converting GApply to groupby is
comparatively lower
Related Work
- Xperanto Project
- Concluded that pushing as much computation as
possible into the relational engine is best
- SilkRoute Project
- Language to specify the conversion between
relational data and XML
- ROLEX Project
- To avoid inefficient parsing in applications, the
relational engine returns a navigable result tree
- Difference
- This paper asks whether the whole process of XML
publishing has any impact on the core relational
operators (YES)
Conclusions
- Relational engine must provide support for
binding variables to sets of tuples
- Required support can be enabled through the
GApply operator with seamless integration into
existing relational engines
- Operator should be exposed in the syntax
- Optimization rules are needed
Future Problems
- How should the modified syntax be exploited by
algorithms that translate XML queries over XML
views of relational data?
- Are any other changes needed to meet the
requirements of XML publishing?
- What changes are needed in the optimizer if the
relational database returns navigable results?
Other Papers
- D. Chatziantoniou and K. A. Ross. Querying
multiple features of groups in relational
databases. In VLDB, 1996.
- Extension to SQL syntax with relational algebra
implementation
- D. Chatziantoniou and K. A. Ross. Groupwise
processing of relational queries. In VLDB, 1997.
- Methods to identify group query components
- C. A. Galindo-Legaria and M. M. Joshi. Orthogonal
optimization of subqueries and aggregation. In
SIGMOD, 2001.
- Introduction of segmentApply operator and many
transformation rules