Title: Mary F. Fernandez
1SilkRoute A Framework for Publishing Rational
Data in XML
- Mary F. Fernandez
- Yana Kadiyska
- Dan Suciu
- Atsuyuki Morishima
- Wang Chiew Tan
ACM Transactions on Database System 2002, volume
27, no. 4
2Outline
- Introduction
- System Architecture
- XQuery to SQL Translation
- XML Publishing Optimization
- Conclusion
3Introduction
4Introduction
- XML as language for data exchange between
applications - Platform neutral
- Storage independent
- Semantics rich
- But most existing data are stored in RDBMS
- Conversion from RDBMS to XML should be
- General
- Selective
- Efficient
5Design Goals
- General
- Two different data representations
- Inherent complex conversion
- No public XML schema matches exactly relational
counterparts - Need for mapping multiple XML to same relational
source under different situations
XML Relational
Nested Flat
Unnormalized Normalized
Public Proprietary
6Design Goals
- Selective
- XML schema as virtual view
- XQuery may only extract parts of the view
- Only materialize the data fragment needed
- Efficient
- Reduce computational overhead as much as possible
- Exploit optimization and evaluation engines
provided by DB
7Contributions
- XQuery to SQL Translation
- Separate structure of XML output from computation
which constructs it - Introduce view forest which nodes are mapping
from XML to SQL - Optimization Problem for XML Publishing
- XQuery is mapped to one or more SQL
- Selecting the most efficient set
8System Architecture
9System Architecture
- 3-tier architecture
- Backend database for storage
- SilkRoute as middleware
- Third party application (Client)
10System Architecture
- SilkRoute transforms relational schema to
canonical view forest
11System Architecture
- Administrator uses XQuery to extract public XML
view over canonical view, this gives public view
forest
12System Architecture
- Client application uses XQuery to data in
interest, this gives application view forest
13System Architecture
- Planner inside SilkRoute gets application view
forest to produce - SQLQuery (which then enters RDBMS)
- XMLTemplate (which then enters XML Generator)
14System Architecture
- Tuples from RDBMS RDBMS is fed into the
XMLTemplate inside generator - Final XML output is then returned to the client
15XQuery to SQL Translation
16View Forest
- Internal representation of XML, a critical role
in SilkRoute - Each node carries
- XML label
- SQL query fragment
- For internal nodes, XML label is either element
or attribute names SQL fragment contains FROM
and WHERE clauses - For leaf nodes, XML label is atomic type SQL
fragment consists of SELECT clause
17View Forest SQL Association
- A complete SQL query Cn is associated with each
node n - SELECT if n is leaf node, use its SELECT
clause - otherwise SELECT
- FROM concat all FROM clauses of n and its
ancestor - WHERE concat all WHERE clauses of n and its
ancestor
18View Forest An Example
- To view the list of clothing item, call CN1.1.5.1
- SELECT c.price
- FROM Clothing c
- To view all data, call CN1.1
- SELECT
- FROM Clothing c
19An Example
- To view product code, call CN1.2.6.1.1
- SELECT p.code
- FROM Clothing c, Problems p
- WHERE c.category outerwear
- AND p.pid c.pid
20VFCA
- View Forest Composition Algorithm
- Translate XQuery into view forest
- Used in view forest composer
- Operate on a subset of XQuery syntax, XQueryCore,
which includes everything but NOT - Recursive functions and operators (eg before ,
after, is, isnot - Functions depending on XML document order
21VFCA
- Suppose
- I instance of relational schema
- V(I) view forest corresponding to the instance
I - Q(V(I)) XQuery over the view forest V(I), the
result return from this query is again a view
forest, denoted by VQ(I) - Input An XQueryCore expression Q over view
forests V1, , Vm - Output A view forest VQ VFCA(V1, , Vm) such
that - Q(V1, , Vm) VQ(I) for any I of the schema
22VFCA
- Traditional XQuery Evaluation
- Inner XQuery is evaluated first to return results
for outer XQuery - Producing a number of intermediate XML outputs
- VFCA
- Only view forest is produced, eliminating
intermediate results - Data extraction is done once using the set of SQL
realized from forest
23XML Publishing Optimization
24View Forest Evaluation
- Steps in evaluating view forest
- Planner adds a key to each SQL fragment
- Planner partitions each view tree into one or
more subtrees - Each subtree is realized to a SQL
- The set of all SQL is submitted to relational
engine - XML generator consumes tuple streams and
construct one single integrated relation - XML output is produced by nesting tuples
integrated relation and tagging each element
25View Forest Evaluation
26View Forest Evaluation
- The set of SQL emitted from view forest
evaluation is called a plan - Given a view forest , there can be many different
plans - (exponential to the number of edges)
- Two extreme cases
- Emit a SQL for each node (fully partitioned
strategy) - Emit a SQL for each tree in the forest (unified
strategy)
27View Forest Evaluation
- LEFT OUTER JOIN to connect a parent and child
nodes SQL - UNION to bind together sibling nodes SQL
- ORDER BY is needed to allow XML generator takes
only one single pass over the tuple streams - The tree can also be split into connected
components - SQL is then generated for each component
28View Forest Evaluation
- SELECT 1 AS L1, c.pid, L2, (c.price Q.discount)
as sale, Q.code, Q.comments - FROM Clothing c LEFT OUTER JOIN
- ((SELECT 1 AS L2, d.pid AS pid, d.discount AS
discount, null AS code, null AS comments FROM
Discount d) - UNION
- (SELECT 2 AS L2, p.pid AS pid, null AS discount,
p.code AS code, p.comments AS comments FROM
Problems p)) AS Q ON c.pid Q.pid - ORDER BY L1, c.pid, L2, Q.code
29View Forest Evaluation
- SELECT 1 as L1, c.pid, (d.discount c.price)
- FROM Cloothing c
- LEFT OUTER JOIN (SELECT d.pid, d.discount FROM
Discount d) - ON c.pid d.pid ORDER BY c.pid
- SELECT 2 as L1, c.pid, p.comments
- FROM Clothing c, Problems p
- WHERE c.pid p.pid
- ORDER BY c.pid, p.code
30Choosing Efficient Plan
- As there exists exponential number of plans,
SilkRoute resorts to heuristcs - UNION and OUTER JOIN are costly operations
- Only choosing edges whose two associated queries
are less expensive to evaluate together than
separately - Cost Difference cost(qc) (cost(q1) cost(q2))
- Where qc is the combined query of parent and
child nodes - And q1 and q2 are the parent query and child
query respectively
31Choosing Efficient Plan
- Seek for cost estimates from relational source
engine - The estimate cost of a query q is obtained by
- cost(q, a, b) a evaluation_cost(q) b
data_size(q) - AND data_size(q) f(attrs(q) cardinality(q))
- Relational engine provides evaluation_cost and
cardinality - a and b are weights
32Results
33Conclusion
- Proprietary commercial products have prospered in
this field - DB2
- SQL Server
- Oracle
- Native XML database
- SilkRoute free!! (work with MySQL/PostgreSQL)
34Thank you!!