Title: Database Systems and XML
1Database Systems and XML
- David Wu
- CS 632
- April 23, 2001
2Researched Papers
- J. Shanmugasundaram, et al. "Efficiently
Publishing Relational Data as XML Documents",
VLDB Conference, September 2000. - J. Shanmugasundaram, et al. "Relational Databases
for Querying XML Documents Limitations and
Opportunities," VLDB Conference, September 1999.
3Efficiently Publishing Relational Data as XML
Documents
4Motivation
- Relational database systems and XML are heavily
used on the Web. - Would like some way to publish relational data as
XML.
5What is Needed
- Language to specify the conversion from
relational data to XML. - Implementation to efficiently carry out the
conversion.
6SQL Based Language
7Implementation Alternatives
- Main differences between relations and XML
- XML docs have tags
- XML has nested structure
8Early Tagging, Early Structuring
- Stored Procedure Approach (outside engine)
- Performs a nested-loop join by issuing queries
for each nested structure in the desired XML. - High overhead due to the number of queries.
- Fixed join order.
9Early Tagging, Early Structuring
- Correlated CLOB Approach (inside engine)
- Have one large query with sub-queries is run
within the engine. - Must add XML constructor support to the engine.
- XML fragments from the constructors are stored as
CLOBs (Character Long Objects). Costly to
handle. - De-Correlated CLOB Approach (inside)
- Perform query de-correlation to give optimizer
more flexibility.
10Late Tagging, Late Structuring
- Two phases
- Content creation
- Tagging and structuring
11Late Tagging, Late Structuring
- Content Creation Redundant Relation Approach
- Join all source tables
- Both content and process redundancy
12Late Tagging, Late Structuring
- Content creation Outer Union Approach
- Separate the children of the same parent (e.g.
one tuple should represent either account or
purchaseOrder). - At the end outer union the results.
- Still some data redundancy (e.g. parent info)
13Late Tagging, Late Structuring
14Late Tagging, Late Structuring
- Structuring/Tagging Hashed-based Tagger
- Group by hashing
- Extract tuples and tag them.
15Late Tagging, Early Structuring
- Late Tagging, Late Structuring requires much
memory for the hash table. - Fix by creating structured content and then tag.
16Late Tagging, Early Structuring
- Structured content Sorted Outer Union Approach
- Desired format
- Parent information comes before or with its child
- All info of a node and its descendants occur
together - Relative order of the tuples matches
user-specified order - Achieve by performing a sort on ids on the result
of the outer union.
17Late Tagging, Early Structuring
- Tagging Sorted DataConstantSpaceTagger
- Can append tags as soon as data is seen.
- Only need to remember the parent ids of the last
tuple seen to know when to append closing tags.
18Experiement
- Inside Engine
- Outside Engine
19Breakdown of Construction
20Summary of Results
- Constructing inside the relational engine is more
efficient. - When processing can be done in main mem, the
Unsorted Outer Union approach wins. - When main mem is not enough, the Sorted Outer
Union approach is best.
21Relational Databases for Querying XML Documents
22Why Bother?
- XML is becoming the standard for data
representation in WWW. - A query engine designed to tap information from
XML documents is valuable. - Relational database system is a mature technology
and could be used to support XML querying.
23Basic Idea
- Step 1 Generate a relational schema from the
DTD - Step 2 Parse the XML document and load the data
into tuples of the relational table. - Step 3 Translate the semi-structured XML
queries into SQL corresponding to the
relational data. - Step 4 Convert the result back to XML.
24Translating XML to Relational Schema
- Main Issues
- DTDs complexity
- Arbitrary nesting of XML DTDs vs. two-level
nature of relational schemas. - Set-valued attributes and recursion
25- Flattening transformation
- Simplification transformation of unary operations
- Grouping transformation
26 Techniques to translate XML DTD to relations.
- Basic Inlining Technique
- Shared Inlining Technique
- Hybrid Inlining technique
27Basic Inlining Technique
- Inlining as many descendants of an element into a
relation. (authorfirstname,lastname,address) - Every element will have a relation corresponding
to it. (firstname, lastname, and address will all
have elements)
28Basic Inlining Technique (cont.)
- Complications
- Set-valued attributes (eg. Article)
- Solve by using foreign keys and other tables.
- Recursion
- Solve with relational keys and relational
recursive processing to retrieve the
relationship.
29Tools used in creating relations
- DTD Graph
- Nodes are elements, attributes,operators
- Each element appears once
- Attributes and operators appear as many times as
they do in the DTD - Cycles in the graph indicates recursion
30Tools used in creating relations
- Element Graphs
- Generated from the DTD graph
- Created by doing a DFS from an element node
31Creating a Relation
- Given an element graph, the root it made into
- a relation with all descendents inlined into it,
- except
- Children directly below a are made into
separate relations - Each node with a backpointer edge are made into
separate relations. - These additional relations are named by their
path - from the root and have parentID fields that serve
as - foreign keys (e.g. Article.author has the
attribute - article.author.parentID)
32Problems with Basic
- Large number of relations it creates
- Not efficient for certain queries
- Good list all authors of books
- Bad list all authors having first name Jack
33Shared Inlining Technique
- Idea Identify commonly used element nodes and
share them by creating separate relations for
them.
34Shared Inlining Technique
- Rules for creating relations
- Nodes with in-degreegt1 have relations made
- Nodes with in-degree1 are inlined
- Nodes with in-degree0 have relations made
- Nodes following have relations made
- Nodes with in-degree1 AND mutually recurive, one
of them is made into a relation
35Shared Inlining Technique
- Rules for designing the schema
- Relation X inlines all nodes Y that it an reach
such that the path from X to Y does not contain a
node that is to be made a separate relation. - Inlined elements are flagged as being a root with
the isRoot field.
36Problems with Shared
37Hybrid Inlining Technique
- Same as Shared except Hybrid also inlines
elements that - have in-degreegt1 AND
- are not recursive AND
- are not reached through a node.
38Evaluation Metric
- For path expressions of length N, data was
gathered on - The avg number of SQL queries generated
- The avg number of joins in each SQL query
- The total average number of joins in order to
process the path expression
39Results for N3
- For Basic, 1/3 of the DTDs tests didnt run to
completion due to lack of virtual memory. Basic
is thus ignored.
40Results for N3
41Results for N3
- Group 1 Hybrid reduce join/query, increases a
smaller amount of queries gt Hybrid requires
fewer joins than shared. - Group 2 Hybrid reduces join/query, increases a
comparable amount of queriesgt Hybrid and Shared
are the same.
42Results for N3
- Group 3 Hybrid reduces some joins/query, but
increased the queries by a lot gt Hybrid
generates more joins than Shared. - Hybrid and Shared performed similarly in both
joins/query and of queries gt Hybrid and
Shared are about the same.
43Semi-Structured Queries to SQL
- Semi-structured query languages
- Allow path expressions with various operators and
wildcards. - XML-QL Query Lorel
44Simple Path to SQL
- The relations corresponding to the start of the
root path is added to the FROM clause. - If needed, the path expressions are translated to
joins.
45Simple Recursive Path to SQL
- Find initialization of the recursion (e.g.
.monograph.editor with condition
monograph.title Subclass Cirripedia) - Find the actual recursive path expression (e.g.
monograph.editor) - Union the two
46Arbitrary Path to Simple Recursive Path
- Use a general technique to translate path
expressions to many simple (recursive) path
expressions. -
47Relational Results to XML Simple Structuring
- Requires only attaching appropriate tags to each
tuple.
48Relational Results to XML Tag Variables
- Have the relational query contain the tag value
in the result tuple. Then just covert it to a
tag during XML generation.
49Grouping
- Could sort the result tuples by the group-by
field and and scan through it in order when
generating the XML. - Could do a grouping operation.
50Other Cases
- Complex Element Construction
- e.g. asking for all article elements and assume
that may be multiple elements (e.g. author
title) - Difficult to do in traditional relational model.
- Heterogeneous Results
- e.g. asking for either title or author of
article. - Could be done in two queries and then merged.
51Other Cases
- Nested Queries
- Could be rewritten in terms of SQL queries using
outer joins.
52Conclusion
- Suggested modifications to relational systems
- Untyped/variable-typed references.
- Information retrieval style indices
- Flexible comparison operators
- Multiple-query optimization/execution
- More powerful recursion support.