Database Systems and XML - PowerPoint PPT Presentation

1 / 52

About This Presentation

Title:

Database Systems and XML

Description:

... queries is run within the engine. Must add XML constructor support to the engine. ... A query engine designed to tap information from XML documents is valuable. ... – PowerPoint PPT presentation

Number of Views:49

Avg rating:3.0/5.0

Slides: 53

Provided by: willia160

Learn more at: http://www.cs.cornell.edu

Category:

more less

Transcript and Presenter's Notes

Title: Database Systems and XML

1
Database Systems and XML

David Wu
CS 632
April 23, 2001

2
Researched Papers

J. Shanmugasundaram, et al. "Efficiently
Publishing Relational Data as XML Documents",
VLDB Conference, September 2000.
J. Shanmugasundaram, et al. "Relational Databases
for Querying XML Documents Limitations and
Opportunities," VLDB Conference, September 1999.

3
Efficiently Publishing Relational Data as XML
Documents
4
Motivation

Relational database systems and XML are heavily
used on the Web.
Would like some way to publish relational data as
XML.

5
What is Needed

Language to specify the conversion from
relational data to XML.
Implementation to efficiently carry out the
conversion.

6
SQL Based Language
7
Implementation Alternatives

Main differences between relations and XML
XML docs have tags
XML has nested structure

8
Early Tagging, Early Structuring

Stored Procedure Approach (outside engine)
Performs a nested-loop join by issuing queries
for each nested structure in the desired XML.
High overhead due to the number of queries.
Fixed join order.

9
Early Tagging, Early Structuring

Correlated CLOB Approach (inside engine)
Have one large query with sub-queries is run
within the engine.
Must add XML constructor support to the engine.
XML fragments from the constructors are stored as
CLOBs (Character Long Objects). Costly to
handle.
De-Correlated CLOB Approach (inside)
Perform query de-correlation to give optimizer
more flexibility.

10
Late Tagging, Late Structuring

Two phases
Content creation
Tagging and structuring

11
Late Tagging, Late Structuring

Content Creation Redundant Relation Approach
Join all source tables
Both content and process redundancy

12
Late Tagging, Late Structuring

Content creation Outer Union Approach
Separate the children of the same parent (e.g.
one tuple should represent either account or
purchaseOrder).
At the end outer union the results.
Still some data redundancy (e.g. parent info)

13
Late Tagging, Late Structuring

Outer Union Plan

14
Late Tagging, Late Structuring

Structuring/Tagging Hashed-based Tagger
Group by hashing
Extract tuples and tag them.

15
Late Tagging, Early Structuring

Late Tagging, Late Structuring requires much
memory for the hash table.
Fix by creating structured content and then tag.

16
Late Tagging, Early Structuring

Structured content Sorted Outer Union Approach
Desired format
Parent information comes before or with its child
All info of a node and its descendants occur
together
Relative order of the tuples matches
user-specified order
Achieve by performing a sort on ids on the result
of the outer union.

17
Late Tagging, Early Structuring

Tagging Sorted DataConstantSpaceTagger
Can append tags as soon as data is seen.
Only need to remember the parent ids of the last
tuple seen to know when to append closing tags.

18
Experiement

Inside Engine
Outside Engine

19
Breakdown of Construction
20
Summary of Results

Constructing inside the relational engine is more
efficient.
When processing can be done in main mem, the
Unsorted Outer Union approach wins.
When main mem is not enough, the Sorted Outer
Union approach is best.

21
Relational Databases for Querying XML Documents
22
Why Bother?

XML is becoming the standard for data
representation in WWW.
A query engine designed to tap information from
XML documents is valuable.
Relational database system is a mature technology
and could be used to support XML querying.

23
Basic Idea

Step 1 Generate a relational schema from the
DTD
Step 2 Parse the XML document and load the data
into tuples of the relational table.
Step 3 Translate the semi-structured XML
queries into SQL corresponding to the
relational data.
Step 4 Convert the result back to XML.

24
Translating XML to Relational Schema

Main Issues
DTDs complexity
Arbitrary nesting of XML DTDs vs. two-level
nature of relational schemas.
Set-valued attributes and recursion

Flattening transformation
Simplification transformation of unary operations
Grouping transformation

26
Techniques to translate XML DTD to relations.

Basic Inlining Technique
Shared Inlining Technique
Hybrid Inlining technique

27
Basic Inlining Technique

Inlining as many descendants of an element into a
relation. (authorfirstname,lastname,address)
Every element will have a relation corresponding
to it. (firstname, lastname, and address will all
have elements)

28
Basic Inlining Technique (cont.)

Complications
Set-valued attributes (eg. Article)
Solve by using foreign keys and other tables.
Recursion
Solve with relational keys and relational
recursive processing to retrieve the
relationship.

29
Tools used in creating relations

DTD Graph
Nodes are elements, attributes,operators
Each element appears once
Attributes and operators appear as many times as
they do in the DTD
Cycles in the graph indicates recursion

30
Tools used in creating relations

Element Graphs
Generated from the DTD graph
Created by doing a DFS from an element node

31
Creating a Relation

Given an element graph, the root it made into
a relation with all descendents inlined into it,
except
Children directly below a are made into
separate relations
Each node with a backpointer edge are made into
separate relations.
These additional relations are named by their
path
from the root and have parentID fields that serve
as
foreign keys (e.g. Article.author has the
attribute
article.author.parentID)

32
Problems with Basic

Large number of relations it creates
Not efficient for certain queries
Good list all authors of books
Bad list all authors having first name Jack

33
Shared Inlining Technique

Idea Identify commonly used element nodes and
share them by creating separate relations for
them.

34
Shared Inlining Technique

Rules for creating relations
Nodes with in-degreegt1 have relations made
Nodes with in-degree1 are inlined
Nodes with in-degree0 have relations made
Nodes following have relations made
Nodes with in-degree1 AND mutually recurive, one
of them is made into a relation

35
Shared Inlining Technique

Rules for designing the schema
Relation X inlines all nodes Y that it an reach
such that the path from X to Y does not contain a
node that is to be made a separate relation.
Inlined elements are flagged as being a root with
the isRoot field.

36
Problems with Shared

Too many joins required!

37
Hybrid Inlining Technique

Same as Shared except Hybrid also inlines
elements that
have in-degreegt1 AND
are not recursive AND
are not reached through a node.

38
Evaluation Metric

For path expressions of length N, data was
gathered on
The avg number of SQL queries generated
The avg number of joins in each SQL query
The total average number of joins in order to
process the path expression

39
Results for N3

For Basic, 1/3 of the DTDs tests didnt run to
completion due to lack of virtual memory. Basic
is thus ignored.

40
Results for N3
41
Results for N3

Group 1 Hybrid reduce join/query, increases a
smaller amount of queries gt Hybrid requires
fewer joins than shared.
Group 2 Hybrid reduces join/query, increases a
comparable amount of queriesgt Hybrid and Shared
are the same.

42
Results for N3

Group 3 Hybrid reduces some joins/query, but
increased the queries by a lot gt Hybrid
generates more joins than Shared.
Hybrid and Shared performed similarly in both
joins/query and of queries gt Hybrid and
Shared are about the same.

43
Semi-Structured Queries to SQL

Semi-structured query languages
Allow path expressions with various operators and
wildcards.
XML-QL Query Lorel

44
Simple Path to SQL

The relations corresponding to the start of the
root path is added to the FROM clause.
If needed, the path expressions are translated to
joins.

45
Simple Recursive Path to SQL

Find initialization of the recursion (e.g.
.monograph.editor with condition
monograph.title Subclass Cirripedia)
Find the actual recursive path expression (e.g.
monograph.editor)
Union the two

46
Arbitrary Path to Simple Recursive Path

Use a general technique to translate path
expressions to many simple (recursive) path
expressions.

47
Relational Results to XML Simple Structuring

Requires only attaching appropriate tags to each
tuple.

48
Relational Results to XML Tag Variables

Have the relational query contain the tag value
in the result tuple. Then just covert it to a
tag during XML generation.

49
Grouping

Could sort the result tuples by the group-by
field and and scan through it in order when
generating the XML.
Could do a grouping operation.

50
Other Cases

Complex Element Construction
e.g. asking for all article elements and assume
that may be multiple elements (e.g. author
title)
Difficult to do in traditional relational model.
Heterogeneous Results
e.g. asking for either title or author of
article.
Could be done in two queries and then merged.

51
Other Cases