Efficient Relational Storage and Retrieval of XML Documents - PowerPoint PPT Presentation

Provided by: Jil797

Learn more at: http://web.cs.ucla.edu

Transcript and Presenter's Notes

1
Efficient Relational Storage and Retrieval of XML
Documents

Jill Chen, Mojdeh Makabi
CS240B

2
References

Kanda Runapongsa and Jignesh M. Patel. Storing
and Querying XML Data in Object-Relational DBMSs.
In A.B. Chaudhri et al. (Eds.), EDBT 2002 Workshops,
LNCS 2490, pp. 266-285, 2002.
H. Liefke and D. Suciu. XMill: An Efficient
Compressor for XML Data. In Proceedings of the
ACM SIGMOD International Conference on Management
of Data, pp. 153-164, Dallas, Texas, May 2000.
C. Kanne and G. Moerkotte. Efficient Storage of
XML Data. ICDE 2000. Available at
http://citeseer.nj.nec.com/kanne99efficient.html
Albrecht Schmidt, Martin Kersten, Menzo
Windhouwer, and Florian Waas. Efficient
Relational Storage and Retrieval of XML
Documents. WebDB 2000. Available at
http://www.research.att.com/conf/webdb2000/program.html

3
XML

XML is assuming the role of the standard data
exchange format in Web database environments
XML is semi-structured; one consequence is that
we cannot expect all instances of one type to
share the same structure
Modeling issues arise from the mismatch between
semi-structured data on the one hand and fully
structured database schemas on the other
To make XML the language of Web databases, there
must be effective tools for managing XML
documents

4
Monet XML Model

Efficient Relational Storage and Retrieval of XML
Documents
The data model is based on the notion of binary
associations
It decomposes XML documents into small, flexible
and semantically homogeneous units
It is very efficient

5
XML documents and Syntax Tree
<bibliography>
  <article key="BB88">
    <author>Ben Bit</author>
    <title>How to Hack</title>
  </article>
  <article key="BK99">
    <editor>Ed Itor</editor>
    <author>Bob Byte</author>
    <author>Ken Key</author>
    <title>Hacking RSI</title>
  </article>
</bibliography>

6
Main Question

The question central to querying XML documents is
how to store the syntax tree as a database instance
that provides efficient retrieval capabilities

7
Different Approaches

The tree could be stored in a single database
table
This makes querying expensive, because it
enforces scans over large amounts of data
irrelevant to a query
Even with few joins, large data volumes may have
to be processed
The tree could be stored by placing all associations
of the same type in the same binary relation
This is the approach used in the Monet XML Model

8
Monet XML Model

The basis for the Monet XML Model
Paths
Associations
Binary Relations

9
Path

For a node o in the syntax tree, its path is the
sequence of labels along the path (vertex and
edge labels) from the root to o
Paths describe the position of an element in the
graph relative to the root node

10
Associations

A pair (o, ·) ∈ oid × (oid ∪ string) is called an
association
The different types of associations describe
different parts of the tree
Associations of type oid × oid represent edges
Associations of type oid × string represent
attribute values

11
Binary Relation

To transform an XML document into the Monet
model, we need the set of binary relations that
contains all associations between nodes
All associations of the same type are stored in
the same binary relation
Example

For the association between bibliography and article:
(O1, O2), (O1, O7)
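
The decomposition into path-keyed binary relations can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the function name `decompose`, the `@key` and `cdata/string` path suffixes, and the pre-order oid numbering are assumptions made for the example, and element order (rank) and mixed-content tail text are ignored.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict
from itertools import count

# The bibliography example document from the slides.
DOC = """<bibliography>
  <article key="BB88"><author>Ben Bit</author><title>How to Hack</title></article>
  <article key="BK99"><editor>Ed Itor</editor><author>Bob Byte</author>
    <author>Ken Key</author><title>Hacking RSI</title></article>
</bibliography>"""


def decompose(xml_text):
    """Return {path: [(oid, oid-or-string), ...]} for one XML document."""
    relations = defaultdict(list)
    oids = count(1)  # oids are handed out in pre-order (an assumption)

    def visit(elem, path, oid):
        # oid x string associations for attribute values
        for name, value in elem.attrib.items():
            relations[f"{path}/@{name}"].append((oid, value))
        # character data gets its own node plus a string association
        text = (elem.text or "").strip()
        if text:
            cdata_oid = next(oids)
            relations[f"{path}/cdata"].append((oid, cdata_oid))
            relations[f"{path}/cdata/string"].append((cdata_oid, text))
        # oid x oid associations for parent-child edges
        for child in elem:
            child_oid = next(oids)
            child_path = f"{path}/{child.tag}"
            relations[child_path].append((oid, child_oid))
            visit(child, child_path, child_oid)

    root = ET.fromstring(xml_text)
    visit(root, root.tag, next(oids))
    return dict(relations)


relations = decompose(DOC)
for path, pairs in relations.items():
    print(path, pairs)
```

Under these assumptions the relation `bibliography/article` holds the pairs (1, 2) and (1, 7), matching the example, and every relation is reached by a single pass over the tree.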
12
Monet Transformation
13
Query
Show Ben Bit's publications whose titles contain
the word "Hack"
14
Single Database Table
<bibliography>
  <article key="BB88">
    <author>Ben Bit</author>
    <title>How to Hack</title>
  </article>
  <article key="BK99">
    <editor>Ed Itor</editor>
    <author>Bob Byte</author>
    <author>Ken Key</author>
    <title>Hacking RSI</title>
  </article>
</bibliography>

SELECT * FROM bibliography
WHERE Author = 'Ben Bit' AND Title LIKE '%Hack%'

Key   Author    Title        Editor
BB88  Ben Bit   How to Hack  NULL
BK99  Bob Byte  Hacking RSI  Ed Itor
BK99  Ken Key   Hacking RSI  Ed Itor

Disadvantages
Scans over large amounts of data
Even with few joins, large data volumes may have
to be processed
NULL values must be added to cover irregularities
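
The single-table layout can be tried out with Python's built-in `sqlite3` module. This is only a sketch: the column name `article_key` and the function name are assumptions for the example, and SQLite stands in for whatever relational system is actually used.

```python
import sqlite3


def single_table_query():
    """Load the bibliography example into one table and run the example query."""
    con = sqlite3.connect(":memory:")
    con.execute(
        "CREATE TABLE bibliography "
        "(article_key TEXT, author TEXT, title TEXT, editor TEXT)"
    )
    con.executemany(
        "INSERT INTO bibliography VALUES (?, ?, ?, ?)",
        [
            # BB88 has no editor, so the row is padded with NULL.
            ("BB88", "Ben Bit", "How to Hack", None),
            # BK99 has two authors, so title and editor are repeated.
            ("BK99", "Bob Byte", "Hacking RSI", "Ed Itor"),
            ("BK99", "Ken Key", "Hacking RSI", "Ed Itor"),
        ],
    )
    rows = con.execute(
        "SELECT article_key, title FROM bibliography "
        "WHERE author = ? AND title LIKE ?",
        ("Ben Bit", "%Hack%"),
    ).fetchall()
    null_editors = con.execute(
        "SELECT COUNT(*) FROM bibliography WHERE editor IS NULL"
    ).fetchone()[0]
    con.close()
    return rows, null_editors


print(single_table_query())
```

Every row of the table is a candidate for the predicate, which is the scan cost listed among the disadvantages, and the missing editor shows up as a NULL.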

15
Monet XML Model

Results in a higher degree of fragmentation
In our example, we have 11 tables
Paths are used to group semantically related
associations into the same relation
A query does not need to scan the entire document
There is no need to introduce novel features on
the storage level to cope with irregularities
induced by the semi-structured nature of XML
The complete decomposition takes time linear in
the size of the document
Memory requirements are linear in the height of
the syntax tree
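
To illustrate why a query touches only the relations on its own paths, here is a hedged Python sketch that answers the "Ben Bit" query over hand-written binary relations. The relation contents, the oid values, the path names, and the helper `articles_with` are all illustrative assumptions, not taken from the paper.

```python
# Binary relations for the bibliography example, keyed by path.
# Only the author and title paths are needed; editor and key are never read.
RELATIONS = {
    "bibliography/article/author": [(2, 3), (7, 10), (7, 12)],
    "bibliography/article/author/cdata": [(3, 4), (10, 11), (12, 13)],
    "bibliography/article/author/cdata/string": [
        (4, "Ben Bit"), (11, "Bob Byte"), (13, "Ken Key")],
    "bibliography/article/title": [(2, 5), (7, 14)],
    "bibliography/article/title/cdata": [(5, 6), (14, 15)],
    "bibliography/article/title/cdata/string": [
        (6, "How to Hack"), (15, "Hacking RSI")],
}


def articles_with(relations, element, predicate):
    """Oids of articles having an `element` child whose text satisfies `predicate`."""
    base = f"bibliography/article/{element}"
    cdata_hits = {oid for oid, text in relations[f"{base}/cdata/string"]
                  if predicate(text)}
    elem_hits = {parent for parent, child in relations[f"{base}/cdata"]
                 if child in cdata_hits}
    return {article for article, child in relations[base] if child in elem_hits}


def ben_bit_hack_titles(relations):
    """Titles containing 'Hack' of articles that have the author 'Ben Bit'."""
    by_author = articles_with(relations, "author", lambda s: s == "Ben Bit")
    by_title = articles_with(relations, "title", lambda s: "Hack" in s)
    wanted = by_author & by_title
    # Walk back down the title path to fetch the strings.
    title_of = dict(relations["bibliography/article/title"])
    cdata_of = dict(relations["bibliography/article/title/cdata"])
    string_of = dict(relations["bibliography/article/title/cdata/string"])
    return sorted(string_of[cdata_of[title_of[a]]] for a in wanted)


print(ben_bit_hack_titles(RELATIONS))
```

The query is evaluated with a handful of small joins over six relations, and the relations holding editors and keys are not touched at all.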

16
Quantitative Assessments

Database Size
The resulting size of the decomposed data is a
critical issue
In the worst case, the size of the path summary
can be linear in the size of the documents, if
the documents are completely unstructured
In practical applications, there are generally
large structured portions
The Monet XML version of the ACM Anthology is
smaller than the original documents
The reduction is due to the removal of redundantly
occurring character data and the removal of tags

Documents            Size in XML  Size in Monet XML  Tables  Loading
ACM Anthology        46.6 MB      44.2 MB            187     30.4 s
Shakespeare's Plays  7.9 MB       8.2 MB             95      4.5 s

17
Comparison of Response Times

Comparing Monet XML against SYU/Postgres
SYU stores all data in a single table and has to
scan this data repeatedly
The Monet transform yields smaller data volumes
We ran a set of 10 queries on Shakespeare's
plays
The substantial difference in response times shows
that Monet XML outperforms the competitor by up to
two orders of magnitude

Response times (ms)
           Q1   Q2   Q3   Q4   Q5   Q6   Q7   Q8   Q9    Q10
Monet XML  1.2  5.6  6.8  8.0  4.4  4.9  5.0  5.0  8.8   12.7
SYU        150  180  160  180  190  340  350  370  1300  1040
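
The "up to two orders of magnitude" claim can be checked directly from the response times. The short Python calculation below assumes that every value in both rows is in milliseconds, since only the first column carries a unit in the slides.

```python
# Response times for Q1..Q10, assumed to be in milliseconds throughout.
MONET_MS = [1.2, 5.6, 6.8, 8.0, 4.4, 4.9, 5.0, 5.0, 8.8, 12.7]
SYU_MS = [150, 180, 160, 180, 190, 340, 350, 370, 1300, 1040]


def speedups(fast, slow):
    """Per-query ratio of the slower system's time to the faster system's time."""
    return [s / f for f, s in zip(fast, slow)]


ratios = speedups(MONET_MS, SYU_MS)
for query_no, ratio in enumerate(ratios, start=1):
    print(f"Q{query_no}: {ratio:.1f}x")
print(f"smallest speedup {min(ratios):.1f}x, largest {max(ratios):.1f}x")
```

Under that assumption the speedup ranges from 22.5x (Q4) to about 148x (Q9), so the largest gaps are just above two orders of magnitude.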
18
Summary

Presented a data model for efficient processing
of XML documents
Experience shows that it is worth taking the
plunge and fully decomposing XML documents into
binary associations
This approach combines the elegance of clear
semantics with a highly efficient execution model
by means of a simple and effective mapping
between XML documents and a relational schema

19
XORator Object-Relational DBMSs
20
Two Dominating Approaches

Use a native XML database engine for storing and
querying data sets
Provides a more natural data model and query
language for XML data (hierarchical or graph
representation)
Map the XML data and queries to constructs
provided by a relational DBMS (RDBMS)
XML data is mapped to relations, and queries on
XML data are converted into SQL queries

21
RDBMS

Advantage
user is not involved in the complexity of mapping
it can be used for querying both XML data and
data that exists in the relational systems
Disadvantage
it can lower performance since a mapping from XML
data to the relational data may produce a
database schema with many relations
queries on XML data when translated to SQL
queries may have many joins, making the queries
expensive to evaluate

22
In the Paper

Object-Relational DBMS (ORDBMS)
Has all the advantages of an RDBMS
More expressive type system than RDBMS
Better suited for XML documents that may use a
richer set of data types
XORator Algorithm
Uses Document Type Definitions (DTDs) to map XML
documents to tables in ORDBMS
New XML data type XADT (XML Abstract Data Type)

23
Storing XML Documents in an ORDBMS: Reducing DTD
Complexity

Apply transformations to reduce the number of
nested expressions and the number of element
items, making the mapping process easier
Flattening (to convert a nested definition into a
flat representation): (e1, e2)* → e1*, e2*
Simplification (to reduce multiple unary
operators into a single unary operator): e1** →
e1*
Grouping (to group subelements that have the same
name): e0, e1*, e1*, e2 → e0, e1*, e2
Replacing "+" by "*": e+ → e*
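
The three transformations can be sketched in Python on a toy content model. Because the operators are not legible in the transcript, the rules coded here are assumptions: flattening is taken as (e1, e2)* → e1*, e2*, simplification as e1** → e1*, grouping as merging repeated names into a single starred item, and "+" is treated like "*". The tuple representation and function names are likewise made up for the example.

```python
# A content model is a list of (node, ops) pairs. `node` is either an element
# name or a nested list (a parenthesised group); `ops` is a string of unary
# operators such as "", "*", "+", "?" or "**".

def simplify_op(ops):
    """Collapse a run of unary operators into at most one operator."""
    if "*" in ops or "+" in ops:
        return "*"  # e** -> e*, and e+ is treated as e* (assumed rule)
    return "?" if "?" in ops else ""


def flatten(items, outer=""):
    """Push a group's operator onto its members: (e1, e2)* -> e1*, e2*."""
    flat = []
    for node, ops in items:
        combined = simplify_op(ops + outer)
        if isinstance(node, list):
            flat.extend(flatten(node, combined))
        else:
            flat.append((node, combined))
    return flat


def group(items):
    """Merge subelements with the same name: e0, e1*, e1*, e2 -> e0, e1*, e2."""
    merged = {}
    for name, op in items:
        # A name that occurs more than once becomes a single repeated item.
        merged[name] = "*" if name in merged else op
    return list(merged.items())


def simplify_dtd(items):
    """Apply flattening and simplification, then grouping."""
    return group(flatten(items))


model = [("e0", ""), ([("e1", ""), ("e2", "+")], "*"), ("e1", "*")]
print(simplify_dtd(model))
```

The result is a flat list with one entry per element name, which is the form the later mapping steps work from.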

24
Reducing DTD Complexity (cont.)
25
Storing XML Documents in an ORDBMS: Building a
DTD Graph
26
Storing XML Documents in an ORDBMS: XORator

XML to OR Translator
Algorithm builds on Hybrid Algorithm
If a non-leaf node N has exactly one parent, and
if there are no links incident on any of the
descendants of this node, then node N is assigned
to an XADT attribute. (If node N were assigned to
a relation, then queries on this node and its
parent would require a join.)

27
XORator (cont.)

If a non-leaf node below a * node is accessed by
multiple nodes, then it is assigned to a
relation. (For nodes that are mapped to
relations, the ancestors of these nodes must also
be assigned to relations.) E.g. scene
If a leaf node is below a * node, then it is
assigned as an attribute of the XADT type.
Otherwise, it is assigned as an attribute of
string type. E.g. line

28
XORator (cont.)
29
Storing XML Documents in an ORDBMS: Defining an
XML Data Type

Compressed representation for the XML fragment
Element tags are mapped to integer codes, which
replace the tags in the stored fragment.
A small dictionary is stored along with the XML
fragment to record the mapping between the
integer codes and the actual element tag names.
Compression is used only if the space saving
exceeds a certain threshold value.
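
The tag-dictionary idea can be sketched in Python as below. This is a rough illustration under stated assumptions, not the XADT storage format: the regular expression only handles plain element tags, the dictionary cost (tag length plus one byte per entry) is invented for the example, and the 20% default threshold is the figure quoted in the performance evaluation slide.

```python
import re

# Matches the name part of an opening or closing element tag.
TAG = re.compile(r"<(/?)([A-Za-z_][\w.-]*)")


def compress_tags(fragment):
    """Replace element tag names by integer codes; return (text, dictionary)."""
    codes = {}

    def encode(match):
        code = codes.setdefault(match.group(2), len(codes))
        return f"<{match.group(1)}{code}"

    compressed = TAG.sub(encode, fragment)
    return compressed, {code: name for name, code in codes.items()}


def decompress_tags(compressed, dictionary):
    """Invert compress_tags using the stored dictionary."""
    return re.sub(
        r"<(/?)(\d+)",
        lambda m: f"<{m.group(1)}{dictionary[int(m.group(2))]}",
        compressed,
    )


def choose_storage(fragment, threshold=0.20):
    """Pick the compressed form only if it saves at least `threshold` of the space."""
    compressed, dictionary = compress_tags(fragment)
    # Assumed dictionary cost: tag name plus one byte for its code.
    dict_size = sum(len(name) + 1 for name in dictionary.values())
    saving = 1 - (len(compressed) + dict_size) / len(fragment)
    if saving >= threshold:
        return "compressed", compressed, dictionary
    return "plain", fragment, {}


speech = "<SPEECH><SPEAKER>HAMLET</SPEAKER><LINE>To be</LINE></SPEECH>"
play = "<PLAY>" + speech * 10 + "</PLAY>"
print(choose_storage(speech)[0], choose_storage(play)[0])
```

A single short speech does not repay the dictionary overhead, while a fragment with many repeated tags does, which is why the decision is made per data set against a threshold.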

30
Defining an XML Data Type (XADT) (cont.)

Methods on the XADT
XADT getElm(XADT inXML, VARCHAR rootElm,
  VARCHAR searchElm, VARCHAR searchKey, INTEGER level)
INTEGER findKeyInElm(XADT inXML, VARCHAR searchElm,
  VARCHAR searchKey)
XADT getElmIndex(XADT inXML, VARCHAR parentElm,
  VARCHAR childElm, INTEGER startPos, INTEGER endPos)

31
Defining an XML Data Type (XADT) (cont.)
32
Defining an XML Data Type (XADT) (cont.)

Unnest Operator
Required when a query needs to examine individual
elements in the set.
E.g. A distinct list of all speakers who speak in
at least one play.
Implemented using a table User-Defined Function
(UDF).
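
What an unnest step does can be shown with a small Python generator. This is a sketch of the idea only, not the table UDF from the paper: the function names, the SPEAKER tag, and the sample fragments are assumptions for illustration.

```python
import xml.etree.ElementTree as ET


def unnest(fragment, tag):
    """Yield the text of each `tag` element in an XML fragment, one per row."""
    # Wrap in a dummy root so a fragment with several top-level elements parses.
    root = ET.fromstring(f"<root>{fragment}</root>")
    for elem in root.iter(tag):
        yield (elem.text or "").strip()


def distinct_speakers(fragments):
    """Distinct list of all speakers across a column of XML fragments."""
    return sorted({s for frag in fragments for s in unnest(frag, "SPEAKER")})


# Hypothetical contents of one XADT-like column, one fragment per tuple.
column = [
    "<SPEAKER>HAMLET</SPEAKER><SPEAKER>HORATIO</SPEAKER>",
    "<SPEAKER>HAMLET</SPEAKER>",
]
print(distinct_speakers(column))
```

Each fragment holds a set of elements; unnesting turns that set into individual rows so that ordinary operations such as DISTINCT can be applied to them.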

33
Defining an XML Data Type (XADT): Unnest
Operator (cont.)
34
Performance Evaluation

Parse a few randomly chosen sample documents to
obtain the storage space sizes in both the
uncompressed and compressed cases. The compressed
format is chosen only if it reduces the storage
space by at least 20%.

35
Performance: Shakespeare Plays

The XORator algorithm chooses not to use the
compressed storage alternative.
The size of the database produced by the XORator
algorithm is about 60% of the size of the
database produced by the Hybrid algorithm.

36
Performance: Larger Data Set

Took the original Shakespeare data set and loaded
it multiple times, producing data sets that were
two, four and eight times the original database
size (DSx2, DSx4, and DSx8).
Query sets
QS1 (Flattening): list speakers and the lines
that they speak
QS2 (Full path expression): retrieve the lines
that have the keyword "Rising" in the text of the
stage direction
QS3: Selection
QS4: Multiple selections
QS5: Twig with selection
QS6: Order access

37
Performance: Larger Data Set

Much shorter loading times
Significantly better execution times for all
queries except QS6
All queries required at least one fewer join
QS6 is slower because, with the XORator
algorithm, the database needs to scan the XADT
attribute to extract elements in the specified
order, while the Hybrid database only needs to
extract the value of the element order attribute

38
Performance: SIGMOD Proceedings Data Set

Deep DTD: representative of the worst-case
scenario for the XORator algorithm.
The compressed storage alternative is used; it
reduces the database size by about 38%.
The size of the database produced by the XORator
algorithm is about 65% of the size of the
database produced by the Hybrid algorithm

39
Performance: Larger Data Set

Took the original SIGMOD Proceedings data set and
loaded it multiple times, producing data sets
that were two, four and eight times the original
database size (DPx2, DPx4, and DPx8).
Query sets
QG1 (Selection and extraction): retrieve the
authors of the papers with the keyword "join" in
the paper title
QG2 (Flattening): list all authors and the names
of the proceeding sections in which their papers
appear
QG3: Flattening with selection
QG4: Aggregation
QG5: Aggregation with selection
QG6: Order access with selection

40
Performance: Larger Data Set

When the size of data is small (DPx1 and DPx2),
the XORator algorithm performs worse than the
Hybrid algorithm.
When the size of data becomes large (DPx4 and
DPx8), the XORator algorithm outperforms the
Hybrid algorithm.
The XORator queries need no table joins, but each
query makes 4 to 8 UDF calls to extract
subelements or to join elements inside an XADT
attribute.

41
Analysis

The cost of invoking UDFs is a significant
component of query evaluation under the XORator
algorithm.
Does a UDF incur a higher performance penalty
than an equivalent built-in function?
Implement two string functions, returning length
and substring, both as UDFs and as built-in
functions, and test the following queries.
QT1: Return the length of the string in the
SPEAKER attribute.
QT2: Return the substring of the string in the
SPEAKER attribute from the fifth position to the
last position.

42
Analysis (cont.)

Using UDFs is about 40% more expensive than using
built-in functions.

43
Analysis (cont.)

Invoking UDFs is expensive because
XADT methods use string compare and copy
functions on VARCHAR. This sometimes requires
scanning a large amount of data.
Possible remedy: associate metadata with each
XADT attribute to quickly access the starting
position of each element.
The cost of evaluating a UDF is higher than that
of an equivalent built-in function.
Possible remedy: implement XADT as a native data
type

44
Performance

As the data size increases, the ratio of the
Hybrid response time to the XORator response time
rises above 1.
Queries using the XORator algorithm have no
joins, so their response times grow at an O(n)
rate (scan cost), where n is the number of tuples
Queries using the Hybrid algorithm have many
joins, so their response times grow at either an
O(n log n) rate (sort-merge join cost) or an
O(n^2) rate (nested-loop join cost).
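
The growth argument can be made concrete with a tiny Python calculation. It is a deliberately crude model under stated assumptions: constant factors, UDF overhead, and I/O are all ignored, the logarithm base is chosen arbitrarily as 2, and the function name is invented for the example.

```python
import math


def cost_ratio(n, join="sort-merge"):
    """Ratio of an idealised join cost to an idealised scan cost for n tuples."""
    scan = n  # O(n) scan, as for the join-free queries
    if join == "sort-merge":
        joined = n * math.log2(n)  # O(n log n)
    else:
        joined = n * n  # O(n^2) nested-loop join
    return joined / scan


for n in (1_000, 2_000, 4_000, 8_000):
    print(n, round(cost_ratio(n), 1), cost_ratio(n, "nested-loop"))
```

In this model the ratio equals log2(n) for a sort-merge join and n for a nested-loop join, so it keeps growing as the data set is doubled, which is consistent with the join-free plan pulling ahead on the larger data sets.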

45
Summary

New algorithm: XORator
New data type: XADT
Outperforms the Hybrid algorithm due to fewer joins
Future work: implementation and evaluation of UDF

46
Conclusion

We presented some efficient models for storing
and querying XML documents
Monet XML Model
XORator Algorithm
There is still a lot of work that needs to be
done in order to bridge the gap between the
structured web databases and semi-structured XML
documents
