Title: Efficient Relational Storage and Retrieval of XML Documents
1Efficient Relational Storage and Retrieval of XML
- Jill Chen, Mojdeh Makabi
- CS240B
- Kanda Runapongsa and Jignesh M. Patel. Storing and Querying XML Data in Object-Relational DBMSs. In A.B. Chaudhri et al. (Eds.), EDBT 2002 Workshops, LNCS 2490, pp. 266-285, 2002.
- H. Liefke and D. Suciu. XMill: an Efficient Compressor for XML Data. In Proceedings of the ACM SIGMOD International Conference on Management of Data, pp. 153-164, Dallas, Texas, May 2000.
- C. Kanne and G. Moerkotte. Efficient Storage of XML Data. ICDE 2000. available at
- Albrecht Schmidt, Martin Kersten, Menzo Windhouwer, and Florian Waas. Efficient Relational Storage and Retrieval of XML Documents. WebDB 2000. available at
- XML assumes the role of the standard data exchange format in Web database environments
- XML is semi-structured; one consequence is that we cannot expect all instances of one type to share the same structure
- Modeling issues arise from the mismatch between semi-structured data on the one hand and fully structured database schemas on the other
- To make XML the language of Web databases, there must be effective tools for managing XML documents
4Monet XML Model
- Efficient Relational Storage and Retrieval of XML Documents
- The data model is based on the notion of binary associations
- It decomposes XML documents into small, flexible and semantically homogeneous units
- It is very efficient
5XML documents and Syntax Tree
<bibliography>
  <article key="BB88">
    <author>Ben Bit</author>
    <title>How to Hack</title>
  </article>
  <article key="BK99">
    <editor>Ed Itor</editor>
    <author>Bob Byte</author>
    <author>Ken Key</author>
    <title>Hacking RSI</title>
  </article>
</bibliography>
6Main Question
- The question central to querying XML documents is how to store the syntax tree as a database instance that provides efficient retrieval capabilities
7Different Approaches
- The tree could be stored in a single database table
- This makes querying expensive
- It forces scans over large amounts of data irrelevant to a query
- Even with few joins, large data volumes may have to be processed
- The tree could be stored by putting all associations of the same type in the same binary relation
- This is the approach used in the Monet XML Model
8Monet XML Model
- The basis for the Monet XML Model
- Paths
- Associations
- Binary Relations
- For a node o in the syntax tree, its path is the sequence of labels (vertex and edge labels) along the path from the root to o
- The path describes the position of the element in the graph relative to the root node
- A pair (o, ·) ∈ oid × (oid ∪ string) is called an association
- The different types of associations describe different parts of the tree
- Associations of type oid × oid represent edges
- Associations of type oid × string represent attribute values
11Binary Relation
- To transform an XML document into the Monet model, we need the set of binary relations that contain all associations between nodes
- Store all associations of the same type in the same binary relation
- Example: for the association between bibliography and article, the relation is {(o1, o2), (o1, o7)}
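The decomposition into path-named binary relations can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the helper name `decompose`, the `/`-separated relation names, the `cdata` node for character data, and the document-order oid numbering (`o1`, `o2`, ...) are assumptions chosen so that the bibliography example yields the pairs shown above.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict
from itertools import count

DOC = """<bibliography>
  <article key="BB88">
    <author>Ben Bit</author>
    <title>How to Hack</title>
  </article>
  <article key="BK99">
    <editor>Ed Itor</editor>
    <author>Bob Byte</author>
    <author>Ken Key</author>
    <title>Hacking RSI</title>
  </article>
</bibliography>"""


def decompose(xml_text):
    """Return {path: [(oid, oid_or_string), ...]} for one XML document."""
    relations = defaultdict(list)
    next_oid = count(1)

    def visit(elem, oid, path):
        # Attribute values: associations of type oid x string.
        for name, value in elem.attrib.items():
            relations[f"{path}/@{name}"].append((oid, value))
        # Character data gets its own node (assumption), then a string association.
        # Text after child elements (mixed content) is ignored in this sketch.
        text = (elem.text or "").strip()
        if text:
            cdata = f"o{next(next_oid)}"
            relations[f"{path}/cdata"].append((oid, cdata))
            relations[f"{path}/cdata/string"].append((cdata, text))
        # Parent-child edges: associations of type oid x oid.
        for child in elem:
            child_oid = f"o{next(next_oid)}"
            child_path = f"{path}/{child.tag}"
            relations[child_path].append((oid, child_oid))
            visit(child, child_oid, child_path)

    root = ET.fromstring(xml_text)
    visit(root, f"o{next(next_oid)}", root.tag)
    return dict(relations)


relations = decompose(DOC)
print(relations["bibliography/article"])  # [('o1', 'o2'), ('o1', 'o7')]
print(len(relations))                      # 11
```

Every association lands in the relation named by its path, so all article edges sit together regardless of how irregular the individual articles are.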
12Monet Transformation
Show Ben Bit's publications whose titles contain the word "Hack"
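A rough sketch of how such a query could be evaluated over path-named binary relations. The relation contents are hard-coded for the bibliography example, and the oids, the `cdata` naming, and the helper `articles_where` are illustrative assumptions rather than the Monet query interface.

```python
# Binary relations for the bibliography example (oids are illustrative).
R = {
    "bibliography/article/author": [("o2", "o3"), ("o7", "o10"), ("o7", "o12")],
    "bibliography/article/author/cdata": [("o3", "o4"), ("o10", "o11"), ("o12", "o13")],
    "bibliography/article/author/cdata/string": [
        ("o4", "Ben Bit"), ("o11", "Bob Byte"), ("o13", "Ken Key"),
    ],
    "bibliography/article/title": [("o2", "o5"), ("o7", "o14")],
    "bibliography/article/title/cdata": [("o5", "o6"), ("o14", "o15")],
    "bibliography/article/title/cdata/string": [
        ("o6", "How to Hack"), ("o15", "Hacking RSI"),
    ],
}


def articles_where(relations, child, predicate):
    """Oids of articles having a `child` element whose text satisfies `predicate`."""
    base = f"bibliography/article/{child}"
    # Walk upwards: string -> cdata node -> element -> article.
    cdata_hits = {o for o, s in relations[f"{base}/cdata/string"] if predicate(s)}
    elem_hits = {e for e, c in relations[f"{base}/cdata"] if c in cdata_hits}
    return {a for a, e in relations[base] if e in elem_hits}


by_ben = articles_where(R, "author", lambda s: s == "Ben Bit")
about_hack = articles_where(R, "title", lambda s: "Hack" in s)
result = by_ben & about_hack
print(sorted(result))  # ['o2']
```

Only the author and title relations are read; relations for other paths, such as editor, are never touched.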
14Single Database Table
<bibliography>
  <article key="BB88">
    <author>Ben Bit</author>
    <title>How to Hack</title>
  </article>
  <article key="BK99">
    <editor>Ed Itor</editor>
    <author>Bob Byte</author>
    <author>Ken Key</author>
    <title>Hacking RSI</title>
  </article>
</bibliography>
SELECT * FROM bibliography WHERE Author = 'Ben Bit' AND Title LIKE '%Hack%'

Key   Author    Title        Editor
BB88  Ben Bit   How to Hack  NULL
BK99  Bob Byte  Hacking RSI  Ed Itor
BK99  Ken Key   Hacking RSI  Ed Itor
- Disadvantages
- Scans over large amounts of data
- Even with few joins, large data volumes may have to be processed
- NULL values must be added to cover irregularities
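The single-table approach can be tried out with Python's built-in `sqlite3` module. This is a small sketch under assumptions: the column name `article_key` and the use of SQLite are choices made here for illustration, not part of the original experiment.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE bibliography (article_key TEXT, author TEXT, title TEXT, editor TEXT)"
)
rows = [
    ("BB88", "Ben Bit", "How to Hack", None),        # no editor: padded with NULL
    ("BK99", "Bob Byte", "Hacking RSI", "Ed Itor"),  # one row per author,
    ("BK99", "Ken Key", "Hacking RSI", "Ed Itor"),   # so title and editor repeat
]
conn.executemany("INSERT INTO bibliography VALUES (?, ?, ?, ?)", rows)

# The query from the slide: every row of the table is inspected.
hits = conn.execute(
    "SELECT article_key, title FROM bibliography "
    "WHERE author = ? AND title LIKE ?",
    ("Ben Bit", "%Hack%"),
).fetchall()
print(hits)  # [('BB88', 'How to Hack')]

nulls = conn.execute(
    "SELECT COUNT(*) FROM bibliography WHERE editor IS NULL"
).fetchone()[0]
copies = conn.execute(
    "SELECT COUNT(*) FROM bibliography WHERE article_key = 'BK99'"
).fetchone()[0]
print(nulls, copies)  # 1 2
```

Even in this tiny example the table carries one NULL and one duplicated title and editor, which is the irregularity cost listed above.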
15Monet XML Model
- Results in a higher degree of fragmentation
- In our example, we have 11 tables
- The path is used to group semantically related associations into the same relation.
- No need to scan the entire document
- There is no need to introduce novel features on the storage level to cope with irregularities induced by the semi-structured nature of XML
- The complete decomposition is linear in the size of the documents
- Memory requirements are linear in the height of the syntax tree
16Quantitative Assessments
- Database Size
- The resulting size of the decomposed data is a critical issue
- In the worst case, the size of the path summary can be linear in the size of the documents, if the documents are completely unstructured
- In practical applications, there are generally large structured portions
- The Monet XML version of the ACM Anthology is smaller than the original documents
- The reduction is due to the removal of redundantly occurring character data and the removal of tags
Documents            Size in XML  Size in Monet XML  Tables  Loading
ACM Anthology        46.6 MB      44.2 MB            187     30.4 s
Shakespeare's Plays  7.9 MB       8.2 MB             95      4.5 s
17Comparison of Response Times
- Comparing Monet XML against SYU/Postgres
- SYU stores all data in a single table and has to scan these data repeatedly
- The Monet transform yields smaller data volumes
- We have a set of 10 queries using Shakespeare's plays
- The substantial difference in response times shows that Monet XML outruns the competitor by up to two orders of magnitude
(ms)       Q1   Q2   Q3   Q4   Q5   Q6   Q7   Q8   Q9    Q10
Monet XML  1.2  5.6  6.8  8.0  4.4  4.9  5.0  5.0  8.8   12.7
SYU        150  180  160  180  190  340  350  370  1300  1040
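The "up to two orders of magnitude" claim can be checked directly from the table. The snippet below assumes that every entry is in milliseconds, although the slide states the unit only for Q1.

```python
# Response times from the table above, assumed to be in milliseconds.
monet_ms = [1.2, 5.6, 6.8, 8.0, 4.4, 4.9, 5.0, 5.0, 8.8, 12.7]
syu_ms = [150, 180, 160, 180, 190, 340, 350, 370, 1300, 1040]

speedups = [syu / monet for syu, monet in zip(syu_ms, monet_ms)]
for query, factor in enumerate(speedups, start=1):
    print(f"Q{query}: {factor:.1f}x")

best = max(range(len(speedups)), key=speedups.__getitem__)
worst = min(range(len(speedups)), key=speedups.__getitem__)
print(f"largest speedup:  Q{best + 1} ({speedups[best]:.1f}x)")
print(f"smallest speedup: Q{worst + 1} ({speedups[worst]:.1f}x)")
```

Under that assumption the factors range from about 22x (Q4) to about 148x (Q9), so "two orders of magnitude" describes the best cases rather than every query.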
- Presented a data model for efficient processing of XML documents
- Experience shows that it is worth taking the plunge and fully decomposing XML documents into binary associations
- This approach combines the elegance of clear semantics with a highly efficient execution model by means of a simple and effective mapping between XML documents and a relational schema
19XORator Object-Relational DBMSs
20Two Dominating Approaches
- Use a native XML database engine for storing and querying data sets
- Provides a more natural data model and query language for XML data (hierarchical or graph representation)
- Map the XML data and queries to constructs provided by a relational DBMS (RDBMS)
- XML data is mapped to relations; queries on XML data are converted into SQL queries
- Advantage
- The user is not involved in the complexity of the mapping
- It can be used for querying both XML data and data that already exists in relational systems
- Disadvantage
- It can lower performance, since a mapping from XML data to relational data may produce a database schema with many relations
- Queries on XML data, when translated to SQL, may have many joins, making them expensive to evaluate
22In the Paper
- Object-Relational DBMS (ORDBMS)
- Has all the advantages of an RDBMS
- More expressive type system than an RDBMS
- Better suited for XML documents that may use a richer set of data types
- XORator Algorithm
- Uses Document Type Definitions (DTDs) to map XML documents to tables in an ORDBMS
- New XML data type: XADT (XML Abstract Data Type)
23Storing XML Documents in an ORDBMS: Reducing DTD Complexity
- Apply transformations to reduce the number of nested expressions and the number of element items, making the mapping process easier
- Flattening (to convert a nested definition into a flat representation): (e1, e2)* → e1*, e2*
- Simplification (to reduce multiple unary operators into a single unary operator): e1** → e1*
- Grouping (to group subelements that have the same name): e0, e1*, e1*, e2 → e0, e1*, e2
- e+ → e*
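The three transformations can be sketched on a toy content-model representation. Everything here is an assumption made for illustration: the rules are read with `*` as the repetition operator and `+` treated as `*`, a content model is a list of `(child, ops)` pairs where `child` is an element name or a nested list, and repeated names are merged into a single starred item. It is not the XORator implementation.

```python
def simplify(ops):
    """Simplification: collapse stacked unary operators, e.g. '**' or '+' -> '*'."""
    return "*" if ("*" in ops or "+" in ops) else ""


def flatten(model, outer_ops=""):
    """Flattening: push a group's operator down onto its members."""
    flat = []
    for child, ops in model:
        combined = ops + outer_ops
        if isinstance(child, str):
            flat.append((child, simplify(combined)))
        else:  # nested group: recurse, carrying the operators along
            flat.extend(flatten(child, combined))
    return flat


def group(flat):
    """Grouping: merge subelements with the same name, keeping first position."""
    merged = {}
    for name, op in flat:
        # A name seen more than once becomes a repeated element (assumption).
        merged[name] = "*" if name in merged else op
    return list(merged.items())


def reduce_dtd(model):
    return group(flatten(model))


def show(items):
    return ", ".join(name + op for name, op in items)


nested = [([("e1", ""), ("e2", "")], "*")]                       # (e1, e2)*
stacked = [("e1", "**")]                                         # e1**
repeated = [("e0", ""), ("e1", "*"), ("e1", "*"), ("e2", "")]    # e0, e1*, e1*, e2

print(show(reduce_dtd(nested)))    # e1*, e2*
print(show(reduce_dtd(stacked)))   # e1*
print(show(reduce_dtd(repeated)))  # e0, e1*, e2
```

The optional operator `?` and choice groups are left out of this sketch.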
24Reducing DTD Complexity (cont.)
25Storing XML Documents in an ORDBMS: Building a DTD Graph
26Storing XML Documents in an ORDBMS XORator
- XML to OR Translator
- Algorithm builds on Hybrid Algorithm
- If a non-leaf node N has exactly one parent, and if there are no links incident on any of the descendants of this node, then node N is assigned to an XADT attribute. (If node N were assigned to a relation, then queries on this node and its parent would require a join.)
27XORator (cont.)
- If a non-leaf node below a * node is accessed by multiple nodes, then it is assigned to a relation. (For nodes that are mapped to relations, the ancestors of these nodes must also be assigned to relations.) E.g. scene
- If a leaf node is below a * node, then it is assigned as an XADT attribute. Otherwise, it is assigned as an attribute of string type. E.g. line
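The mapping rules can be restated as a small decision function. This is only a sketch under stated assumptions: "below a node" is read as "below a * (repetition) node", any non-leaf node not covered by the first rule is mapped to a relation, and the node flags are set by hand rather than derived from a real DTD graph. The class `DtdNode`, the function `assign`, and the `title` example are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class DtdNode:
    name: str
    is_leaf: bool
    parent_count: int                 # number of nodes that access this node
    below_star: bool                  # True if the node sits below a * node
    shared_descendant: bool = False   # True if a descendant has other incoming links


def assign(node):
    """Return how a DTD graph node is stored, following the rules above."""
    if node.is_leaf:
        # Leaf below a * node -> XADT attribute, otherwise a plain string attribute.
        return "XADT attribute" if node.below_star else "string attribute"
    if node.parent_count == 1 and not node.shared_descendant:
        # Keeping the subtree inside one XADT value avoids a join with the parent.
        return "XADT attribute"
    # Shared non-leaf nodes (and, by assumption, all remaining cases) become relations.
    return "relation"


examples = [
    DtdNode("scene", is_leaf=False, parent_count=2, below_star=True),
    DtdNode("line", is_leaf=True, parent_count=1, below_star=True),
    DtdNode("title", is_leaf=True, parent_count=1, below_star=False),
]
for node in examples:
    print(node.name, "->", assign(node))
```

The requirement that ancestors of relation nodes are themselves relations is not enforced here; a full version would propagate that upwards through the graph.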
28XORator (cont.)
29Storing XML Documents in an ORDBMS: Defining an XML Data Type
- Compressed representation for the XML fragment
- Element tags are mapped to integer codes, and the tags in the fragment are replaced by these codes.
- A small dictionary is stored along with the XML fragment to record the mapping between the integer codes and the actual element tag names.
- Compression is used only if the space efficiency is above a certain threshold value.
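The tag-dictionary idea can be illustrated with a few lines of Python. This is a hedged sketch, not the XADT storage format: the function names, the handling of attribute-free tags only, the way the dictionary cost is estimated, and the threshold values used in the example are all assumptions.

```python
import re

TAG = re.compile(r"<(/?)([A-Za-z_][\w.-]*)>")   # tags without attributes only
CODE = re.compile(r"<(/?)(\d+)>")


def encode(fragment):
    """Replace element tag names by integer codes; return (encoded, dictionary)."""
    codes = {}

    def repl(match):
        slash, name = match.groups()
        code = codes.setdefault(name, len(codes))  # next free code for a new tag
        return f"<{slash}{code}>"

    encoded = TAG.sub(repl, fragment)
    return encoded, {code: name for name, code in codes.items()}


def decode(encoded, dictionary):
    """Restore the original tag names from the dictionary."""
    return CODE.sub(
        lambda m: f"<{m.group(1)}{dictionary[int(m.group(2))]}>", encoded
    )


def store(fragment, min_saving):
    """Keep the compressed form only if it saves at least `min_saving` (a fraction)."""
    encoded, dictionary = encode(fragment)
    # Rough dictionary cost: tag name + code digits + two separator characters.
    dict_cost = sum(len(name) + len(str(code)) + 2 for code, name in dictionary.items())
    saving = 1 - (len(encoded) + dict_cost) / len(fragment)
    if saving >= min_saving:
        return ("compressed", encoded, dictionary)
    return ("plain", fragment, None)


fragment = (
    "<SPEECH><SPEAKER>HAMLET</SPEAKER>"
    "<LINE>To be, or not to be</LINE>"
    "<LINE>that is the question</LINE></SPEECH>"
)
encoded, dictionary = encode(fragment)
print(encoded)
print(dictionary)
print(store(fragment, min_saving=0.2)[0])
```

On a short fragment the dictionary overhead eats most of the gain, which is why a threshold test before choosing the compressed form makes sense.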
30Defining an XML Data Type (XADT) (cont.)
- Methods on the XADT
- searchElm, VARCHAR searchKey, INTEGER level)
- INTEGER findKeyInElm(XADT inXML, VARCHAR searchElm, VARCHAR searchKey)
- XADT getElmIndex(XADT inXML, VARCHAR parentElm,
31Defining an XML Data Type (XADT) (cont.)
32Defining an XML Data Type (XADT) (cont.)
- Unnest Operator
- Required when a query needs to examine individual elements in the set.
- E.g. a distinct list of all speakers who speak in at least one play.
- Implemented using a table User-Defined Function
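The effect of an unnest operator can be mimicked with a Python generator that turns one stored fragment into one row per element. The function name `unnest`, the regular-expression parsing, and the sample rows are made up for illustration; the real operator is a table UDF inside the database.

```python
import re


def unnest(fragment, tag):
    """Yield the text of each <tag>...</tag> element in an XML fragment string."""
    pattern = re.compile(
        rf"<{re.escape(tag)}>(.*?)</{re.escape(tag)}>", re.S
    )
    for match in pattern.finditer(fragment):
        yield match.group(1)


# Hypothetical speech tuples: (speech id, XADT-like fragment holding the speakers).
speech_rows = [
    (1, "<SPEAKER>BERNARDO</SPEAKER>"),
    (2, "<SPEAKER>FRANCISCO</SPEAKER>"),
    (3, "<SPEAKER>BERNARDO</SPEAKER><SPEAKER>MARCELLUS</SPEAKER>"),
]

# Unnesting produces one (speech id, speaker) row per element in the set ...
unnested = [
    (speech_id, speaker)
    for speech_id, fragment in speech_rows
    for speaker in unnest(fragment, "SPEAKER")
]
# ... after which an ordinary DISTINCT can be applied.
distinct_speakers = sorted({speaker for _, speaker in unnested})
print(distinct_speakers)  # ['BERNARDO', 'FRANCISCO', 'MARCELLUS']
```

Without the unnest step, the third row would count as a single value and the two speakers inside it could not be de-duplicated individually.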
33Defining an XML Data Type (XADT): Unnest Operator (cont.)
34Performance Evaluation
- Randomly parse a few sample documents to obtain
the storage space sizes in both uncompressed and
compressed cases. Compressed format is chosen
only if it reduces the storage space by at least
35Performance Shakespeare Plays
- The XORator algorithm chooses not to use the compressed storage alternative.
- The size of the database produced by the XORator algorithm is about 60% of the size of the database produced by the Hybrid algorithm.
36Performance Larger Data Set
- Took the original Shakespeare data set and loaded it multiple times, producing data sets that were two, four and eight times the original database size (DSx2, DSx4, and DSx8).
- Query sets
- QS1 (Flattening): list speakers and the lines that they speak
- QS2 (Full path expression): retrieve the lines that have the keyword "Rising" in the text of the stage direction
- QS3: Selection
- QS4: Multiple selections
- QS5: Twig with selection
- QS6: Order access
37Performance Larger Data Set
- Much shorter loading times
- Significantly better execution times for all queries except QS6
- All queries required at least one fewer join
- QS6 is slower because, with the XORator algorithm, the database must scan the XADT attribute to extract elements in the specified order, while the Hybrid database only needs to extract the value of the element order attribute
38Performance SIGMOD Proceedings Data Set
- Deep DTD, representative of the worst-case scenario for the XORator algorithm.
- The compressed storage alternative is used; it reduces the database size by about 38%.
- The size of the database produced by the XORator algorithm is about 65% of the size of the database produced by the Hybrid algorithm
39Performance Larger Data Set
- Took the original SIGMOD Proceedings data set and loaded it multiple times, producing data sets that were two, four and eight times the original database size (DPx2, DPx4, and DPx8).
- Query sets
- QG1 (Selection and extraction): retrieve the authors of the papers with the keyword "join" in the paper title
- QG2 (Flattening): list all authors and the names of the proceedings sections in which their papers appear
- QG3: Flattening with selection
- QG4: Aggregation
- QG5: Aggregation with selection
- QG6: Order access with selection
40Performance Larger Data Set
- When the data set is small (DPx1 and DPx2), the XORator algorithm performs worse than the Hybrid algorithm.
- When the data set becomes large (DPx4 and DPx8), the XORator algorithm outperforms the Hybrid algorithm.
- The XORator queries need no table joins, but each query makes 4 to 8 UDF calls to extract subelements or to join elements inside an XADT attribute.
- The cost of invoking UDFs is a significant component of query evaluation under the XORator algorithm.
- Does a UDF incur a higher performance penalty than an equivalent built-in function?
- Implement two string functions (length and substring) both as UDFs and as built-in functions, and test the following queries.
- QT1: Return the length of the string in the SPEAKER attribute.
- QT2: Return a substring of the string in the SPEAKER attribute from the fifth position to the last
42Analysis (cont.)
- Using UDFs is about 40% more expensive than using built-in functions.
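A comparable UDF-versus-built-in experiment can be set up with Python's `sqlite3` module. This is an analogy under assumptions, not a reproduction: the engine is SQLite rather than the ORDBMS used in the study, the UDF names `udf_length` and `udf_substr5` and the generated SPEAKER values are invented, and the measured overhead will differ from the figure reported above.

```python
import sqlite3
import time

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE speech (speaker TEXT)")
conn.executemany(
    "INSERT INTO speech VALUES (?)",
    [(f"SPEAKER-{i:06d}",) for i in range(50_000)],
)

# User-defined equivalents of the built-in length() and substr() functions.
conn.create_function("udf_length", 1, lambda s: len(s))
conn.create_function("udf_substr5", 1, lambda s: s[4:])  # fifth character onwards


def timed(sql):
    """Run a query and return (rows, elapsed seconds)."""
    start = time.perf_counter()
    rows = conn.execute(sql).fetchall()
    return rows, time.perf_counter() - start


qt1_builtin, t1_builtin = timed("SELECT length(speaker) FROM speech")
qt1_udf, t1_udf = timed("SELECT udf_length(speaker) FROM speech")
qt2_builtin, t2_builtin = timed("SELECT substr(speaker, 5) FROM speech")
qt2_udf, t2_udf = timed("SELECT udf_substr5(speaker) FROM speech")

print(f"QT1 built-in {t1_builtin:.4f}s, UDF {t1_udf:.4f}s")
print(f"QT2 built-in {t2_builtin:.4f}s, UDF {t2_udf:.4f}s")
```

Both variants return identical rows, so any timing difference comes from the function-call path rather than from the work done.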
43Analysis (cont.)
- Invoking UDFs is expensive because
- XADT methods use string compare and copy functions on VARCHAR. This sometimes requires scanning a large amount of data.
- Associate metadata with each XADT attribute to quickly access the starting position of each element.
- The cost of evaluating a UDF is higher than that of an equivalent built-in function.
- Implement XADT as a native data type
- As the data size increases, the ratios of the response times between the two algorithms become more than 1.
- Queries using the XORator algorithm have no joins, and thus the response time grows at an O(n) rate (scan cost), where n is the number of tuples
- Queries using the Hybrid algorithm have many joins, so the response time grows at either an O(n log n) rate (merge sort join cost) or an O(n^2) rate (nested loop join cost).
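One way to read this growth argument is as a simple cost model. The constants c_X and c_H below are symbols introduced here for illustration (per-tuple scan cost including UDF overhead, and per-tuple join cost); they do not appear in the paper.

```latex
\begin{align*}
T_{\text{XORator}}(n) &= c_X \, n, \\
T_{\text{Hybrid}}(n)  &= c_H \, n \log n, \\
\frac{T_{\text{Hybrid}}(n)}{T_{\text{XORator}}(n)} &= \frac{c_H}{c_X} \, \log n .
\end{align*}
```

The ratio grows without bound as n increases, so even if c_X is larger than c_H (for example because of UDF overhead), the ratio exceeds 1 once log n > c_X / c_H. With a nested loop join the Hybrid cost becomes c_H n^2 and the ratio (c_H / c_X) n grows faster still. This matches the observation that XORator loses on the small data sets and wins on the large ones.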
- New algorithm: XORator
- New data type: XADT
- Outperforms the Hybrid algorithm because it needs fewer joins
- Future work: implementation and evaluation of UDF
- We presented some efficient models for storing and querying XML documents
- Monet XML Model
- XORator Algorithm
- There is still a lot of work to be done to bridge the gap between structured web databases and semi-structured XML