
1
Efficient Relational Storage and Retrieval of XML
Documents
  • Jill Chen, Mojdeh Makabi
  • CS240B

2
References
  • Kanda Runapongsa and Jignesh M. Patel. Storing
    and Querying XML Data in Object-Relational DBMSs.
    In A.B. Chaudhri et al. (Eds.), EDBT 2002 Workshops,
    LNCS 2490, pp. 266-285, 2002.
  • H. Liefke and D. Suciu. XMill: An Efficient
    Compressor for XML Data. In Proceedings of the
    ACM SIGMOD International Conference on Management
    of Data, pp. 153-164, Dallas, Texas, May 2000.
  • C. Kanne and G. Moerkotte. Efficient Storage of
    XML Data. ICDE 2000. Available at
    http://citeseer.nj.nec.com/kanne99efficient.html
  • Albrecht Schmidt, Martin Kersten, Menzo
    Windhouwer, and Florian Waas. Efficient
    Relational Storage and Retrieval of XML
    Documents. WebDB 2000. Available at
    http://www.research.att.com/conf/webdb2000/program.html

3
XML
  • XML assumes the role of the standard data
    exchange format in Web database environments
  • XML is semi-structured; one consequence is that
    we cannot expect all instances of one type to
    share the same structure
  • Modeling issues arise from the mismatch between
    semi-structured data on the one hand and fully
    structured database schemas on the other
  • To make XML the language of Web databases, there
    should be effective tools for the management of
    the XML documents

4
Monet XML Model
  • Proposed in "Efficient Relational Storage and
    Retrieval of XML Documents" (Schmidt et al.,
    WebDB 2000)
  • The data model is based on the notion of binary
    associations
  • It decomposes XML documents into small, flexible
    and semantically homogeneous units
  • It supports efficient query processing

5
XML documents and Syntax Tree
<bibliography>
  <article key="BB88">
    <author>Ben Bit</author>
    <title>How to Hack</title>
  </article>
  <article key="BK99">
    <editor>Ed Itor</editor>
    <author>Bob Byte</author>
    <author>Ken Key</author>
    <title>Hacking RSI</title>
  </article>
</bibliography>
6
Main Question
  • The question central to querying XML documents is
    how to store the syntax tree as a database
    instance that supports efficient retrieval

7
Different Approaches
  • The tree could be stored in a single database
    table
  • This makes querying expensive
  • It forces scans over large amounts of data
    irrelevant to a query
  • Even with few joins, large data volumes may have
    to be processed
  • Alternatively, the tree could be stored by placing
    all associations of the same type in the same
    binary relation
  • This is the approach used in the Monet XML Model

8
Monet XML Model
  • The basis for the Monet XML Model
  • Paths
  • Associations
  • Binary Relations

9
Path
  • For a node o in the syntax tree, its path is the
    sequence of labels along the path (vertex and
    edge labels) from the root to o
  • The path describes the position of the element in
    the graph relative to the root node

10
Associations
  • A pair (o, ·) ∈ oid × (oid ∪ string) is called an
    association
  • The different types of associations describe
    different parts of the tree
  • Associations of type oid × oid represent edges
  • Associations of type oid × string represent
    attribute values

11
Binary Relation
  • To transform an XML document into the Monet
    Model, we need the set of binary relations
    that contains all associations between nodes
  • All associations of the same type are stored in
    the same binary relation
  • Example: the relation for the association between
    bibliography and article contains
    (O1, O2), (O1, O7)
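
The decomposition into paths, associations and binary relations can be sketched in Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `monet_transform`, the oid numbering in document order (`o1`, `o2`, ...), the `/`-separated path notation, and the `cdata` / `[string]` naming for character data are all choices made for this sketch.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

DOC = """<bibliography>
  <article key="BB88">
    <author>Ben Bit</author>
    <title>How to Hack</title>
  </article>
  <article key="BK99">
    <editor>Ed Itor</editor>
    <author>Bob Byte</author>
    <author>Ken Key</author>
    <title>Hacking RSI</title>
  </article>
</bibliography>"""


def monet_transform(xml_text):
    """Group all associations by path into binary relations (hypothetical helper)."""
    relations = defaultdict(list)
    counter = 0

    def new_oid():
        nonlocal counter
        counter += 1
        return f"o{counter}"

    def visit(elem, path, oid):
        # oid x string associations: attribute values
        for name, value in elem.attrib.items():
            relations[f"{path}[{name}]"].append((oid, value))
        # Character data gets its own cdata node (assumption of this sketch);
        # text that follows a child element (mixed content) is ignored here.
        text = (elem.text or "").strip()
        if text:
            cdata = new_oid()
            relations[f"{path}/cdata"].append((oid, cdata))
            relations[f"{path}/cdata[string]"].append((cdata, text))
        # oid x oid associations: parent-child edges
        for child in elem:
            child_oid = new_oid()
            child_path = f"{path}/{child.tag}"
            relations[child_path].append((oid, child_oid))
            visit(child, child_path, child_oid)

    root = ET.fromstring(xml_text)
    visit(root, root.tag, new_oid())
    return dict(relations)


relations = monet_transform(DOC)
print(relations["bibliography/article"])  # [('o1', 'o2'), ('o1', 'o7')]
print(len(relations))                     # 11
```

Under these numbering assumptions the bibliography/article relation reproduces the (O1, O2), (O1, O7) pairs, and the sample document yields 11 relations.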
12
Monet Transformation
13
Query
Show Ben Bit's publications whose titles contain
the word "Hack"
14
Single Database Table
<bibliography>
  <article key="BB88">
    <author>Ben Bit</author>
    <title>How to Hack</title>
  </article>
  <article key="BK99">
    <editor>Ed Itor</editor>
    <author>Bob Byte</author>
    <author>Ken Key</author>
    <title>Hacking RSI</title>
  </article>
</bibliography>

SELECT * FROM bibliography
WHERE Author = 'Ben Bit' AND Title LIKE '%Hack%'

Key   Author    Title        Editor
BB88  Ben Bit   How to Hack  NULL
BK99  Bob Byte  Hacking RSI  Ed Itor
BK99  Ken Key   Hacking RSI  Ed Itor
  • Disadvantages
  • Scans over large amounts of data
  • Large data volumes may have to be processed even
    for few joins
  • NULL values must be added to cover irregularities
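
The single-table approach can be tried out with Python's built-in `sqlite3` module. This is only a sketch: SQLite stands in for whatever DBMS the slides had in mind, and the column name `article_key` is an assumption chosen to avoid the SQL keyword KEY.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE bibliography "
    "(article_key TEXT, author TEXT, title TEXT, editor TEXT)"
)
rows = [
    ("BB88", "Ben Bit", "How to Hack", None),        # no editor: stored as NULL
    ("BK99", "Bob Byte", "Hacking RSI", "Ed Itor"),  # one row per author, so
    ("BK99", "Ken Key", "Hacking RSI", "Ed Itor"),   # title and editor repeat
]
conn.executemany("INSERT INTO bibliography VALUES (?, ?, ?, ?)", rows)

# Ben Bit's publications whose titles contain the word "Hack"
hits = conn.execute(
    "SELECT article_key, title FROM bibliography "
    "WHERE author = ? AND title LIKE ?",
    ("Ben Bit", "%Hack%"),
).fetchall()
null_editors = conn.execute(
    "SELECT COUNT(*) FROM bibliography WHERE editor IS NULL"
).fetchone()[0]

print(hits)          # [('BB88', 'How to Hack')]
print(null_editors)  # 1
```

The query has to look at every row of the table, and the irregular article (no editor) shows up as a NULL, which are the disadvantages listed above.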

15
Monet XML Model
  • Results in a higher degree of fragmentation
  • In our example, we have 11 tables
  • Paths are used to group semantically related
    associations into the same relation
  • No need to scan the entire document
  • No novel storage-level features are needed to
    cope with the irregularities induced by the
    semi-structured nature of XML
  • The complete decomposition is linear in the size
    of the document
  • Memory requirements are linear in the height of
    the syntax tree
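
To illustrate why path-based grouping avoids scanning the whole document, here is a hedged Python sketch of the sample query (Ben Bit's publications whose titles contain "Hack") over binary relations. The relation contents are written out by hand, assuming oids are numbered in document order and character data is held in `cdata` nodes; the helper names are hypothetical.

```python
A = "bibliography/article/author"
T = "bibliography/article/title"

# Only the six relations the query touches; the others are never read.
R = {
    A: [("o2", "o3"), ("o7", "o10"), ("o7", "o12")],
    A + "/cdata": [("o3", "o4"), ("o10", "o11"), ("o12", "o13")],
    A + "/cdata[string]": [("o4", "Ben Bit"), ("o11", "Bob Byte"), ("o13", "Ken Key")],
    T: [("o2", "o5"), ("o7", "o14")],
    T + "/cdata": [("o5", "o6"), ("o14", "o15")],
    T + "/cdata[string]": [("o6", "How to Hack"), ("o15", "Hacking RSI")],
}


def parents_of(relation, children):
    """Follow associations upward: oids whose child is in `children`."""
    return {p for p, c in relation if c in children}


def children_of(relation, parents):
    """Follow associations downward: oids whose parent is in `parents`."""
    return {c for p, c in relation if p in parents}


def articles_by(author_name):
    cdata = {o for o, s in R[A + "/cdata[string]"] if s == author_name}
    authors = parents_of(R[A + "/cdata"], cdata)
    return parents_of(R[A], authors)


def titles_containing(articles, word):
    titles = children_of(R[T], articles)
    cdata = children_of(R[T + "/cdata"], titles)
    return sorted(s for o, s in R[T + "/cdata[string]"] if o in cdata and word in s)


result = titles_containing(articles_by("Ben Bit"), "Hack")
print(result)  # ['How to Hack']
```

The editor relations and the key attribute relation play no part in the evaluation, so their data is never touched.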

16
Quantitative Assessments
  • Database size
  • The resulting size of the decomposed database is
    a critical issue
  • In the worst case, the size of the path summary
    can be linear in the size of the documents, if
    the documents are completely unstructured
  • In practical applications, there are generally
    large structured portions
  • The Monet XML version of the ACM Anthology is
    smaller than the original documents
  • The reduction is due to the removal of redundantly
    occurring character data and the removal of tags
Documents            Size in XML  Size in Monet XML  Tables  Loading
ACM Anthology        46.6 MB      44.2 MB            187     30.4 s
Shakespeare's Plays  7.9 MB       8.2 MB             95      4.5 s
17
Comparison of Response Times
  • Comparing Monet XML against SYU/Postgres
  • SYU stores all data in a single table and has to
    scan this data repeatedly
  • The Monet transform yields smaller data volumes
  • We use a set of 10 queries on Shakespeare's
    plays
  • The substantial difference in response times shows
    that Monet XML outperforms the competitor by up to
    two orders of magnitude

Response times (ms)
           Q1   Q2   Q3   Q4   Q5   Q6   Q7   Q8   Q9    Q10
Monet XML  1.2  5.6  6.8  8.0  4.4  4.9  5.0  5.0  8.8   12.7
SYU        150  180  160  180  190  340  350  370  1300  1040
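
The "up to two orders of magnitude" claim can be checked with a few lines of Python arithmetic on the figures in the table. The only assumption is that every entry is in milliseconds, since the unit is printed only in the Q1 column.

```python
monet_ms = [1.2, 5.6, 6.8, 8.0, 4.4, 4.9, 5.0, 5.0, 8.8, 12.7]
syu_ms = [150, 180, 160, 180, 190, 340, 350, 370, 1300, 1040]

# Speedup of Monet XML over SYU, per query
speedups = {
    f"Q{i}": syu / monet
    for i, (monet, syu) in enumerate(zip(monet_ms, syu_ms), start=1)
}
best = max(speedups, key=speedups.get)
worst = min(speedups, key=speedups.get)

for query, factor in speedups.items():
    print(f"{query}: {factor:.1f}x")
print("largest speedup:", best)    # Q9
print("smallest speedup:", worst)  # Q4
```

The speedups range from roughly 22x (Q4) to roughly 148x (Q9), consistent with the two-orders-of-magnitude statement for the best cases.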
18
Summary
  • Presented a data model for efficient processing
    of XML documents
  • The experiments show that it is worth taking the
    plunge and fully decomposing XML documents into
    binary associations
  • This approach combines the elegance of clear
    semantics with a highly efficient execution model
    by means of a simple and effective mapping
    between XML documents and a relational schema

19
XORator: Object-Relational DBMSs
20
Two Dominating Approaches
  • Use a native XML database engine for storing and
    querying data sets
  • Provides a more natural data model and query
    language for XML data (hierarchical or graph
    representation)
  • Map the XML data and queries to constructs
    provided by a relational DBMS (RDBMS)
  • XML data is mapped to relations; queries on XML
    data are converted into SQL queries

21
RDBMS
  • Advantages
  • The user is not involved in the complexity of the
    mapping
  • It can be used for querying both XML data and
    data that already exists in relational systems
  • Disadvantages
  • It can lower performance, since a mapping from XML
    data to relational data may produce a
    database schema with many relations
  • Queries on XML data, when translated to SQL,
    may have many joins, making them expensive to
    evaluate

22
In the Paper
  • Object-Relational DBMS (ORDBMS)
  • Has all the advantages of an RDBMS
  • More expressive type system than RDBMS
  • Better suited for XML documents that may use a
    richer set of data types
  • XORator Algorithm
  • Uses Document Type Definitions (DTDs) to map XML
    documents to tables in ORDBMS
  • New XML data type: XADT (XML Abstract Data Type)

23
Storing XML Documents in an ORDBMS: Reducing DTD
Complexity
  • Apply transformations to reduce the number of
    nested expressions and the number of element
    items, making the mapping process easier
  • Flattening (to convert a nested definition into a
    flat representation): (e1, e2)* → e1*, e2*
  • Simplification (to reduce multiple unary
    operators into a single unary operator): e1** →
    e1*
  • Grouping (to group subelements that have the same
    name): e0, e1*, e1*, e2 → e0, e1*, e2
  • e+ → e*
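
The three kinds of transformation can be sketched in Python on a toy content model. Everything here is an assumption made for illustration: the tuple representation, the function names, the use of the standard DTD occurrence operators `*`, `?` and `+`, and the rule that a name occurring more than once is grouped as `name*`. Only sequences are handled, not choice (`|`).

```python
# A content model item is (name_or_children, ops):
#   ("e1", "*")                      -> e1*
#   ([("e1", ""), ("e2", "")], "*")  -> (e1, e2)*


def simplify(ops):
    """Collapse a run of unary operators into a single one; '+' becomes '*'."""
    if "*" in ops or "+" in ops:
        return "*"
    return "?" if "?" in ops else ""


def flatten(item, outer=""):
    """Distribute the operators of a nested group over its members."""
    content, op = item
    ops = op + outer
    if isinstance(content, str):
        return [(content, ops)]
    flat = []
    for child in content:
        flat.extend(flatten(child, ops))
    return flat


def group(items):
    """Merge subelements with the same name; a repeated name becomes name*."""
    merged = {}
    for name, op in items:
        merged[name] = "*" if name in merged else op
    return list(merged.items())


def reduce_dtd(item):
    flat = [(name, simplify(ops)) for name, ops in flatten(item)]
    return group(flat)


print(reduce_dtd(([("e1", ""), ("e2", "")], "*")))
# [('e1', '*'), ('e2', '*')]
print(reduce_dtd(("e1", "**")))
# [('e1', '*')]
print(reduce_dtd(([("e0", ""), ("e1", "*"), ("e1", "*"), ("e2", "")], "")))
# [('e0', ''), ('e1', '*'), ('e2', '')]
```

Applying flattening first, then simplification, then grouping is one possible ordering; the slides do not state the order.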

24
Reducing DTD Complexity (cont.)
25
Storing XML Documents in an ORDBMS: Building a
DTD Graph
26
Storing XML Documents in an ORDBMS: XORator
  • XML to OR Translator
  • Algorithm builds on Hybrid Algorithm
  • If a non-leaf node N has exactly one parent, and
    if there are no links incident on any of the
    descendants of this node, then node N is assigned
    to an XADT attribute. (If node N were assigned to
    a relation, then queries on this node and its
    parent would require a join.)

27
XORator (cont.)
  • If a non-leaf node below a * node is accessed by
    multiple nodes, then it is assigned to a
    relation. (For nodes that are mapped to
    relations, the ancestors of these nodes must also
    be assigned to relations.) E.g. scene
  • If a leaf node is below a * node, then it is
    assigned as an attribute of the XADT. Otherwise,
    it is assigned as an attribute of string type.
    E.g. line

28
XORator (cont.)
29
Storing XML Documents in an ORDBMS: Defining an
XML Data Type
  • Compressed representation for the XML fragment
  • Element tag names are mapped to integer codes,
    which replace the tags in the stored fragment
  • A small dictionary is stored along with the XML
    fragment to record the mapping between the
    integer codes and the actual element tag names
  • Compression is used only if the space saving
    is above a certain threshold value
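
A rough Python sketch of the tag-dictionary idea follows. It is not the XADT storage format: the function names, the regular expressions, the way the dictionary size is estimated, and the choice of 0.20 as the default threshold (taken from the 20% figure in the performance evaluation) are assumptions of this sketch.

```python
import re

TAG = re.compile(r"<(/?)([A-Za-z_][\w.-]*)([^<>]*)>")
CODE = re.compile(r"<(/?)(\d+)([^<>]*)>")


def compress_tags(fragment):
    """Replace element tag names by integer codes; return body and dictionary."""
    codes = {}

    def encode(match):
        slash, name, rest = match.groups()
        code = codes.setdefault(name, len(codes) + 1)
        return f"<{slash}{code}{rest}>"

    body = TAG.sub(encode, fragment)
    return body, {code: name for name, code in codes.items()}


def decompress_tags(body, dictionary):
    """Restore the original tag names from the integer codes."""
    return CODE.sub(
        lambda m: f"<{m.group(1)}{dictionary[int(m.group(2))]}{m.group(3)}>", body
    )


def choose_storage(fragment, threshold=0.20):
    """Use the compressed form only if it saves at least `threshold` of the space."""
    body, dictionary = compress_tags(fragment)
    # Crude size estimate for the dictionary: code, name, two separator bytes.
    dict_size = sum(len(str(c)) + len(n) + 2 for c, n in dictionary.items())
    saving = 1 - (len(body) + dict_size) / len(fragment)
    if saving >= threshold:
        return "compressed", body, dictionary
    return "plain", fragment, {}


speech = "<SPEECH><SPEAKER>HAMLET</SPEAKER><LINE>To be</LINE></SPEECH>"
body, dictionary = compress_tags(speech)
print(body)        # <1><2>HAMLET</2><3>To be</3></1>
print(dictionary)  # {1: 'SPEECH', 2: 'SPEAKER', 3: 'LINE'}
```

For a single short fragment the dictionary overhead eats most of the saving, so `choose_storage(speech)` keeps the plain form; a fragment with many repeated tags crosses the threshold.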

30
Defining an XML Data Type (XADT) (cont.)
  • Methods on the XADT
  • XADT getElm(XADT inXML, VARCHAR rootElm, VARCHAR
    searchElm, VARCHAR searchKey, INTEGER level)
  • INTEGER findKeyInElm(XADT inXML, VARCHAR
    searchElm, VARCHAR searchKey)
  • XADT getElmIndex(XADT inXML, VARCHAR parentElm,
    VARCHAR childElm, INTEGER startPos, INTEGER
    endPos)

31
Defining an XML Data Type (XADT) (cont.)
32
Defining an XML Data Type (XADT) (cont.)
  • Unnest Operator
  • Required when a query needs to examine individual
    elements in the set.
  • E.g. A distinct list of all speakers who speak in
    at least one play.
  • Implemented using a table User-Defined Function
    (UDF).
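
The effect of an unnest operator can be imitated in Python with a generator that turns one XML fragment into one value per element. This is a sketch of the idea only, not the table UDF from the paper; the `unnest` function, the wrapper element and the sample rows are hypothetical.

```python
import xml.etree.ElementTree as ET


def unnest(fragment, tag):
    """Yield the text of every `tag` element inside an XML fragment."""
    # A fragment may have several top-level elements, so wrap it first.
    root = ET.fromstring(f"<wrapper>{fragment}</wrapper>")
    for elem in root.iter(tag):
        yield elem.text or ""


# Hypothetical rows: (play, fragment holding a set of SPEAKER elements)
speech_rows = [
    ("Hamlet", "<SPEAKER>HAMLET</SPEAKER><SPEAKER>HORATIO</SPEAKER>"),
    ("Hamlet", "<SPEAKER>HAMLET</SPEAKER>"),
    ("Macbeth", "<SPEAKER>MACBETH</SPEAKER>"),
]

# A distinct list of all speakers who speak in at least one play
distinct_speakers = sorted(
    {speaker for _, fragment in speech_rows for speaker in unnest(fragment, "SPEAKER")}
)
print(distinct_speakers)  # ['HAMLET', 'HORATIO', 'MACBETH']
```

Without unnesting, the two-speaker fragment would count as a single value and HORATIO could not be listed separately.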

33
Defining an XML Data Type (XADT): Unnest
Operator (cont.)
34
Performance Evaluation
  • Parse a few randomly chosen sample documents to
    obtain the storage sizes in both the uncompressed
    and compressed cases. The compressed format is
    chosen only if it reduces the storage space by at
    least 20%

35
Performance: Shakespeare Plays
  • The XORator algorithm chooses not to use the
    compressed storage alternative.
  • The size of the database produced by the XORator
    algorithm is about 60% of the size of the
    database produced by the Hybrid algorithm.

36
Performance: Larger Data Set
  • Took the original Shakespeare data set and loaded
    it multiple times, producing data sets that were
    two, four and eight times the original database
    size (DSx2, DSx4, and DSx8).
  • Query sets
  • QS1 (flattening): list speakers and the lines
    that they speak
  • QS2 (full path expression): retrieve the lines
    that have the keyword "Rising" in the text of the
    stage direction
  • QS3: selection
  • QS4: multiple selections
  • QS5: twig with selection
  • QS6: order access

37
Performance: Larger Data Set
  • Much shorter loading times with XORator than with
    Hybrid
  • Significantly better execution times for all
    queries except QS6
  • All queries require at least one fewer join
  • QS6 is slower because, with the XORator algorithm,
    the database must scan the XADT attribute to
    extract elements in the specified order, while
    the Hybrid database only needs to read the value
    of the element order attribute

38
Performance: SIGMOD Proceedings Data Set
  • Deep DTD: representative of the worst-case
    scenario for the XORator algorithm.
  • The compressed storage alternative is used; it
    reduces the database size by about 38%.
  • The size of the database produced by the XORator
    algorithm is about 65% of the size of the
    database produced by the Hybrid algorithm

39
Performance: Larger Data Set
  • Took the original SIGMOD Proceedings data set and
    loaded it multiple times, producing data sets
    that were two, four and eight times the original
    database size (DPx2, DPx4, and DPx8).
  • Query sets
  • QG1 (selection and extraction): retrieve the
    authors of the papers with the keyword "join" in
    the paper title
  • QG2 (flattening): list all authors and the names
    of the proceedings sections in which their papers
    appear
  • QG3: flattening with selection
  • QG4: aggregation
  • QG5: aggregation with selection
  • QG6: order access with selection

40
Performance: Larger Data Set
  • When the data size is small (DPx1 and DPx2),
    the XORator algorithm performs worse than the
    Hybrid algorithm.
  • When the data size becomes large (DPx4 and
    DPx8), the XORator algorithm outperforms the
    Hybrid algorithm.
  • XORator queries need no table joins, but each
    query makes 4 to 8 UDF calls to extract
    subelements or to join elements inside an XADT
    attribute.

41
Analysis
  • The cost of invoking UDFs is a significant
    component of query evaluation cost for the
    XORator algorithm.
  • Does a UDF incur a higher performance penalty than
    an equivalent built-in function?
  • Implement two string functions, length and
    substring, both as UDFs and as built-in
    functions, and test the following queries.
  • QT1: return the length of the string in the
    SPEAKER attribute.
  • QT2: return the substring of the string in the
    SPEAKER attribute from the fifth position to the
    last position.

42
Analysis (cont.)
  • Using UDFs is about 40% more expensive than using
    built-in functions.

43
Analysis (cont.)
  • Invoking UDFs is expensive because:
  • XADT methods use string compare and copy
    functions on VARCHAR, which sometimes requires
    scanning a large amount of data.
  • Possible remedy: associate metadata with each
    XADT attribute to quickly access the starting
    position of each element.
  • The cost of evaluating a UDF is higher than that
    of an equivalent built-in function.
  • Possible remedy: implement XADT as a native data
    type

44
Performance
  • As the data size increases, the ratio of the
    Hybrid response time to the XORator response time
    rises above 1.
  • Queries using the XORator algorithm have no joins,
    so the response time grows at an O(n) rate
    (scan cost), where n is the number of tuples
  • Queries using the Hybrid algorithm have many
    joins, so the response time grows at either an
    O(n log n) rate (sort-merge join cost) or an
    O(n^2) rate (nested-loop join cost).
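
The growth argument can be made concrete with a small Python calculation. It uses idealized unit-cost models (n for a scan, n log2 n for a sort-merge join, n^2 for a nested-loop join) and hypothetical tuple counts that double at each step, mirroring the x1, x2, x4, x8 data sets; the numbers are not measurements.

```python
import math


def scan_cost(n):
    return n                 # O(n): one pass over the tuples


def sort_merge_cost(n):
    return n * math.log2(n)  # O(n log n)


def nested_loop_cost(n):
    return n ** 2            # O(n^2)


sizes = [1_000, 2_000, 4_000, 8_000]  # hypothetical tuple counts
ratios = [sort_merge_cost(n) / scan_cost(n) for n in sizes]

for n, ratio in zip(sizes, ratios):
    print(f"n={n}: join/scan cost ratio = {ratio:.2f}")
```

Under these models the ratio of join-based cost to scan-based cost keeps growing with n, which is the trend the slide describes once fixed overheads such as UDF invocation are outweighed.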

45
Summary
  • New algorithm XORator
  • New data type XADT
  • Outperforms the Hybrid algorithm due to fewer
    joins
  • Future work: implementation and evaluation of UDF

46
Conclusion
  • We presented some efficient models for storing
    and querying XML documents
  • Monet XML Model
  • XORator Algorithm
  • There is still a lot of work that needs to be
    done in order to bridge the gap between the
    structured web databases and semi-structured XML
    documents