Title: Lecture 24: XML Data Management
1Lecture 24 XML Data Management
Nov. 17, 2006 ChengXiang Zhai
Most slides are from Ning Zhang's presentation: www2.cs.uh.edu/ceick/3480/XML-3480.ppt
2What is XML?
- XML documents have elements and attributes
- Elements (indicated by begin and end tags)
  - can be nested but cannot interleave each other
  - can have an arbitrary number of sub-elements
  - can have free text as values

<chap title="Introduction To XML">
  some free text
  <sect title="What is XML?"> </sect>
  <sect title="Elements"> </sect>
  <sect title="Why XML?"> </sect>
  possibly more free text
</chap>

Labels: begin element (<chap ...>), attribute (title="..."), end element (</chap>)
Elements w/ same name can be nested
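A minimal sketch of how an application might read the chapter example, assuming Python's standard xml.etree.ElementTree module (the slides do not prescribe any particular parser). It shows that the element name, the attributes, the nested sub-elements, and the free text are all directly accessible.

```python
import xml.etree.ElementTree as ET

# The chapter example from the slide, written out as well-formed XML.
DOC = """
<chap title="Introduction To XML">
  some free text
  <sect title="What is XML?"></sect>
  <sect title="Elements"></sect>
  <sect title="Why XML?"></sect>
  possibly more free text
</chap>
"""

root = ET.fromstring(DOC)

# Attributes of the nested <sect> elements, in document order.
section_titles = [sect.get("title") for sect in root.findall("sect")]

print(root.tag, root.attrib)   # element name and its attributes
print(section_titles)          # titles of the nested sections
print(root.text.strip())       # free text before the first sub-element
```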
3XML
<bibliography>
  <book> <title> Foundations </title>
         <author> Abiteboul </author>
         <author> Hull </author>
         <author> Vianu </author>
         <publisher> Addison Wesley </publisher>
         <year> 1995 </year>
  </book>
  ...
</bibliography>

XML describes the content: easy for applications
4Document Type Definitions (DTDs) as Grammars
<!DOCTYPE paper [
  <!ELEMENT paper (section*)>
  <!ELEMENT section ((title,section*) | text)>
  <!ELEMENT title (#PCDATA)>
  <!ELEMENT text (#PCDATA)>
]>

<paper>
  <section> <text> </text> </section>
  <section> <title> </title>
    <section> </section>
    <section> </section>
  </section>
</paper>
XML documents can be nested arbitrarily deep
5XML for Representing Data
Relational table persons:

name | phone
John | 3634
Sue  | 6343
Dick | 6363

[Figure: the same data as an XML tree: a persons root with three row children, each holding a name and a phone]
<persons>
  <row> <name>John</name>
        <phone>3634</phone> </row>
  <row> <name>Sue</name>
        <phone>6343</phone> </row>
  <row> <name>Dick</name>
        <phone>6363</phone> </row>
</persons>
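The mapping from table rows to XML can be sketched in a few lines. This is an illustration only, assuming Python's xml.etree.ElementTree; the helper name rows_to_xml is hypothetical, and the row/column element names follow the slide.

```python
import xml.etree.ElementTree as ET

# Rows of the relational table persons(name, phone) from the slide.
rows = [("John", "3634"), ("Sue", "6343"), ("Dick", "6363")]

def rows_to_xml(table_name, columns, rows):
    """Build <table><row><col>value</col>...</row>...</table>."""
    root = ET.Element(table_name)
    for row in rows:
        row_el = ET.SubElement(root, "row")
        for column, value in zip(columns, row):
            ET.SubElement(row_el, column).text = value
    return root

persons = rows_to_xml("persons", ["name", "phone"], rows)
xml_text = ET.tostring(persons, encoding="unicode")
print(xml_text)
```

Note how the column names name and phone are repeated in every row of the output, which is the self-describing property discussed on the next slide.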
6XML vs Data Models
- XML is self-describing
- Schema elements become part of the data
- Relational schema: persons(name, phone)
- In XML, <persons>, <name>, <phone> are part of the data, and are repeated many times
- Consequence: XML is much more flexible
- XML = semistructured data
7Semi-structured Data Explained
- Missing attributes

<person> <name> John</name> <phone>1234</phone> </person>
<person> <name>Joe</name> </person>    ← no phone!

- Repeated attributes

<person> <name> Mary</name>
         <phone>2345</phone>
         <phone>3456</phone> </person>    ← two phones!
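A small sketch of how an application can cope with missing and repeated attributes, assuming Python's xml.etree.ElementTree. The wrapping <persons> element is an assumption added only to make the three records one well-formed document.

```python
import xml.etree.ElementTree as ET

# The three person records from the slide; <persons> is an assumed wrapper.
DOC = """
<persons>
  <person><name>John</name><phone>1234</phone></person>
  <person><name>Joe</name></person>
  <person><name>Mary</name><phone>2345</phone><phone>3456</phone></person>
</persons>
"""

# Collect zero, one, or many phones per person without a fixed schema.
phones_by_name = {
    person.findtext("name"): [phone.text for phone in person.findall("phone")]
    for person in ET.fromstring(DOC)
}
print(phones_by_name)
```

A relational schema persons(name, phone) would need a NULL for Joe and either a second row or a second table for Mary; here each record simply carries what it has.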
8Semistructured Data Explained
- Attributes with different types in different objects

<person> <name> <first> John </first>
                <last> Smith </last>
         </name>
         <phone>1234</phone> </person>    ← structured name!

- Nested collections
- Heterogeneous collections
  - <db> contains both <book>s and <publisher>s
9Why XML?
- Database side: XML is a new way to organize data
  - Relational databases organize data in tables
  - XML documents organize data in ordered trees
- Document side: XML is a semantic markup language
  - HTML focuses on presentation, while plain text has no structure
  - XML focuses on semantics/structure in the data

[Figure: ordered tree with a chap root and sect nodes below it]

<html>
  <h1> Chapter 1 </h1> some free text
  <h2> Section 1 </h2> some more free text
  <h3> Section 1.1 </h3>
</html>
10Data Management Relational vs. XML
- Relational data are well organized: fully structured (more strict)
  - E-R modeling to model the data structures in the application
  - E-R diagram is converted to relational tables and integrity constraints (relational schemas)
- XML data are semi-structured (more flexible)
  - Schemas may be unfixed or unknown (flexible: anyone can author a document)
  - Suitable for data integration (data on the web, data exchange between different enterprises).
11More about Relational vs. XML
- XML is not meant to replace relational database systems
- RDBMSs are well suited for OLTP applications (e.g., electronic banking), which have 1000 small transactions per minute.
- XML is suitable for data exchange between heterogeneous data sources (e.g., Web services), allowing them to talk to each other.
12Uses of XML
- As document representation language
- XML can be transformed to other formats (e.g., by XSLT)
  - XML → HTML
  - XML → LaTeX, BibTeX
  - XML → PDF
- DocBook (standard schema for authoring documents/books)
13Uses of XML (cont.)
- As data integration and exchange language
- Web services (SOAP, WSDL, UDDI)
  - Amazon.com, eBay, Microsoft MapPoint, ...
- Domain-specific data exchange schemas (>1000)
  - legal document exchange language
  - business information exchange
- RSS: XML news feed
  - CNN, slashdot, blogs, ...
14Uses of XML (cont.)
- In general, appropriate for any data having hierarchical structure
- Email
  - Header: from, to, cc, bcc
  - Body: my message, replied email
- Network log file
  - IP address, time, request type, error code
15Exporting Relational Data to XML
- Product(pid, name, weight)
- Company(cid, name, address)
- Makes(pid, cid, price)
[Figure: E-R diagram in which product and company are connected by the makes relationship]
16Export data grouped by companies
<db>
  <company> <name> GizmoWorks </name>
            <address> Tacoma </address>
            <product> <name> gizmo </name>
                      <price> 19.99 </price>
            </product>
            <product> ... </product>
            ...
  </company>
  <company> <name> Bang </name>
            <address> Kirkland </address>
            <product> <name> gizmo </name>
                      <price> 22.99 </price>
            </product>
            ...
  </company>
  ...
</db>
Redundant representation of products
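The export grouped by companies amounts to a join of the three relations followed by nesting. The sketch below assumes Python's xml.etree.ElementTree and in-memory dictionaries in place of real tables; the ids and the weight value are made up for illustration, and export_by_company is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Product(pid, name, weight), Company(cid, name, address), Makes(pid, cid, price).
# Ids and the weight are invented; names, addresses, and prices follow the slide.
product = {1: ("gizmo", 1.5)}
company = {10: ("GizmoWorks", "Tacoma"), 20: ("Bang", "Kirkland")}
makes = [(1, 10, "19.99"), (1, 20, "22.99")]

def export_by_company(product, company, makes):
    """Nest each company's products (with price) under the company element."""
    db = ET.Element("db")
    for cid, (cname, address) in company.items():
        c = ET.SubElement(db, "company")
        ET.SubElement(c, "name").text = cname
        ET.SubElement(c, "address").text = address
        for pid, made_by, price in makes:
            if made_by == cid:
                p = ET.SubElement(c, "product")
                ET.SubElement(p, "name").text = product[pid][0]
                ET.SubElement(p, "price").text = price
    return db

db = export_by_company(product, company, makes)
product_names = [n.text for n in db.findall("./company/product/name")]
print(product_names)  # the single product row shows up once per company
```

The one row in Product is copied under every company that makes it, which is the redundancy the slide points out.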
17The DTD
<!ELEMENT db (company*)>
<!ELEMENT company (name, address, product*)>
<!ELEMENT product (name,price)>
<!ELEMENT name (#PCDATA)>
<!ELEMENT address (#PCDATA)>
<!ELEMENT price (#PCDATA)>
18Export Data by Products
<db>
  <product> <name> Gizmo </name>
            <manufacturer>
              <name> GizmoWorks </name>
              <price> 19.99 </price>
              <address> Tacoma </address>
            </manufacturer>
            <manufacturer>
              <name> Bang </name>
              <price> 22.99 </price>
              <address> Kirkland </address>
            </manufacturer>
            ...
  </product>
  <product> <name> OneClick </name> ... </product>
</db>
Redundant Representation of companies
19Reminds us of the network data model
20A Data-Integration View of XML
- What should be the underlying data model for DI (data integration) contexts?
  - The relational model is not an ideal choice
- Developed semi-structured data model
  - Started with OEM (object exchange model) in the Lore project
- Then XML came along
  - It is now the most well-known semi-structured data model
  - Generating much research in the DB community
  - Current standards: XMLSchema, XQuery (http://www.w3.org/XML/Query/)
21XML Databases
- Advantages
- Manage large volumes of XML data
- Provide a high-level declarative language
- Efficiently evaluate complex queries
- XML Data Management Issues
- XML Data Model
- XML Query Languages
- XML Query Processing and Optimization
22XML Data Model
- Hierarchical data model
- An XML document is an ordered tree
- Nodes in the tree are labeled with element names.
- Element nesting relationship corresponds to
parent-child relationship
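[Figure: the chapter document as an ordered tree. The root chap has an @title attribute ("Introduction to XML"), a text child ("some free text"), and sect children, each with its own @title (e.g., "What is XML?")]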
23XML Schema Languages
- Schema language defines the structure
- Document Type Definition (DTD)
  - Context-free grammar
  - Structurally richer than relational schema definition language because of recursion.
- XML Schema
  - Also context-free
  - Richer than DTD because of data type definitions (integer, date, sequence).
24XML Query Languages
- XPath
  - 13 axes (navigation directions in the tree)
    - child (/), descendant (//), following-sibling, following
  - NameTest, predicates
  - E.g.,
    doc("bib.xml")//book[title="Harry Potter"]/ISBN
- XQuery (superset of XPath)
  - FLWOR expression
    for $x in doc("bib.xml")//book[title = "Harry Potter"]/ISBN,
        $y in doc("imdb.xml")//movie
    where $y//novel/ISBN = $x
    return $y//title
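To make the two queries concrete, here is a rough Python approximation, assuming the limited XPath subset supported by xml.etree.ElementTree rather than a real XQuery engine. The two tiny documents stand in for bib.xml and imdb.xml and are entirely made up.

```python
import xml.etree.ElementTree as ET

# Made-up stand-ins for bib.xml and imdb.xml.
bib = ET.fromstring("""
<bib>
  <book><title>Harry Potter</title><ISBN>111</ISBN></book>
  <book><title>Foundations</title><ISBN>222</ISBN></book>
</bib>""")
imdb = ET.fromstring("""
<imdb>
  <movie><title>HP Film</title><novel><ISBN>111</ISBN></novel></movie>
  <movie><title>Other Film</title><novel><ISBN>333</ISBN></novel></movie>
</imdb>""")

# XPath: //book[title="Harry Potter"]/ISBN
isbns = [e.text for e in bib.findall(".//book[title='Harry Potter']/ISBN")]

# FLWOR: nested iteration (for), filter (where), projection (return).
titles = [
    t.text
    for x in isbns                                           # for $x in ...
    for y in imdb.findall(".//movie")                        # $y in ...
    if any(i.text == x for i in y.findall(".//novel/ISBN"))  # where ... = $x
    for t in y.findall(".//title")                           # return $y//title
]
print(isbns, titles)
```

The comprehension mirrors the for/where/return clauses one to one, which is also why the query reads like a join between the two documents.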
25Important Problems in XML Data Management
- How to store XML data?
- How to efficiently evaluate XPath/XQuery languages?
  - Efficient physical operators
  - Query optimization
- How to support XML update languages?
- How to support transaction management?
- Recovery management?
26XML Storage
- Extended Relational Storage
- Convert XML documents to relational tables
- Native Storage
- Treat XML elements as first-class citizens
- Hybrid of Relational and Native Storage
- XML documents can be stored in columns of relational tables (XML-typed columns)
27Extended Relational Storage
- Edge-based storage scheme (Florescu and Kossmann 99)
  - Each node has an ID
  - Each tuple in the edge table consists of (parentID, childID, type of data, reference to data)
  - Pro: easy to convert XML to relational tables
  - Con: path queries such as //a//b cannot be answered in SQL without a transitive closure operator
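A minimal sketch of the edge-table idea, assuming Python's built-in sqlite3 module and a simplified tuple (parent, child, tag) instead of the four fields listed above. SQLite's WITH RECURSIVE is used here as the transitive closure operator the slide refers to; the function name shred and the sample document are hypothetical.

```python
import sqlite3
import xml.etree.ElementTree as ET
from itertools import count

def shred(root):
    """Return (parent_id, child_id, tag) tuples; ids follow document order."""
    ids = count(1)
    edges = []
    def visit(node, parent_id):
        node_id = next(ids)
        edges.append((parent_id, node_id, node.tag))
        for child in node:
            visit(child, node_id)
    visit(root, None)
    return edges

doc = ET.fromstring("<r><a><c><b/></c><b/></a><b/></r>")
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE edge (parent INTEGER, child INTEGER, tag TEXT)")
conn.executemany("INSERT INTO edge VALUES (?, ?, ?)", shred(doc))

# //a//b: every b that is a proper descendant of some a.
query = """
WITH RECURSIVE desc_of_a(id) AS (
    SELECT c.child FROM edge AS c JOIN edge AS a ON c.parent = a.child
    WHERE a.tag = 'a'
    UNION
    SELECT e.child FROM edge AS e JOIN desc_of_a AS d ON e.parent = d.id
)
SELECT e.child FROM edge AS e JOIN desc_of_a AS d ON e.child = d.id
WHERE e.tag = 'b' ORDER BY e.child
"""
matches = [row[0] for row in conn.execute(query)]
print(matches)  # ids of the two b nodes below a; the b directly under r is excluded
```

Without the recursive part, each // step would need an unbounded number of self-joins on the edge table, one per possible depth.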
28Extended Relational Storage
- Path-based storage scheme: XRel (Yoshikawa et al. 01)
  - Each node corresponds to a tuple in the table
  - Each tuple keeps a rooted path to the node (e.g., /article/chap/sec/sec/@title)
  - Pro: also easy to convert XML to tables
  - Con: answering path queries, such as //a//b, needs expensive string pattern matching
29Extended Relational Storage
- Node-based storage scheme: Niagara, TIMBER, etc. (Zhang et al. 01)
  - Each node is encoded with begin and end integers.
  - Begin corresponds to the node's position in a pre-order traversal of the tree; end corresponds to its position in a post-order traversal.
  - Pro: checking parent-child/ancestor-descendant relationships is efficient (constant time using begin and end)
  - Con: inefficient for updating XML
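The constant-time ancestor test can be sketched as follows. This is one common variant of the labeling, assumed here for simplicity: a single counter that ticks once when a node is entered (begin) and once when it is left (end). The function names are hypothetical.

```python
import xml.etree.ElementTree as ET
from itertools import count

def label(root):
    """Assign (begin, end) to every element with one counter over a DFS."""
    counter = count(1)
    labels = {}
    def visit(node):
        begin = next(counter)           # tick on entering the node
        for child in node:
            visit(child)
        labels[node] = (begin, next(counter))   # tick on leaving the node
    visit(root)
    return labels

def is_ancestor(a, d):
    """a is an ancestor of d iff a's interval strictly contains d's."""
    return a[0] < d[0] and d[1] < a[1]

doc = ET.fromstring("<a><b><c/></b><d/></a>")
by_tag = {node.tag: interval for node, interval in label(doc).items()}
print(by_tag)
print(is_ancestor(by_tag["a"], by_tag["c"]), is_ancestor(by_tag["b"], by_tag["d"]))
```

The update cost is visible in the sketch too: inserting a node near the front of the document shifts the numbers of everything that follows it.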
30Native Storage
- Subtree partition-based scheme: Natix (Kanne and Moerkotte 00)
  - A large XML tree is partitioned into small subtrees, each of which fits into one disk page
  - Proxy and aggregate nodes are introduced to connect different subtrees
  - Pro: easy to update and traverse
  - Con: complex update algorithm; frequent deletions/additions may degrade the page usage ratio
31Native Storage
- Binary tree-based scheme: Arb (Koch 03)
  - Convert a tree with an arbitrary number of children to a binary tree (first child becomes the left child; next sibling becomes the right child)
  - Tree nodes are stored in document order
  - Each node has 2 bits indicating whether it has a left and a right child
  - Pro: easy to do depth-first search (DFS) traversal
  - Con: inefficient for next_sibling navigation and hard to update
32Native Storage
- String-based scheme: NoK (Zhang 04)
  - Convert a tree to a parenthesized string
  - E.g., a having b and c as children is converted to ab)c)), by a DFS of the tree with ) representing end-of-subtree
  - The tree can be reconstructed from the string
  - A long string can be cut into substrings that fit into disk pages
  - Page headers can contain simple statistics to expedite next_sibling navigation
  - Pro: particularly optimized for DFS navigational evaluation plans
  - Con: inefficient for breadth-first search (BFS)
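The string encoding and its inverse fit in a few lines. This sketch assumes single-character labels, as in the slide's example, and represents a tree as a (label, children) pair; both function names are hypothetical and paging is left out.

```python
def to_string(tree):
    """Emit the label, then each subtree, then ')' as end-of-subtree marker."""
    label, children = tree
    return label + "".join(to_string(child) for child in children) + ")"

def from_string(encoded):
    """Rebuild the (label, children) tree from its parenthesized string."""
    stack = [("", [])]                  # sentinel that will hold the root
    for ch in encoded:
        if ch == ")":
            node = stack.pop()          # subtree finished:
            stack[-1][1].append(node)   # attach it to its parent
        else:
            stack.append((ch, []))      # a label opens a new subtree
    return stack[0][1][0]

tree = ("a", [("b", []), ("c", [])])
encoded = to_string(tree)
print(encoded)                          # ab)c))
print(from_string(encoded) == tree)     # True
```

Reading the string left to right is exactly a DFS, which is why this layout favors navigational plans and penalizes breadth-first access.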
33Hybrid of Relational and Native Storage
- All major commercial RDBMS vendors (IBM, Oracle, Microsoft and Sybase) support an XML type in their RDBMS
- A table can have a column whose type is XML
- When inserting a tuple in the table, the XML field could be an XML document
- XML documents are stored natively
34Hybrid of Relational and Native Storage
- IBM DB2 UDB
  - System RX: XML storage is similar to Natix
- Microsoft SQL Server
  - Uses BLOB (binary large object) to represent XML documents
- Oracle
  - Can use multiple formats
    - CLOB (character large object)
    - Serialized object
    - Shredded relational table
35XML Path Processing
- Extended Relational Approach
  - Translate XML queries to SQL statements
- Native Approach (may be based on extended relational storage)
  - Join-based approach
  - Navigational approach
  - Hybrid approach
36Extended Relational Query Processing
- Regular expression based approach: XRel (Yoshikawa et al. 01)
  - Linear path expressions (without branches) are translated to regular expressions on strings (rooted paths)
  - Use the LIKE predicate in SQL to evaluate regular expressions
  - Pro: easy to implement
  - Con: cannot answer branching path queries
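The translation from a linear path expression to a pattern over rooted paths can be illustrated as below. This is a sketch only: it uses Python's re module in place of SQL LIKE, and path_to_regex is a hypothetical helper, not XRel's actual translation.

```python
import re

def path_to_regex(path_expr):
    """Translate a linear path with / and // steps into a regex over rooted paths."""
    pattern = ""
    for separator, name in re.findall(r"(//|/)([^/]+)", path_expr):
        if separator == "//":
            pattern += r"(?:/[^/]+)*"   # any number of intermediate steps
        pattern += "/" + re.escape(name)
    return re.compile(pattern)

# Rooted paths as they would be stored, one per node (made-up sample).
rooted_paths = ["/r/a/b", "/r/a/c/b", "/r/b", "/r/ab/b", "/r/a/b/c"]

matcher = path_to_regex("//a//b")
hits = [p for p in rooted_paths if matcher.fullmatch(p)]
print(hits)
```

A branching query such as //a[c]//b cannot be phrased this way, because a single rooted path records only one root-to-node chain and says nothing about siblings.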
37Extended Relational Query Processing
- Dynamic Interval based approach: DI (DeHaan et al. 03)
  - Use the node labeling (begin,end) interval storage scheme
  - Dynamically calculate (begin,end) intervals for resulting nodes given a path/FLWOR expression
  - Pro: can handle all types of queries, including FLWOR expressions
  - Con: inefficient for answering complex path queries
38Native Path Query Processing
- Merge-join based approach: Multi-Predicate Merge Join (MPMGJN) algorithm (Zhang et al. 01)
  - Modify the merge join algorithm to reduce unnecessary comparisons
  - Keep the position p of the last successful comparison in the right input stream
  - The next item from the left input stream starts scanning from position p.
39Native Path Query Processing
- Stack-based Structural Join (Wu et al. 02)
  - Improve the MPMGJN algorithm
  - Do not look back, but keep all ancestors in a stack
  - When comparing the new item, just compare it with the top of the stack
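A simplified sketch of the stack idea, assuming both inputs are lists of (begin, end) intervals sorted by begin and taken from the same document, so that any two intervals are either nested or disjoint. It is an illustration of the principle, not the published algorithm in full; structural_join is a hypothetical name.

```python
def structural_join(ancestors, descendants):
    """Return all (a, d) pairs where interval a contains interval d.

    Each input is scanned once; the stack always holds a chain of nested
    candidate ancestors, so only its top needs to be compared.
    """
    result, stack = [], []
    i = 0
    for d in descendants:
        # Push every candidate ancestor that starts before d.
        while i < len(ancestors) and ancestors[i][0] < d[0]:
            a = ancestors[i]
            while stack and stack[-1][1] < a[0]:
                stack.pop()             # ended before a starts: never needed again
            stack.append(a)
            i += 1
        # Drop candidates that ended before d starts.
        while stack and stack[-1][1] < d[0]:
            stack.pop()
        # Everything left on the stack contains d.
        result.extend((a, d) for a in stack)
    return result

a_nodes = [(1, 10), (2, 5)]             # an a element nested inside another a
b_nodes = [(3, 4), (6, 7), (11, 12)]    # the last b lies outside both
pairs = structural_join(a_nodes, b_nodes)
print(pairs)
```

Because neither input is ever rescanned, the work is linear in the input sizes plus the output size, which is the improvement over backing up to position p in MPMGJN.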
40Native Path Query Processing
- Holistic Twig Join (Bruno et al. 02)
  - Improve the stack-based structural join algorithm
  - Use one join algorithm for the whole path expression instead of one join per step
  - Reduce the overhead of producing and storing intermediate results
41Native Path Query Processing
- Natix (Brantner et al. 05)
  - Translate each step into a logical navigational operator: Unnest-Map
  - Each unnest-map operator is translated into a physical operator that performs tree traversal on the Natix storage
  - Physical optimization can be performed on the physical navigational operators to reduce cross-cluster I/O.
42Native Path Query Processing
- IBM DB2 XNav (Josifovski et al. 04)
  - XML path expressions are translated into automata
  - The automaton is constructed dynamically while traversing the XML tree in DFS
  - Physical I/O can be optimized by navigating to next_sibling without traversing the whole subtree
43Native Path Query Processing
- Tree automata (Koch 03)
  - The tree automaton needs two passes over the tree
  - The first traversal runs a bottom-up deterministic tree automaton to determine which states are reachable
  - The second traversal runs a top-down tree automaton to prune the reachable states and compute predicates.
44Hybrid Processing
- BlossomTree (Zhang 04, Zhang 05)
  - The navigational approach is efficient for parent-child navigation
  - The join-based approach is efficient for ancestor-descendant navigation
  - The BlossomTree approach identifies sub-expressions, called Next-of-Kin (NoK), that are efficient for the navigational approach.
  - Use the navigational approach for NoK subexpressions and structural joins to join the intermediate results
45XML Indexing
- Structural Index
  - Clustering tree nodes by their structural similarity
  - Index is a graph, in which each vertex is an equivalence class of similar XML tree nodes
  - Path query evaluation amounts to navigational evaluation on the graph
46Overview of Cost-based Optimization
- Query optimization depends on
  - How much knowledge about the data do we have?
  - How intelligently can we use that knowledge (within a time constraint)?
- The cost of a plan is heavily dependent on
  - The cost model of each operator
  - The cardinality/selectivity of each operator
47Cardinality Estimation
- Full path summarization: DataGuide (Goldman 97) and PathTree (Aboulnaga 01)
  - Summarize all distinct paths in XML documents in a graph
  - Cardinality information is annotated on graph vertices
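The core of a full path summary is a count per distinct rooted path. The sketch below, assuming Python's xml.etree.ElementTree and a flat Counter instead of a graph, shows only that counting step; path_counts and the sample document are hypothetical.

```python
from collections import Counter
import xml.etree.ElementTree as ET

def path_counts(root):
    """Count how many nodes are reached by each distinct rooted path."""
    counts = Counter()
    def visit(node, prefix):
        path = prefix + "/" + node.tag
        counts[path] += 1
        for child in node:
            visit(child, path)
    visit(root, "")
    return counts

doc = ET.fromstring(
    "<db><book><title/><author/><author/></book><book><title/></book></db>"
)
summary = path_counts(doc)
print(dict(summary))
```

With such a summary, the cardinality of a simple path query like /db/book/author is a single lookup, with no access to the document itself.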
48Cardinality Estimation
- Partial path summarization: Markov Table (Aboulnaga 01)
  - Keep sub-paths and cardinality information in a table
  - Cardinalities for longer paths are calculated using partial paths.
  - Can use additional compression methods to accommodate Internet-scale databases
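One way to combine partial paths, assuming a table that stores counts for paths of length one and two and a first-order Markov assumption (each step depends only on the previous tag), is f(a/b/c) ≈ f(a/b) · f(b/c) / f(b). The counts below are invented, and estimate is a hypothetical helper.

```python
def estimate(path, table):
    """Estimate the cardinality of t1/t2/.../tn from length-1 and length-2 counts."""
    tags = path.split("/")
    if len(tags) <= 2:
        return table.get(path, 0)       # stored exactly
    result = table.get("/".join(tags[:2]), 0)
    for prev, cur in zip(tags[1:], tags[2:]):
        if table.get(prev, 0) == 0:
            return 0.0
        # Fraction of prev nodes' fan-out that leads to cur.
        result *= table.get(prev + "/" + cur, 0) / table[prev]
    return result

# Invented counts for illustration.
markov_table = {"a": 10, "b": 20, "c": 40, "a/b": 8, "b/c": 30}
print(estimate("a/b/c", markov_table))
```

The estimate is exact only when the number of c children under a b does not depend on whether that b sits under an a; correlated data will produce errors.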
49Cardinality Estimation
- Structural clustered summarization: XSketch (Neoklis 02) and TreeSketch (Neoklis 04)
  - Similar idea to the clustering-based structural index
  - XSketch uses forward and backward stability, and TreeSketch uses count stability, as the similarity measure
  - Heuristics reduce the graph to fit a memory budget
50Cardinality Estimation
- Decompression-based approach: XSEED (Zhang 06)
  - XML documents are compressed into a small kernel with edge cardinality labels
  - The kernel can be decompressed into an XML document with cardinality annotations
  - The navigational path operator can be reused on the decompressed XML document for cardinality estimation
51Cost Modeling
- Statistical learning cost model: COMET (Zhang 05)
  - Relational operator cost modeling is performed by analyzing the source code
  - XML operators are much more complex than relational operators, so the analytical approach is too time-consuming
  - The statistical learning approach needs a training set of queries and learns the cost model from the input parameters and the real cost.
52What does XML Offer?
- Two major points raised by XML from the data management viewpoint
  - Schema last
  - Complex network-oriented data model
53Schema Last
- Application categories
  1. Rigidly structured data
  2. Rigidly structured data with some text fields
  3. Semi-structured data (need to handle semantic heterogeneity)
  4. Text
- Very few examples of the 3rd category
- The 3rd category can be converted to 1 and 2.
54XML Data Model
- XML Records can be hierarchical as in IMS
- Have links as in CODASYL
- Have set-based attributes as in SDM
- Inherit from other records as in SDM
- And others that are known to be hard to implement
- Possible scenarios
  - XMLSchema will fail
  - A data-oriented subset of XMLSchema will be proposed
  - Repeat the great debate
- Lessons
  - L16: Schema-last is probably a niche market
  - L17: XQuery is pretty much OR (object-relational) SQL with a different syntax
  - L18: XML will not solve the semantic heterogeneity either inside or outside the enterprise
55Future of XML
- Likely a hot topic for many years for both data exchange and data integration
- Likely to become a common playground for DB and IR researchers (e.g., the INEX initiative, http://qmir.dcs.qmul.ac.uk/INEX/)
- Many challenges to solve!
- Would XML converge to either relational DB search or free text search?