Title: XML ? Relational DB
1XML ? Relational DB
- Proposal by Richard Edwards
- Presented by Stephan Kepser
2Richard Edwards
- Computer Scientist
- School of Informatics
- University of Bangor, Wales
- rich_at_informatics.bangor.ac.uk
- http//www.informatics.bangor.ac.uk/rich/
- Source of this talk 2 PPT presentations by
Edwards.
3Aim
- To store arbitrary XML documents
- without recourse to underlying DTD or schema
- in a relational database
- in particular in a SQL database.
4In a Nutshell
- Store the Document Object Model of the XML
document. - Extend each node entry by two natural numbers
resembling the position of the node in the
document tree.
5Overview of Presentation
- XML and Databases
- Document object model (DOM)
- Storing the DOM in a relational database
- Adding navigational information
- Conclusions
6XML
- Extensible Markup Language
- Form of a class of documents may be given by
- a Document Type Definition (DTD)
- an XML Schema
- Model of a document A tree.
7Example
8Terminology
- Well-formed
- Conforms to the rules of XML
- Valid
- Document structure conforms to a specified
schema - Structure hierarchy of elements, attributes
content - DTD (SGML), XML Schema
9Categorising documents
- Data-centric
- Low-level atomic data
- Highly structured, e.g. a database extract
- Document-centric
- Semantic markup added to irregularly structured
or largely text-based information - These are loose categories
- Documents displaying the characteristics of both
are common, e.g. notes added to structured data.
10XML versus Databases
- An XML document is not a database.
- Database management systems offer
- Indexing (clustered and non-clustered)
- Constraints
- Foreign Keys
- Patterns
- Querying
- Multi-user access
11XML versus Databases
- Database management systems offer
- Security
- Transaction management
- Thread safety
- Crash recovery
- Cacheing
- Locking mechanisms
- Deadlock contention handling
12Document Object Model
- At the same time
- a model for arbitrary XML documents
- an application programming interface (API)
- for languages like Java, C, Python
- Base tree structure of a document
13DOM (2)
- A DOM is not the elements' tree of the document.
- Because an XML document contains much more than
elements - Attributes
- Comments
- Text nodes
- Notation
- Processing instructions
14DOM root
- A DOM is a tree.
- Its root is the document node.
- This is a new root node to attach the whole
document to. - Its daughters the top level objects of the
document. These are - the document's root element
- processing instructions
- lt?xml-stylesheet href"person.css"
type"text/css" ?gt
15DOM root (2)
- document type node
- lt!DOCTYPE person gt
- comment nodes
- lt!-- no comment --gt
16Node types
- Element node
- Attribute node
- Name and value of an attribute
- Document node
- Root of the DOM tree
- Document type node
- Root of the DTD
- Comment node
17Node types (2)
- Processing instruction node
- Text node
- Contains ordinary text of the document
- CDATA section node
- lt!CDATA blabla bla gt
- Document fragment node
- Root of a scrap board for editing,
- Only for processing, not part of a real
XML-document
18Node types (3)
- Entity node
- Definition of an entity in a DTD
- lt!ENTITY coord "((x, y)(y,x))"gt
- Entity reference node
- Optional reference of an entity, e.g., lt
- Notation node
- Definition of a notation in a DTD
- lt!NOTATION jpeg SYSTEM "image/jpeg"gt
19Node properties
- Node name
- Element or attribute name, or a fixed string
like comment - Node value
- Attribute value or the actual text of a text
node - Node type
- One of the node types above
20Node properties (2)
- Parent node
- List of child nodes
- First child
- Last child
- Previous sibling
- Next sibling
- Attributes set
21Leaf nodes
- Attribute node
- CDATA section node
- Comment node
- Entity node
- Entity reference node
- Notation node
- Processing instruction node
- Text node
22Example
- ltsample bogus"value"gt
- lttext_nodegtTest data.lt/text_nodegt
- lt/samplegt
23Example
24DOM
- The DOM contains a document's structure and ist
content. - It is a generic model, exists for any
XML-document. - It is independent of a DTD (may contain a DTD!)
or a schema.
25Persistent DOM
- Store DOM in relation database
- Translate the tree into a set of nodes
- Store node properties in tables.
26Database design
27Node type
28doc
29doc
30node
31node
32element_name_leaf
my_incredibly_verbosely_named_element
33text_leaf
34element_namespace_leaf
my_incredibly_verbose_nsmyelement
35More tables
- attribute
- node_id
- attribute_id
- attribute_namespace_leaf
- attribute_name_leaf
- attribute_value_leaf
36Yet more tables
- comment_leaf
- cdata_leaf
- entity_reference_leaf
- pi_data_leaf
- pi_target_leaf
37Database design repeated
38Labelling nodes
- Each node receives two natural numbers as
coordinates (x,y). - The tree is traversed depth-first.
- A counter is installed. On entering and leaving a
node it is incremented. - On entering a node, the node receives the current
counter value as x-coordinate. - On leaving a node, the node receives the current
counter value as y-coordinate.
39x 1 y ?
doc
1
?
- ltagt
- ltbgt
- ltc/gt
- ltd/gt
- lt/bgt
- lte/gt
- ltfgt
- ltggtlt/ggt
- lt/fgt
- lt/agt
SAX event startDocument() DB action insert doc
(and return unique doc ID)
40x 2 y ?
doc
1
?
- ltagt
- ltbgt
- ltc/gt
- ltd/gt
- lt/bgt
- lte/gt
- ltfgt
- ltggtlt/ggt
- lt/fgt
- lt/agt
a
2
?
SAX event startElement(nsURI, locName, qName,
attrs) DB action insert node, insert
element_name_leaf
41x 3 y ?
doc
1
?
- ltagt
- ltbgt
- ltc/gt
- ltd/gt
- lt/bgt
- lte/gt
- ltfgt
- ltggtlt/ggt
- lt/fgt
- lt/agt
a
2
?
b
3
?
SAX event startElement(nsURI, locName, qName,
attrs) DB action insert node, insert
element_name_leaf
42x 4 y ?
doc
1
?
- ltagt
- ltbgt
- ltc/gt
- ltd/gt
- lt/bgt
- lte/gt
- ltfgt
- ltggtlt/ggt
- lt/fgt
- lt/agt
a
2
?
b
3
?
c
4
?
SAX event startElement(nsURI, locName, qName,
attrs) DB action insert node, insert
element_name_leaf
43x 4 y 5
doc
1
?
- ltagt
- ltbgt
- ltc/gt
- ltd/gt
- lt/bgt
- lte/gt
- ltfgt
- ltggtlt/ggt
- lt/fgt
- lt/agt
a
2
?
b
3
?
c
4
5
SAX event endElement(nsURI, locName, qName) DB
action update node (set y index)
44x 6 y ?
doc
1
?
- ltagt
- ltbgt
- ltc/gt
- ltd/gt
- lt/bgt
- lte/gt
- ltfgt
- ltggtlt/ggt
- lt/fgt
- lt/agt
a
2
?
b
3
?
c
d
4
5
6
?
SAX event startElement(nsURI, locName, qName,
attrs) DB action insert node, insert
element_name_leaf
45x 6 y 7
doc
1
?
- ltagt
- ltbgt
- ltc/gt
- ltd/gt
- lt/bgt
- lte/gt
- ltfgt
- ltggtlt/ggt
- lt/fgt
- lt/agt
a
2
?
b
3
?
c
d
4
5
6
7
SAX event endElement(nsURI, locName, qName) DB
action update node (set y index)
46x 3 y 8
doc
1
?
- ltagt
- ltbgt
- ltc/gt
- ltd/gt
- lt/bgt
- lte/gt
- ltfgt
- ltggtlt/ggt
- lt/fgt
- lt/agt
a
2
?
b
3
8
c
d
4
5
6
7
SAX event endElement(nsURI, locName, qName) DB
action update node (set y index)
47x 9 y ?
doc
1
?
- ltagt
- ltbgt
- ltc/gt
- ltd/gt
- lt/bgt
- lte/gt
- ltfgt
- ltggtlt/ggt
- lt/fgt
- lt/agt
a
2
?
b
e
3
8
9
?
c
d
4
5
6
7
SAX event startElement(nsURI, locName, qName,
attrs) DB action insert node, insert
element_name_leaf
48x 9 y 10
doc
1
?
- ltagt
- ltbgt
- ltc/gt
- ltd/gt
- lt/bgt
- lte/gt
- ltfgt
- ltggtlt/ggt
- lt/fgt
- lt/agt
a
2
?
b
e
3
8
9
10
c
d
4
5
6
7
SAX event endElement(nsURI, locName, qName) DB
action update node (set y index)
49x 11 y ?
doc
1
?
- ltagt
- ltbgt
- ltc/gt
- ltd/gt
- lt/bgt
- lte/gt
- ltfgt
- ltggtlt/ggt
- lt/fgt
- lt/agt
a
2
?
b
e
f
3
8
9
10
11
?
c
d
4
5
6
7
SAX event startElement(nsURI, locName, qName,
attrs) DB action insert node, insert
element_name_leaf
50x 12 y ?
doc
1
?
- ltagt
- ltbgt
- ltc/gt
- ltd/gt
- lt/bgt
- lte/gt
- ltfgt
- ltggtlt/ggt
- lt/fgt
- lt/agt
a
2
?
b
e
f
3
8
9
10
11
?
c
d
g
?
12
4
5
6
7
SAX event startElement(nsURI, locName, qName,
attrs) DB action insert node, insert
element_name_leaf
51x 13 y 14
doc
1
?
- ltagt
- ltbgt
- ltc/gt
- ltd/gt
- lt/bgt
- lte/gt
- ltfgt
- ltggtlt/ggt
- lt/fgt
- lt/agt
a
2
?
b
e
f
3
8
9
10
11
?
c
d
g
?
12
4
5
6
7
SAX event characters(char ch, int start, int
length) DB action insert node, insert text_leaf
13
14
52x 12 y 15
doc
1
?
- ltagt
- ltbgt
- ltc/gt
- ltd/gt
- lt/bgt
- lte/gt
- ltfgt
- ltggtlt/ggt
- lt/fgt
- lt/agt
a
2
?
b
e
f
3
8
9
10
11
?
c
d
g
15
12
4
5
6
7
SAX event endElement(nsURI, locName, qName) DB
action update node (set y index)
13
14
53x 11 y 16
doc
1
?
- ltagt
- ltbgt
- ltc/gt
- ltd/gt
- lt/bgt
- lte/gt
- ltfgt
- ltggtlt/ggt
- lt/fgt
- lt/agt
a
2
?
b
e
f
3
8
9
10
11
16
c
d
g
15
12
4
5
6
7
SAX event endElement(nsURI, locName, qName) DB
action update node (set y index)
13
14
54x 2 y 17
doc
1
?
- ltagt
- ltbgt
- ltc/gt
- ltd/gt
- lt/bgt
- lte/gt
- ltfgt
- ltggtlt/ggt
- lt/fgt
- lt/agt
a
2
17
b
e
f
3
8
9
10
11
16
c
d
g
15
12
4
5
6
7
SAX event endElement(nsURI, locName, qName) DB
action update node (set y index)
13
14
55x 1 y 18
doc
1
18
- ltagt
- ltbgt
- ltc/gt
- ltd/gt
- lt/bgt
- lte/gt
- ltfgt
- ltggtlt/ggt
- lt/fgt
- lt/agt
a
2
17
b
e
f
3
8
9
10
11
16
c
d
g
15
12
4
5
6
7
SAX event endDocument() DB action update doc
(set success flag) do extras
13
14
56doc
1
18
- ltagt
- ltbgt
- ltc/gt
- ltd/gt
- lt/bgt
- lte/gt
- ltfgt
- ltggtlt/ggt
- lt/fgt
- lt/agt
a
2
17
b
e
f
3
8
9
10
11
16
c
d
g
15
12
4
5
6
7
13
14
57Navigation
- The next sibling (x,y) of any given node (x, y)
has xy1 - The first child (x,y) of any given node (x, y)
has xx1 - The set of nodes that originate from a given node
(x, y) have xltxlty
58Node labelling
- Facilitates rapid serialisation (serial output)
- No need to traverse the tree
- Facilitates simple locking mechanism
- But
- Maintenance of node labels
- Trade-off between select speed and update
efficiency
59Querying
- XPath expressions have to be translated into
complicated select statements. - Subcomponents of an XPath expression are
translated into where-clauses or temporary tables
(embeded select). - Node coordinates help a lot, but do not do
everything.
60Querying (2)
- Example
- XPath //book
- SQL select from node as nd, element_node_leaf
- as enl where nd.owner_doc_id 2321 and
- nd.node_id enl.node_id and
- enl.leaf_text 'text_node'
61Querying (3)
- XPath //book/author
- SQL select from node as nd1, node as nd2,
element_node_leaf as enl1, element_node_leaf as
enl2 where - nd1.owner_doc_id 2321 and nd1.node_id
enl1.node_id and enl1.leaf_text 'author' and - nd2.owner_doc_id 2321 and nd2.node_id
enl2.node_id and enl2.leaf_text 'book' and - nd1.x_index nd2.x_index 1
62Querying (4)
- XSLT and XQuery require many select-statements
and additional computations outside of the
database. - XSLT and XQuery Turing-complete
- SQL II (Standard) First-order logic
- SQL III FOL least fixed points