Title: TECH854 XML TECHNOLOGIES XML Query Integrating XML and Databases
1TECH854XML TECHNOLOGIESXML QueryIntegrating
XML and Databases
- Gillian Miller
- gillian_at_ics.mq.edu.au
- Week 8 2002/2
2XML - Query and Databases
- SemiStructured Information
- XML Query
- XML Persistence
- XML Data design
- Native XML Databases
- XML enabling and relational databases
- Future directions
3Convergence Of Disparate Data Frameworks
Object Model
Relational Model
Document Model
Semi-Structured Data
XML
XML - allows representation of information from
previously disparate worlds - database centric
(everything is a relation) (everything is an
object), document centric XML alllows convergence
4Semi-structured data
Semi-Structured Data schemaless or
self-describing, data that may include multiple
occurrences or occurences that may be missing,
data that can be loosely typed
- Origins
- Integrating information from heterogeneous
sources - data with non-rigid structures
- biological data
- Web data
Ref Abiteboul et al - Data on the Web - 2000
5The Semistructured Data Model
Bib
o1
complex object
paper
paper
book
references
o12
o24
o29
references
references
author
page
author
year
author
title
http
title
title
publisher
author
author
author
o43
25
96
1997
last
firstname
atomic object
firstname
lastname
first
lastname
243
206
Serge
Abiteboul
Victor
122
133
Vianu
Object Exchange Model (OEM)
6Syntax for Semistructured Data
- Bib o1
- paper o12 ,
- book o24 ,
- paper o29
- author o52 Abiteboul,
- author o96 firstname 243
Victor, -
lastname o206 Vianu, - title o93 Regular path queries
with constraints, - references o12,
- references o24,
- pages o25
- first o64 122, last o92
133 -
-
Ref Suciu, CSE 590ds http//www.cs.washington.ed
u/ education/courses/cse590ds/01sp/
7Syntax for Semistructured Data
- omit oids -
- paper author Abiteboul,
- author firstname Victor,
- lastname
Vianu, - title Regular path queries
, - page first 122, last 133
-
-
8Comparison with Relational Data
- row name John, phone 3634 ,
- row name Sue, phone 6343 ,
- row name Dick, phone 6363
9Many different representations
r1
r2
row
row
row
row
row
row
row
row
r1
r2
r1
r2
a
c
b
a1
b1
c1
a
c
b
a1
b1
c1
r1
r2
10XML Similarities/Differenecs
- ltperson ido123gt
- ltnamegt Alan lt/namegt
- ltagegt 42 lt/agegt
- ltemailgt ab_at_com lt/emailgt
- lt/persongt
- person o123
- name Alan,
- age 42,
- email ab_at_com
ltperson fathero123gt lt/persongt
person father o123
similar on trees, different on graphs
11More XML Oids and References
- ltperson ido555gt ltnamegt Jane lt/namegt lt/persongt
- ltperson ido456gt ltnamegt Mary lt/namegt
- ltchildren
idrefo123 o555/gt - lt/persongt
- ltperson ido123 mothero456gtltnamegtJohnlt/namegt
- lt/persongt
Oids and references in XML are just syntax No
real difference logically between elements amd
attributes
12On Schemas and typing
- Schemas -
- XML schemas/DTDs have built in flexibility to
allow deviations - Use of schemas
- schema extraction - structure can be deduced
- schema inference
- if we have structure, should take advantage of it
- query opimization, data validation, type checking
13X Query
- New full powered query language for XML with
both document-centric and data-centric
capabilites - Expressive power
- Relational joins
- Navigation and hierarchy structure
- Compositionality (node sets)
- Reconstruction of new node sets
- Combining documents
- Filtering, sorting, functions
- see Maier, Database desiderata for query
languages
14XQuery
- XQuery combines data queries with document
queries - XQuery builds on XPath to traverse document
- XQuery is a functional language in which a query
is represented as an expression - XQuery expressions can be nested with full
generality - The input and output of an XQuery are instances
of the XML Query Data Model - Based on OQL, SQL, XML-QL, XPath
- Also - Lorel, XSQL, XSL, YATL ? Quilt
- ref XML Query Language - experiences and
exemplars
15XQuery Working Drafts
- Currently W3C Working Draft - recommendation in 6
months ? - The goal of W3C is to produce a data model for
XML documents, a set of query operators on that
data model, and a query language based on these
query operators - at least 8 Working Drafts
- Requirements - planning document
- Set of use cases - real world scenarios
- XQuery document - central document
- Data Model - Abstract representation
- Formal Semantics
- underlying algebra of operations
- Syntax
- More than one syntax XQueryX (XML syntax for
machines) - XPath documentation
- Basic functions and operators
16Example
Make an alphabetic list of publishers. Within
each publisher, make a list of books, each
containing a title and a price, in descending
order by price. ltpublisher_listgt FOR p IN
distinct(document("bib.xml")//publisher)
RETURN ltpublishergt ltnamegt
p/text() lt/namegt FOR b IN
document("bib.xml")//bookpublisher p
RETURN ltbookgt b/title b/price
lt/bookgt SORTBY(price DESCENDING)
lt/publishergt SORTBY(name)
lt/publisher_listgt
17XML Query Use Cases
- Use Cases
- Description, DTDs and Schema, Sample data,
Queries and Results - Current Use Cases
- "XMP" Experiences and Exemplars
- "TREE" Queries that preserve hierarchy
- "SEQ" - Queries based on Sequence
- "R" - Access to Relational Data
- "TEXT" Full-text Search
- "NS" - Queries Using Namespaces
- "PARTS" - Recursive Parts Explosion
- "REF" - Queries based on References
18Sample Data
19XML Query Data Model
- Describes XML as a tree of specialized nodes
- Uses a functional style notation (think Haskell)
- Node DocNode ElemNode
ValueNode
AttrNode NSNode
PINode CommentNode
InfoItemNode
RefNode
elemNode (QNameValue,
AttrNode , ElemNode
ValueNode) QNameValue means a tag
name ... means set of... ... means
list of ...
20XPath Data Model
- Building block for other W3C standards
- XSL Transformations (XSLT)
- XML Link (XLink)
- XML Pointer (XPointer)
- XML Query
The root
The root element
book
book
publisher
author
. . . .
Addison-Wesley
Serge Abiteboul
21XPath
- A path expression is a series of steps separated
by / - Can also sort and select nodes
22Xpath Revision
- bib matches a bib element
- matches any element
- / matches the root element
- /bib matches a bib element under root
- bib/paper matches a paper in bib
- bib//paper matches a paper in bib, at any depth
- //paper matches a paper at any depth
- paperbook matches a paper or a book
- _at_price matches a price attribute
- bib/book/_at_price matches price attribute in book,
in bib - bib/book/_at_pricelt55/author/lastname matches
23(No Transcript)
24XQuery FLWR Expressions
- A FLWR expression binds some expressions, applies
a predicate, and constructs a new result. - expr can contain FLWR expressions
- nested building blocks
FOR and LET clauses generate a list of tuples of
bound expressions, preserving document order.
WHERE clause applies a predicate, eliminating
some of the tuples
RETURN clause is executed for each surviving
tuple, generating an ordered list of outputs
25List the titles of books published by Morgan
Kaufmann in 1998. FOR b IN document("bib.xml")//
book WHERE b/publisher "Morgan Kaufmann
AND b/year "1998" RETURN b/title
List each publisher and the average price of its
books. FOR p IN distinct(document("bib.xml")//pu
blisher) LET a avg(document("bib.xml")//book
publisher p/price) RETURN ltpublishergt
ltnamegt p/text() lt/namegt ltavgpricegt
a lt/avgpricegt lt/publishergt
26Constructing Elements
ltbook isbn"isbn-0060229357"gt lttitlegtHarold
and the Purple Crayonlt/titlegt ltauthorgt
ltfirstgtCrockettlt/firstgt
ltlastgtJohnsonlt/lastgt lt/authorgt lt/bookgt
for i in //book RETURN ltexamplegt ltpgt Here is a
query. lt/pgt lteggt i//title lt/eggt ltpgt Here is
the result of the above query. lt/pgt lteggt i //
title lt/eggt ltsurnamegt i / author /
last(text() lt/surnamegt lt/examplegt
ltexamplegt ltpgt Here is a query. lt/pgt lteggt
i//title lt/eggt ltpgt Here is the result of the
above query. lt/pgt lteggtlttitlegtHarold and the
Purple Crayonlt/titlegtlt/eggt ltsurnamegtJohnsonlt/su
rnamegt lt/examplegt
27XQuery Expressions
- XQuery expressions
- Path expressions
- Element constructors
- FLWR expressions
- Expressions involving operators and functions
- Conditional expressions
- Quantified expressions
- List constructors
- Expressions that test or modify datatypes
28Combining Expressions
ltbibliographygt Expression lt/bibliographygt
ltbibliographygt FOR b IN Expression RETURN
ltbookgt Expression ,
Expression lt/bookgt SORTBY
(Expression , Expression ) lt/bibliographygt
ltbibliographygt FOR b IN
document("bib.xml")//book RETURN ltbookgt
b/author, b/title
lt/bookgt SORTBY (author,
title) lt/bibliographygt
29Functions
FOR p IN distinct(document("bib.xml")//publisher)
LET a avg(document("bib.xml") /book
publisher p /price) RETURN
ltpublishergt ltnamegt p/text()
lt/namegt ltavgpricegt a lt/avgpricegt
lt/publishergt
Q12 List each publisher and the average price
of its books.
30Conditional Expressions
FOR h IN //holding RETURN
ltholdinggt h/title,
IF h/_at_type "Journal" THEN
h/editor ELSE
h/author lt/holdinggt
SORTBY (title)
IF THEN ELSE construct (Q21) Make a list of
holdings, ordered by title. For journals,
include the editor, and for all other holdings,
include the author.
31Nesting
Usecase 4 For each author in the bibliography,
list the author's name and the titles of all
books by that author, grouped inside a "result"
element.
ltresultsgt for a in distinct-values(docume
nt("http//www.bn.com")//author) return
ltresultgt a for
b in document("http//www.bn.com")/bib/book
where some ba in b/author satisfies
deep-equal(ba,a) return
b/title lt/resultgt
lt/resultsgt
ltresultsgt ltresultgt ltauthorgt
ltlastgtStevenslt/lastgt
ltfirstgtW.lt/firstgt lt/authorgt
lttitlegtTCP/IP Illustratedlt/titlegt
lttitlegtAdvanced Programming in the Unix
environmentlt/titlegt lt/resultgt ltresultgt
ltauthorgt ltlastgtAbiteboullt/lastgt
ltfirstgtSergelt/firstgt
32FOR versus LET
- FOR
- Binds node variables ? iteration
- LET
- Binds collection variables ? one value
Returns ltresultgt ltbookgt...lt/bookgtlt/resultgt
ltresultgt ltbookgt...lt/bookgtlt/resultgt ltresultgt
ltbookgt...lt/bookgtlt/resultgt ...
FOR x IN document("bib.xml")/bib/book RETURN
ltresultgt x lt/resultgt
Returns ltresultgt ltbookgt...lt/bookgt
ltbookgt...lt/bookgt
ltbookgt...lt/bookgt ... lt/resultgt
LET x IN document("bib.xml")/bib/book RETURN
ltresultgt x lt/resultgt
33Collections in XQuery
- Ordered and unordered collections
- /bib/book/author an ordered collection
- Distinct(/bib/book/author) an unordered
collection - LET b /bib/book ? b is a collection
- b/author ? a collection (several authors...)
Returns ltresultgt ltauthorgt...lt/authorgt
ltauthorgt...lt/authorgt
ltauthorgt...lt/authorgt
... lt/resultgt
RETURN ltresultgt b/author lt/resultgt
34Collections in XQuery
- What about collections in expressions ?
- b/price ? list of n
prices - b/price 0.7 ? list of n numbers
- b/price b/quantity ? list of n x m numbers
?? - b/price (b/quant1 b/quant2) ? b/price
b/quant1 b/price b/quant2 !!
35Recursive Functions
NAMESPACE xsd"http//www.w3.org/2001/XMLSchema-
datatypes" FUNCTION depth(ELEMENT e) RETURNS
xsdinteger -- An empty element has depth
1 -- Otherwise, add 1 to max depth of
children IF empty(e/) THEN 1 ELSE
max(depth(e/)) 1 depth(document("partlist
.xml"))
(Q25) Find the maximum depth of the document
named "partlist.xml."
36XQuery Quantified Expressions
Existential quantifier (Q22) Find titles of books
in which both sailing and windsurfing are
mentioned in the same paragraph. FOR b IN
//book WHERE SOME p IN b//para SATISFIES
contains(p, "sailing") AND
contains(p, "windsurfing") RETURN b/title
Universal quantifier ((Q23) Find titles of books
in which sailing is mentioned in every
paragraph. FOR b IN //book WHERE
EVERY p IN b//para SATISFIES
contains(p, "sailing") RETURN b/title
37SQL - Expressive Power
- XQuery uses a for let where .. result
syntax for ? SQL from where ?
SQL where result ? SQL select let
allows temporary variables, and has no
equivalence in SQL - let binds to a set of nodes - groupby has no equivalence yet in XQuery
38SQL
ltresultgt for i in document("items.xml")//
item_tuple where i/start_date lt
current-date() and i/end_date gt
current-date() and contains(i/description,
"Bicycle") return ltitem_tuplegt
i/itemno i/description
lt/item_tuplegt sort by (itemno)
lt/resultgt
1.4.4.1 Q1 List the item number and description
of all bicycles that currently have an auction
in progress, ordered by item number.
39SQL Joins
ltresultgt for u in document("users.xml")/
/user_tuple for i in document("items.xml")//i
tem_tuple where u/rating gt "C" and
i/reserve_price gt 1000 and i/offered_by
u/userid return ltwarninggt
u/name u/rating
i/description
i/reserve_price lt/warninggt
lt/resultgt
1.4.4.3 Q3 Find cases where a user with a rating
worse (alphabetically, greater) than "C" is
offering an item with a reserve price of more
than 1000.
40XQueryX
- XQueryX is an XML representation of an XQuery
- It was created by mapping the productions of the
XQuery abstract syntax directly into XML
productions - XQueryX useful to enable
- Parser reuse
- Queries on queries
- Generation of queries
- Embedding of queries in XML documents
41XQueryX
Listing 2. The same query in XQueryX format
ltqquery xmlnsq"http//www.w3.org/2001/06/xquery
x"gt ltqflwrgt ltqletAssignment
variable"authors"gt ltqstep
axis"CHILD"gt ltqidentifier/gt
ltqstep axis"CHILD"gt
ltqidentifiergtbooklt/qidentifiergt
ltqidentifiergtauthorlt/qidentifiergt
lt/qstepgt lt/qstepgt
lt/qletAssignmentgt
ltqreturngt ltqelementConstructorgt
ltqtagNamegt
ltqidentifiergtAUTHORSlt/qidentifiergt
lt/qtagNamegt ltqvariablegtauthors
lt/qvariablegt lt/qelementConstructorgt
lt/qreturngt lt/qflwrgt lt/qquerygt
. A simple query in XQuery standard syntax
LET authors /book/author RETURN
ltAUTHORSgt authors
lt/AUTHORSgt
42XSQL
- XSQL is alternative language by relational
vendors (Oracle) - Based on well defined relational algebra and SQL
syntax - advantages of transfer at DBMS engineering
- Disadvantages - baggage of SQL
- Works well with XML that is relational in
structure - More contortions for less relational XML and
complicated transformations
43Some implemenations of XQuery
- Microsoft on-line demo
- http//xqueryservices.com
- Software AG QuiP
- http//www.softwareag.com/developer/downloads/defa
ult.htm - Fatdog
- http//www.fatdog.com
- SourceForge
- http//sourceforge.net/projects/xquench/
- Kweelt
- http//db.cis.upenn.edu/Kweelt/
44Is XML a Database ?
- XML is a collection of data
- XML is self-describing, portable and has rich
expressiveness to represent any tree or graph
structure - but verbose, needs parsing
- XML provides
- elementary storage (XML documents)
- schemas (DTD.s, XML Schema)
- query languages (XQuery, XPath)
- programming interfaces (SAX, DOM, JDOM)
45Need for DBMS
- XML Does not provide
- efficient storage
- indexes
- security
- transactions
- multi-user access
- queries across multi-user documents
- query optimisations
- For more comprehensive applications we will need
a database of some sort
46Data versus Documents
- Most important issue
- Is your information data-centric ?
- relational ? well structured ? typical business
type data, not too many nulls - eg sales order, product catalog
- Is your information document-centric
- semi-structured, textual content, mixed content,
complex hierarchies, content management , less
regular structure, larger grained data - eg docbook, legislation, university web site
content
But what about the product catalog complete with
brochures, advertising, images ???
47Application Use
- Data-centric
- transaction processing, e-commerce, critical
reconciliation or processing - Document-centric
- viewable information
- rich aggregated content
- archiving information
- For non-repudiation, an order should reflect all
of the information as the customer saw it, not as
it would be generated from a relational database
using a set of SQL queries - knowledge management, information dissemination
BUT Document-centric applications can involve
data-centric information !!!
48Storing XML
- Flat Files
- lightweight data storage
- OK for some applications (configuration files,
small in-house systems) - Relational Systems
- Universal Systems (relational true XML
extensions) - evolving to truer XML model
- Native XML Database
- Other categories
- middleware, XML Servers, XML Appplication Servers
(Zope, Cocoon), Content management (Documentum,
Vignette), caching systems
http//www.rpbourret.com/xml/XMLDatabaseProds.htm
49Native XML Databases
- Databases that store XML in "native" form
- indexed text or
- as some variant of the DOM mapped to an
underlying data store. - Different underlying engines
- Object Based
- Relational
- proprietary models
- Technology maturing, Commercial models expensive
- but less than universal databases - Examples
- Tamino
- eg Excelon (ObjectStore), Poet
- extensible Information Server (XIS)
- Open Source
- XIndice - Apache, propietary model
- XDB (Postgres relational based)
50(No Transcript)
51(No Transcript)
52Pure Relational Databases
- Convert your XML file to relational design
- For each complex element, create a table CE and
primary key - For each element with mixed content, create a
separate table to store PCData, with link back to
parent element using foreign key - For each single element and attribute, create a
column in table CE - Repeat for each complex child element, using
foreign keys to link back to parent
or do proper analysis using ERA !!
53Converting Relations to XML
- Attributes versus elements ???
- Containment versus Pointers
- Containment
- 11 and 1many best represented by containment
- eg invoice ( orderline)
- Pointers
- Many to many
- Keep elements separate - use IDs and IDrefs to
make mappings
54(No Transcript)
55Containment or Pointers
containment
pointers
56Containment and Pointers
57Storing XML in Relational Databases - CLOB
- Store as string (Character Large Object)
- E.g. store each top level element as a string
field of a tuple in a database - Use a separate relation for each top-level
element type - E.g. account, customer, depositor
- Indexing
- Store values of subelements/attributes as extra
indexes - Benefits
- Can store any XML data even without DTD
- As long as there are many top-level elements in a
document, strings are small compared to full
document, allowing faster access to individual
elements. - Drawback Need to parse strings to access values
inside the elements parsing is slow.
58Storing XML as Relations
- Tree representation model XML data as general
tree and store using relations - nodes(id, type, label, value)
- child (child-id, parent-id)
- Each element/attribute is given a unique
identifier - Type (element_or_attribute), labeltag,
valuecontent - child notes the parent-child relationships in
the tree - Can add an extra attribute to child to record
ordering of children - Benefit Can store any XML data, even without DTD
- Drawbacks
- Data is broken up into too many pieces,
increasing space overheads - Even simple queries require a large number of
joins, which can be slow -
59XML Enabled Databases
- FIRST WAVE
- Databases with extensions for transferring data
between XML documents and themselves - Useful for data-centric applications and legacy
databases - Any relational table can be rendered as
edge-labelled graph - This means a table can be expressed as XML using
either elements or attributes or both for each
data cell
60Generating XML from SQL queries
http//technet.oracle.com/tech/xml/info/htdocs/otn
wp/about_oracle_xml_products.htm
61(No Transcript)
62XML is really just a row view
SELECT EMPNO, ENAME FROM EMP WHERE EMPNO
7654 generates the following XML document
lt?xml version"1.0"?gt ltROWSETgt ltROW
id"1"gt ltEMPNOgt7654lt/EMPNOgt
ltENAMEgtMARTINlt/ENAMEgt lt/ROWgt lt/ROWSETgt
63General row to HTML table
Can write generalized XSLT even for this simple
row format
64Microsoft SQL Server
- Use FOR XML in SELECT statements
- RAW result is table with ROW elements
- AUTO elements are table name choose either
attributes or elements - EXPLICIT model using series of SELECT
statements Unioned together with tags to give
nesting hierarchy - XML View
- Wrapper describes shape of XML document
generated on the fly XDR schema maps to
relational schema - Then query XML View as though it were an XML
document
65XML Publishing SQL Server, RAW Mode
Select S.euSid, P.name, P.price From Stores S,
EuSales L, Products P Where S.euSid L.euSid AND
L.pid P.pid For XML Raw
ltrow euSid SLKDJFS, name Saint Emilion,
price23.99gt ltrow euSid DRJLKSD, name
Loire, price12.99gt . . . .
- flat XML
- default tag and attribute names
66XML Publishing SQL ServerAuto Mode
Select S.euSid, P.name, P.price From Stores S,
EuSales L, Products P Where S.euSid L.euSid AND
L.pid P.pid For XML Auto
ltStores euSid SLKDJFSgt ltProduct name
Saint Emilion, price23.99/gt ltProduct
name Loire, price12.99/gt lt/Storesgt ltStores
euSid FGJISODgt . . . . lt/Storesgt . . .
- nested XML
- default tag and attribute names
67XML Publishing SQL ServerExplicit Mapping
Note the complexity of generating a
simple hierarchy !
Wrox- Professional Databases p550
68Rowset from explicit Mode
69Mismatches between XML/RDBMS
- XML
- Data in single hierarchy
- Nodes have elements and/or attribute values
- Elements are nested
- Elements are ordered
- Schema optional
- Direct storage/retrieval of simple docs
- Query with XML standards
- RDBMS (normalized)
- Data in multiple tables
- Cells have single value
- Atomic cell values
- Row/column order not defined
- Schema required
- Joins necessary to rerieve simple docs
- Query with SQL retrofitted for XML
Michael Champion, Storing XML in Databases
But XML hierarchy does not address issues
of redundant information or different nestings -
the whole rationale for RDBMS !!
70XML Universal Databases
- Future trends
- Relational databases are being XML enabled to
become truer XML repositories - Moving beyond simple XML wrappers around
relations - and/or storing XML as LOB (binary, character)
- Trend towards unification of XML content and
relational data - New XML architectures and features (dynamic XML
Views, automatic mappings) - see for example - Seybold report, Oracle XML DB
Uniting XML Content and Data, 2002 - XTables - Bridging Relational Technologu and XML
- IBM 2002
71References
- Professional XML Databases - Kevin Williams et
al, Wrox, 2000 - W3C XML Query
- Data on the Web From Relations to Semistructured
data and XML -2000, Abiteboul et al - Rob Bourret - XML and Databases - as well as web
site - An introduction to XQuery, Katz, IBM Developer
Works - Oracle Technet site
- Other product sites including Tamino
- Suciu, CSE590ds - Management of XML and
semi-structured data - course - http//www.cs.washington.edu/education/courses/cse
590ds/01sp/