Title: Introduction to XML Algebra
1Introduction to XML Algebra
2Data Model
- Data model: the core data structures and data types supported by a DBMS
- A relational database uses a table (set-oriented) data model
- The XML format is a tree-structured hierarchical model
3Why Query Algebra (for XML) ?
- It is common to translate a query language into an algebra.
- First, the algebra is used to give a semantics for the query language.
- Second, the algebra is used to support query optimization.
4XML Algebra History
- Lore Algebra (August 1999)
- -- Stanford University
- IBM Algebra (September 1999)
- -- Oracle, IBM, Microsoft Corp
- YAT Algebra (May 2000)
- AT&T Algebra (June 2000)
- -- AT&T Bell Labs
- Niagara Algebra (2001)
- -- University of Wisconsin-Madison
5NIAGARA
- Title: Following the Paths of XML Data: An Algebraic Framework for XML Query Evaluation
- By Leonidas Galanis, Efstratios Viglas, David J. DeWitt, Jeffrey F. Naughton, and David Maier
- Univ. of Wisconsin
6Outline
- Concepts of Niagara Algebra
- Operations
- Optimization
7Goals of Niagara Algebra
- Be independent of schema information
- Query on both structure and content
- Generate simple, flexible, yet powerful algebraic expressions
- Allow re-use of traditional optimization techniques
8Example XML Source Documents
Invoice.xml
<Invoice_Document>
  <invoice No="1">
    <account_number>2</account_number>
    <carrier>ATT</carrier>
    <total>0.25</total>
  </invoice>
  <invoice>
    <account_number>1</account_number>
    <carrier>Sprint</carrier>
    <total>1.20</total>
  </invoice>
  <invoice>
    <account_number>1</account_number>
    <carrier>ATT</carrier>
    <total>0.75</total>
  </invoice>
</Invoice_Document>

Customer.xml
<Customer_Document>
  <customer>
    <account>1</account>
    <name>Tom</name>
  </customer>
  <customer>
    <account>2</account>
    <name>George</name>
  </customer>
</Customer_Document>
9XML Data Model and Tree Graph
<Invoice_Document>
  <invoice>
    <number>2</number>
    <carrier>ATT</carrier>
    <total>0.25</total>
  </invoice>
  <invoice>
    <number>1</number>
    <carrier>Sprint</carrier>
    <total>1.20</total>
  </invoice>
</Invoice_Document>

Tree graph of the document:
Invoice_Document
  Invoice
    number: 2
    carrier: ATT
    total: 0.25
  Invoice
    number: 1
    carrier: Sprint
    total: 1.20
Ordered tree graph, semi-structured data
10XML Data Model (for Querying)
- SQL: relations in, relation out.
- Relational Algebra: relations in, relation out.
- XQuery: XML doc in, XML docs out.
- XML Algebra: ??
11XML Data Model [GVDNM01]
- Collection of bags of vertices.
- Vertices in a bag have no order.
- Example
Bag: [Root invoice.xml, invoice, invoice.account_number]
  invoice: <invoice> invoice-element-content </invoice>
  invoice.account_number: <account_number> element-content </account_number>
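The bag in the example can be sketched in Python. This is only an illustration under stated assumptions: a bag is modelled as a dict from entry-point name to vertex, a collection as a list of bags, and vertices as `xml.etree.ElementTree` elements. None of these choices come from the Niagara system itself.

```python
import xml.etree.ElementTree as ET

# Assumption: a bag is a dict {entry-point name: vertex}; a collection is a list of bags.
INVOICE_XML = (
    "<Invoice_Document>"
    "<invoice><account_number>2</account_number>"
    "<carrier>ATT</carrier><total>0.25</total></invoice>"
    "</Invoice_Document>"
)

root = ET.fromstring(INVOICE_XML)
invoice = root.find("invoice")

# The bag [Root invoice.xml, invoice, invoice.account_number]
bag = {
    "Root invoice.xml": root,
    "invoice": invoice,
    "invoice.account_number": invoice.find("account_number"),
}
collection = [bag]

for name, vertex in bag.items():
    print(name, "->", vertex.tag)
```

A dict has no positional meaning for its entries, which matches the statement that vertices in a bag have no order.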
12Data Model
- Bag elements are reachable by path expressions.
- A path expression consists of two parts
- An entry point
- A relative forward part
- Example: account_number invoice (entry point invoice, relative forward part account_number)
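A minimal sketch of the two-part path expression, assuming the entry point is a bag entry name and the relative forward part is a sequence of child steps. The class name `PathExpr` and the method `bag_name` are hypothetical, chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class PathExpr:
    """Hypothetical model of a path expression in entry point notation."""
    entry_point: str            # name of an existing bag entry, e.g. "invoice"
    forward: Tuple[str, ...]    # relative forward part, e.g. ("account_number",)

    def bag_name(self) -> str:
        # Name under which the reached vertex appears in a bag.
        return ".".join((self.entry_point,) + self.forward)


path = PathExpr(entry_point="invoice", forward=("account_number",))
print(path.bag_name())
```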
13Outline
- Concepts of Niagara Algebra
- Operations
- Optimization
14Operators
- Source S, Follow φ, Expose ε, Vertex ν
- Source S, Select σ, Join ⋈, Rename ρ, Group γ, Union ∪, Intersection ∩, Difference −, Cartesian Product ×
15 Source Operator S
- Input: a list of documents
- Output: a collection of singleton bags
- Examples
- S(*): all known XML documents
- S(invoice.xml): all XML documents whose filename matches invoice.xml
- S(*, schema.dtd): all known XML documents that conform to schema.dtd
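The source operator can be sketched as a function that turns matching documents into singleton bags. Assumptions: the "known documents" are a hypothetical in-memory dict `KNOWN_DOCS`, filename matching uses shell-style wildcards via `fnmatch`, and DTD conformance is left out because the Python standard library does not validate against DTDs.

```python
import fnmatch
import xml.etree.ElementTree as ET

# Hypothetical store of "all known XML documents" (not part of the original slides).
KNOWN_DOCS = {
    "invoice.xml": "<Invoice_Document><invoice/></Invoice_Document>",
    "customer.xml": "<Customer_Document><customer/></Customer_Document>",
}


def source(pattern="*"):
    """Return one singleton bag per known document whose filename matches."""
    bags = []
    for name in sorted(KNOWN_DOCS):
        if fnmatch.fnmatch(name, pattern):
            bags.append({f"Root {name}": ET.fromstring(KNOWN_DOCS[name])})
    return bags


all_docs = source("*")            # S(*)
invoices = source("invoice.xml")  # S(invoice.xml)
print(len(all_docs), [list(bag) for bag in invoices])
```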
16Follow operator φ
- Input: a path expression in entry point notation
- Functionality: extracts vertices reachable by the path expression
- Output: a new bag that consists of the extracted vertex plus all contents of the original bag (in case of unnesting follow)
17Follow operator (Example)
Input bag: [Root invoice.xml, invoice]
  invoice: <invoice> invoice-element-content </invoice>
Unnesting Follow φ(carrier invoice)
Output bag: [Root invoice.xml, invoice, invoice.carrier]
  invoice: <invoice> invoice-element-content </invoice>
  invoice.carrier: <carrier> carrier-element-content </carrier>
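A rough sketch of unnesting follow, assuming bags are dicts from entry-point name to `ElementTree` element and the relative forward part is a single child step. The function name `follow` and its parameters are illustrative, and only the unnesting variant is covered.

```python
import xml.etree.ElementTree as ET


def follow(collection, step, entry_point):
    """Unnesting follow: emit one output bag per vertex reached from entry_point."""
    out = []
    for bag in collection:
        for vertex in bag[entry_point].findall(step):
            new_bag = dict(bag)                         # all contents of the original bag
            new_bag[f"{entry_point}.{step}"] = vertex   # plus the extracted vertex
            out.append(new_bag)
    return out


doc = ET.fromstring(
    "<Invoice_Document>"
    "<invoice><carrier>ATT</carrier></invoice>"
    "<invoice><carrier>Sprint</carrier></invoice>"
    "<invoice/>"
    "</Invoice_Document>"
)
invoices = [{"Root invoice.xml": doc, "invoice": inv} for inv in doc.findall("invoice")]

with_carrier = follow(invoices, "carrier", "invoice")
print([bag["invoice.carrier"].text for bag in with_carrier])
```

In this sketch an invoice with no carrier child produces no output bag, so the operator also acts as a filter on the existence of the path.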
18Select operator σ
- Input: a set of bags
- Functionality: filters the bags of a collection using a predicate
- Output: a set of bags that conform to the predicate
- Predicate: logical operators (∧, ∨, ¬) or simple qualifications (=, ≠, <, >, ≤, ≥)
19Select operator (Example)
Input: two bags
  [Root invoice.xml, invoice, ...], [Root invoice.xml, invoice, ...]
  invoice: <invoice> invoice-element-content </invoice> (one per bag)
σ invoice.carrier = Sprint
Output: one bag
  [Root invoice.xml, invoice, ...]
  invoice: <invoice> invoice-element-content </invoice>
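The select example can be sketched as a plain filter over a collection. Assumptions: bags are dicts of `ElementTree` elements, and the predicate is an ordinary Python callable standing in for the algebra's predicate language.

```python
import xml.etree.ElementTree as ET


def select(collection, predicate):
    """Keep only the bags that satisfy the predicate."""
    return [bag for bag in collection if predicate(bag)]


doc = ET.fromstring(
    "<Invoice_Document>"
    "<invoice><carrier>ATT</carrier></invoice>"
    "<invoice><carrier>Sprint</carrier></invoice>"
    "</Invoice_Document>"
)
bags = [
    {"Root invoice.xml": doc, "invoice": inv, "invoice.carrier": inv.find("carrier")}
    for inv in doc.findall("invoice")
]

# sigma: invoice.carrier = Sprint
sprint = select(bags, lambda bag: bag["invoice.carrier"].text == "Sprint")
print(len(bags), len(sprint))
```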
20Join operator
- Input: two collections of bags
- Functionality: joins the two collections based on a predicate
- Output: the concatenation of pairs of bags that satisfy the predicate
21Join operator (Example)
Input collections:
  [Root invoice.xml, invoice]
    invoice: <invoice> invoice-element-content </invoice>
  [Root customer.xml, customer]
    customer: <customer> customer-element-content </customer>
Join predicate: invoice.account_number = customer.account
Output bag: [Root invoice.xml, invoice, Root customer.xml, customer]
  invoice: <invoice> invoice-element-content </invoice>
  customer: <customer> customer-element-content </customer>
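A minimal nested-loop sketch of the join, assuming bags are dicts and that concatenating two bags means merging their entries. The helper `same_account` is hypothetical and compares invoice.account_number with customer.account.

```python
import xml.etree.ElementTree as ET


def join(left, right, predicate):
    """Concatenate every pair of bags that satisfies the predicate."""
    return [{**l, **r} for l in left for r in right if predicate(l, r)]


inv_doc = ET.fromstring(
    "<Invoice_Document>"
    "<invoice><account_number>2</account_number></invoice>"
    "<invoice><account_number>1</account_number></invoice>"
    "</Invoice_Document>"
)
cust_doc = ET.fromstring(
    "<Customer_Document>"
    "<customer><account>1</account><name>Tom</name></customer>"
    "<customer><account>2</account><name>George</name></customer>"
    "</Customer_Document>"
)
invoices = [{"Root invoice.xml": inv_doc, "invoice": v} for v in inv_doc.findall("invoice")]
customers = [{"Root customer.xml": cust_doc, "customer": v} for v in cust_doc.findall("customer")]


def same_account(l, r):
    # invoice.account_number = customer.account
    return l["invoice"].findtext("account_number") == r["customer"].findtext("account")


joined = join(invoices, customers, same_account)
pairs = [(b["invoice"].findtext("account_number"), b["customer"].findtext("name")) for b in joined]
print(pairs)
```

A nested loop is used only for brevity; because the operator behaves like a relational join, the usual join algorithms would apply equally.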
22Expose operator ε
- Input: a list of path expressions of vertices to be exposed
- Output: a set of bags that contain the vertices in the parameter list, in the same order
23Expose operator (Example)
Input bag: [Root invoice.xml, invoice, invoice.carrier, invoice.bill_period]
  invoice: <invoice> invoice-element-content </invoice>
  invoice.carrier: <carrier> carrier-element-content </carrier>
  invoice.bill_period: <bill_period> bill_period-element-content </bill_period>
ε(bill_period, carrier)
Output bag: [Root invoice.xml, invoice.bill_period, invoice.carrier]
  invoice.bill_period: <bill_period> bill_period-element-content </bill_period>
  invoice.carrier: <carrier> carrier-element-content </carrier>
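The expose example can be sketched as a projection that reorders bag entries. Assumptions: bags are insertion-ordered dicts, the Root entry is carried over as the example output suggests, and the `bill_period` value "June" is made-up sample data.

```python
import xml.etree.ElementTree as ET


def expose(collection, names):
    """Keep the Root entry and the listed vertices, in parameter-list order."""
    out = []
    for bag in collection:
        new_bag = {k: v for k, v in bag.items() if k.startswith("Root ")}
        for name in names:
            new_bag[name] = bag[name]
        out.append(new_bag)
    return out


doc = ET.fromstring(
    "<Invoice_Document><invoice>"
    "<carrier>Sprint</carrier><bill_period>June</bill_period>"  # sample value, not from the slides
    "</invoice></Invoice_Document>"
)
inv = doc.find("invoice")
bag = {
    "Root invoice.xml": doc,
    "invoice": inv,
    "invoice.carrier": inv.find("carrier"),
    "invoice.bill_period": inv.find("bill_period"),
}

exposed = expose([bag], ["invoice.bill_period", "invoice.carrier"])
print(list(exposed[0]))
```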
24Vertex operator ν
- Creates the actual XML vertex that will encompass everything created by an expose operator
- Example:
ν(Customer_invoice) ε(ν(account) invoice.account_number, ν(inv_total) invoice.total)
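One possible reading of the vertex example, sketched with `ElementTree`: each vertex operator creates a new element that wraps the vertices handed to it. Whether the original operator wraps the source element or only its content is not stated on the slide, so the wrapping behaviour here is an assumption, as is the helper name `vertex`.

```python
import copy
import xml.etree.ElementTree as ET


def vertex(tag, children):
    """Create a new element named `tag` that encompasses copies of `children`."""
    elem = ET.Element(tag)
    for child in children:
        elem.append(copy.deepcopy(child))  # copy so the source document is left untouched
    return elem


invoice = ET.fromstring(
    "<invoice><account_number>1</account_number><total>1.20</total></invoice>"
)
bag = {
    "invoice.account_number": invoice.find("account_number"),
    "invoice.total": invoice.find("total"),
}

result = vertex("Customer_invoice", [
    vertex("account", [bag["invoice.account_number"]]),
    vertex("inv_total", [bag["invoice.total"]]),
])
print(ET.tostring(result, encoding="unicode"))
```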
25Other operators
- Group γ is used for arbitrary grouping of elements based on their values
- Aggregate functions (e.g. average) can be used with the group operator
- Rename ρ changes the entry point annotation of elements of a bag
- Example: ρ(invoice.bill_period, date)
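A small sketch of rename and of group with an average aggregate. Assumptions: bag values are reduced to plain text for brevity, the function names `rename` and `group_avg` are hypothetical, and the `bill_period` value is made-up sample data.

```python
from collections import defaultdict


def rename(collection, old, new):
    """Change the entry point annotation `old` to `new` in every bag."""
    return [{(new if k == old else k): v for k, v in bag.items()} for bag in collection]


def group_avg(collection, key, value):
    """Group bags by the value under `key` and average the values under `value`."""
    groups = defaultdict(list)
    for bag in collection:
        groups[bag[key]].append(float(bag[value]))
    return {k: sum(v) / len(v) for k, v in groups.items()}


# Invoice totals per account number, taken from the example documents.
bags = [
    {"invoice.account_number": "2", "invoice.total": "0.25"},
    {"invoice.account_number": "1", "invoice.total": "1.20"},
    {"invoice.account_number": "1", "invoice.total": "0.75"},
]
averages = group_avg(bags, "invoice.account_number", "invoice.total")

# rho(invoice.bill_period, date); "June" is a sample value.
renamed = rename([{"invoice.bill_period": "June"}], "invoice.bill_period", "date")
print(averages, renamed)
```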
26Example XML Source Documents
Invoice.xml
<Invoice_Document>
  <invoice>
    <account_number>2</account_number>
    <carrier>ATT</carrier>
    <total>0.25</total>
  </invoice>
  <invoice>
    <account_number>1</account_number>
    <carrier>Sprint</carrier>
    <total>1.20</total>
  </invoice>
  <invoice>
    <account_number>1</account_number>
    <total>0.75</total>
  </invoice>
  <auditor>maria</auditor>
</Invoice_Document>

Customer.xml
<Customer_Document>
  <customer>
    <account>1</account>
    <name>Tom</name>
  </customer>
  <customer>
    <account>2</account>
    <name>George</name>
  </customer>
</Customer_Document>
27Xquery Example
- List account number, customer name, and invoice total for all invoices that have carrier Sprint.
FOR $i in (invoices.xml)//invoice,
    $c in (customers.xml)//customer
WHERE $i/carrier = "Sprint" and
      $i/account_number = $c/account
RETURN
  <Sprint_invoices>
    $i/account_number,
    $c/name,
    $i/total
  </Sprint_invoices>
28Example Xquery output
<Sprint_Invoice>
  <account_number>1</account_number>
  <name>Tom</name>
  <total>1.20</total>
</Sprint_Invoice>
29Algebra Tree Execution
Algebra tree, final result at the top; each operator is followed by its output:
Expose (.account_number, .name, .total)  ->  account_number, name, total
  Join (.invoice.account_number = .customer.account)  ->  invoice(2) customer(1)
    Select (carrier = Sprint)  ->  invoice(2)
      Follow (.invoice)  ->  invoice(1), invoice(2), invoice(3)
        Source (Invoices.xml)
    Follow (.customer)  ->  customer(1), customer(2)
      Source (customers.xml)
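The whole tree can be traced bottom-up in a short sketch. Assumptions: each operator is a small Python function over lists of dict bags, the documents are held as in-memory strings, and expose returns the text of the exposed vertices as a tuple rather than a new bag. All function signatures are illustrative.

```python
import xml.etree.ElementTree as ET

INVOICES = (
    "<Invoice_Document>"
    "<invoice><account_number>2</account_number><carrier>ATT</carrier><total>0.25</total></invoice>"
    "<invoice><account_number>1</account_number><carrier>Sprint</carrier><total>1.20</total></invoice>"
    "<invoice><account_number>1</account_number><total>0.75</total></invoice>"
    "<auditor>maria</auditor>"
    "</Invoice_Document>"
)
CUSTOMERS = (
    "<Customer_Document>"
    "<customer><account>1</account><name>Tom</name></customer>"
    "<customer><account>2</account><name>George</name></customer>"
    "</Customer_Document>"
)


def source(name, text):
    return [{f"Root {name}": ET.fromstring(text)}]


def follow(collection, entry_point, tag):
    return [{**bag, tag: v} for bag in collection for v in bag[entry_point].findall(tag)]


def select(collection, predicate):
    return [bag for bag in collection if predicate(bag)]


def join(left, right, predicate):
    return [{**l, **r} for l in left for r in right if predicate(l, r)]


def expose(collection, paths):
    # Each path is (entry point, child tag); the text of each exposed vertex is returned.
    return [tuple(bag[e].findtext(t) for e, t in paths) for bag in collection]


invoices = follow(source("invoices.xml", INVOICES), "Root invoices.xml", "invoice")
customers = follow(source("customers.xml", CUSTOMERS), "Root customers.xml", "customer")
sprint = select(invoices, lambda b: b["invoice"].findtext("carrier") == "Sprint")
joined = join(
    sprint, customers,
    lambda l, r: l["invoice"].findtext("account_number") == r["customer"].findtext("account"),
)
result = expose(joined, [("invoice", "account_number"), ("customer", "name"), ("invoice", "total")])
print(result)
```

The intermediate sizes follow the tree: three invoice bags after the follow, one after the select, and one joined bag pairing invoice (2) with customer (1).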
30Outline
- Concepts of Niagara Algebra
- Operations
- Optimization
31Optimization with Niagara
- Optimizer based on the Niagara algebra
- Use the operations more efficiently
- Produce simpler expressions by combining operations
32Language Convention
- A and B are path expressions
- A < B: path expression A is a prefix of B
- A ∩ B: greatest common prefix of paths A and B
- ⊥: null path expression
33Heuristics using Rewrite Rules
- Allow optimization based on path selectivity
- When applying unnesting with the follow operation φμ
34Interchangeability of Follow operation
- φμ(A) φμ(B) = φμ(B) φμ(A)
- TRUE or FALSE?
- TRUE when
- there exists C such that C < A, C < B and C = A ∩ B
- or A ∩ B = ⊥
35Application of Rule on Invoice
- φμ(acc_Num invoice) φμ(carrier invoice) = φμ(carrier invoice) φμ(acc_Num invoice) ?
- TRUE or FALSE?
36Application of Rule on Invoice
- φμ(acc_Num invoice) φμ(carrier invoice) = φμ(carrier invoice) φμ(acc_Num invoice)
- TRUE because both share the common prefix invoice.
- Case A ∩ B = invoice
37Benefit of Rule Application
- NOTE: Assume acc_Num is required for each invoice element, while carrier is not
- THEN
- φμ(acc_Num invoice) φμ(carrier invoice) = φμ(carrier invoice) φμ(acc_Num invoice)
- Then which algebra tree do we prefer?
38Discussion
- Reduction of input size on the first sub-operation
- φμ(carrier invoice) first (preferred: invoices without a carrier are dropped early)
- vs
- φμ(acc_Num invoice) first (every invoice is kept)
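The effect of operator order on intermediate result size can be sketched as follows. Assumptions: a simplified unnesting follow over dict bags, `account_number` standing in for the acc_Num shorthand, and a three-invoice sample in which only two invoices have a carrier.

```python
import xml.etree.ElementTree as ET


def follow(collection, step, entry_point):
    """Unnesting follow: one output bag per vertex reached from entry_point."""
    return [
        {**bag, f"{entry_point}.{step}": v}
        for bag in collection
        for v in bag[entry_point].findall(step)
    ]


doc = ET.fromstring(
    "<Invoice_Document>"
    "<invoice><account_number>2</account_number><carrier>ATT</carrier></invoice>"
    "<invoice><account_number>1</account_number><carrier>Sprint</carrier></invoice>"
    "<invoice><account_number>1</account_number></invoice>"
    "</Invoice_Document>"
)
start = [{"invoice": inv} for inv in doc.findall("invoice")]

# Order 1: required path first, optional path second.
acc_first = follow(start, "account_number", "invoice")
order1 = follow(acc_first, "carrier", "invoice")

# Order 2: optional (more selective) path first.
carrier_first = follow(start, "carrier", "invoice")
order2 = follow(carrier_first, "account_number", "invoice")

print(len(acc_first), len(carrier_first), len(order1), len(order2))
```

Both orders end with the same number of bags, but the second operator receives fewer bags when the more selective follow runs first.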
39Can we apply the rule below?
- φμ(acc_Num invoice) φμ(acc_Num customer)
40Example
- acc_Num invoice and acc_Num customer are two totally different paths
- Case is A ∩ B = ⊥
- So yes, the rule is valid.
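The interchangeability condition can be checked mechanically. This sketch rests on one reading of the rule: paths are tuples of steps starting at the entry point, "<" means proper prefix, the null path is the empty tuple, and two follows may be swapped when their greatest common prefix is either null or a proper prefix of both paths. The function names are hypothetical.

```python
def greatest_common_prefix(a, b):
    """Longest shared leading sequence of steps; () stands for the null path."""
    prefix = []
    for x, y in zip(a, b):
        if x != y:
            break
        prefix.append(x)
    return tuple(prefix)


def can_swap(a, b):
    """True when two unnesting follows over paths a and b may be interchanged."""
    c = greatest_common_prefix(a, b)
    if not c:                                   # A ∩ B is the null path
        return True
    return len(c) < len(a) and len(c) < len(b)  # C is a proper prefix of both A and B


# Same entry point: common prefix is invoice.
same_entry = can_swap(("invoice", "acc_Num"), ("invoice", "carrier"))
# Different entry points: common prefix is null.
different_entry = can_swap(("invoice", "acc_Num"), ("customer", "acc_Num"))
# One path is a prefix of the other: the second follow depends on the first.
dependent = can_swap(("invoice",), ("invoice", "carrier"))
print(same_entry, different_entry, dependent)
```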
41Summary
- XML Algebra
- Operations
- Optimization