Title: XML Publishing: Bridging Theory and Practice
1 XML Publishing: Bridging Theory and Practice
- Wenfei Fan
- University of Edinburgh
- and
- Bell Laboratories
2 XML documents
- Rooted, node-labeled, ordered, unranked tree
- element: tagged, with a subtree, e.g., course, prereq
- subelement: e.g., the prereq child of course
- text node: carries text, not tagged, a leaf, e.g., CS650
3 XML publishing: data exchange on the Web
- Most legacy data is stored in relational databases
- XML has become the prime standard for data exchange
[Figure: a source RDB is published as an XML view through a view mapping; DB1 and DB2 exchange XML over the Web, and a query Q is posed on the XML view]
4 XML publishing: an XML interface of databases
- Querying and updating traditional databases via XML views
[Figure: queries and answers in XML (with a DTD) pass through publishing and query-translation middleware on top of the DBMS and its RDB]
5 Example: XML publishing
[Figure: the Registrar DB (R) published as an XML view]
- Relational schema R0
- course (cno, title, type)
- prereq (cno1, cno2) -- prerequisite hierarchy
- XML DTD D0
- db → course*
- course → cno, title, type, prereq
- prereq → cno, prereq
- type → regular | project
6 XML publishing languages in practice
- XML view definition languages: XML views published from RDB
- Commercial products
- Microsoft SQL Server 2005 (FOR-XML, XSD)
- IBM DB2 XML Extender (SQL/XML, DAD SQL, RDB)
- Oracle 10g XML DB (SQL/XML, DBMS_XMLGEN)
- Research prototypes
- XPERANTO
- TreeQL (SilkRoute)
- ATG (PRATA)
7 XML publishing in practice
- Top-down from the root, via embedded relational queries
[Figure: the source RDB is mapped to the XML view; relational queries Q, Q1, Q2 expand the tree top-down: db, its course children, then cno, title, type and prereq under each course (e.g., CS650, Web DB, regular)]
8 XML publishing: question of the users
- What language should a user choose to express the view?
- unbounded depth, nondeterministic shape: cannot be decided statically at compile time
- prereq → cno, prereq
- type → regular | project
[Figure: the view tree: db with course children, each with cno, title, type (regular or project) and a prereq collection of unbounded depth (e.g., CS650, Web DB)]
- Few publishing languages can define this view
9 XML publishing: question of database vendors
- XML view: under each course, list all its prerequisites, direct or not
- collapsing the prerequisite hierarchy
- a tree of depth three
- Question: is it necessary to upgrade the DBMS and support SQL99?
[Figure: a depth-three view: db, course children generated by queries Q and Q1, and under each course its cno, title, type and the cno of each prerequisite (e.g., CS650, Web DB, project)]
- The expressive power and complexity of XML publishing languages
10 Outline
- XML publishing transducers
- Characterization of XML publishing languages in practice
- Complexity: evaluation cost, static analyses
- Expressive power: tree generation, relational characterization
- Dynamic aspect: incremental XML publishing, view updates
- Open research issues
- Joint work with
- Theory: Floris Geerts, Frank Neven (PODS 07)
- System: Michael Benedikt, Phil Bohannon, Cheeyong Chan, Rajeev Rastogi (SIGMOD 03, 04; VLDB 02, 04, 05; ICDE 07)
12 XML publishing transducers
- τ = (Q, Σ, q0, δ) for a relational schema R
- Q: a finite set of states
- Σ: a finite alphabet of XML tags, with a root r and text
- q0: the start state
- δ: one rule for each pair (q, a) in Q × Σ
- (q, a) → (q1, a1, φ1(x1; y1)), . . ., (qk, ak, φk(xk; yk))
- to generate the children of a-nodes: a1, . . ., ak
- register Rega: set-valued, fixed arity, associated with each a-node
- φi: a query over R and Rega that computes Regai, in a relational query language L
- xi: a list of free variables in φi, the grouping attributes
- deterministic
- (q, text) → . -- empty RHS: text nodes have no children
13 Top-down transduction
- Start rule δ(q0, r) -- q0 and r do not appear on the RHS of any rule
- (q0, db) → (q, course, φ1(c, t; ∅))
- φ1(c, t; ∅) = ∃t′ course(c, t′, t)
- recall course (cno, title, type)
- tuple register Regc: group the result by all attributes, x = (c, t), y = ∅
- for each distinct tuple tp in the result of φ1(x; ∅)
- create a course element
- carry the tuple tp in Regc
- expand at leaf nodes
[Figure: the root (q0, db) with (q, course) children, each carrying its own Regc; every node is labeled (q, a) and carries a register Reg]
14 Registers: tuple vs. relation
- (q, course) → (q, cno, φ2(c; ∅)), (q, type, φ3(t; ∅)), (q, prereq, φ4(∅; c))
- φ2(c; ∅) = ∃t Regc(c, t)
- φ4(∅; c) = ∃t, c′ (Regc(c′, t) ∧ prereq(c′, c))
- recall prereq(cno1, cno2)
- tuple registers: Regcno, Regt
- relation register Regp: x = ∅, y = (c), so the result of φ4(∅; c) is a set
- top-down information passing: the parent register Regc is used in φ4(∅; c)
[Figure: the root (q0, db) with (q, course) children carrying Regc; each course has children (q, cno), (q, type) and (q, prereq), carrying Regcno, Regt and Regp]
15 Recursive transducer and stop condition
- (q, prereq) → (q, cno, φ5(c; ∅)), (q, prereq, φ5(∅; c))
- φ5(∅; c) = ∃t, c′ (Regp(c′, t) ∧ prereq(c′, c))
- Stop conditions
- φ5(∅; c) returns an empty set
- the RHS of δ(q, a) is empty (e.g., for text nodes)
- there is an ancestor node with the same label, tag and register
- No new information can be added to the tree
[Figure: below (q0, db) and (q, course), a (q, prereq) node with relation register Regp expands into (q, cno) children with tuple registers Regcno and a nested (q, prereq) node with Regp; expansion of a node (q, a) with Rega STOPs when an ancestor (q, a) carries the same Rega]
16 Transformation of a publishing transducer τ
- τ terminates on a DB of R if all leaf nodes satisfy a stop condition
- τ(DB): the XML tree, obtained by striking out states and registers
- τ(R): the set of XML trees generated by τ for all DB of R
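The top-down semantics of slides 12 to 16 can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: relations are Python sets of tuples, each query is a plain function named after the slides' φ1 to φ4 (phi5 and phi6 are hypothetical names for the two queries of the recursive prereq rule), the helper names and the sample course titles are invented, and the view follows the rules on the slides (cno, type, prereq under each course).

```python
def tuple_regs(rows):
    """Tuple registers: one child per distinct tuple (group by all attributes)."""
    return [frozenset([r]) for r in sorted(set(rows))]

def relation_reg(rows):
    """Relation register: a single child carrying the whole non-empty result."""
    rows = frozenset(rows)
    return [rows] if rows else []

# Embedded queries; each takes the database and the parent register.
def phi1(db, reg):
    return tuple_regs((c, ty) for (c, _title, ty) in db["course"])
def phi2(db, reg):
    return tuple_regs((c,) for (c, _ty) in reg)
def phi3(db, reg):
    return tuple_regs((ty,) for (_c, ty) in reg)
def phi4(db, reg):
    return relation_reg((c2,) for (c, _ty) in reg
                        for (c1, c2) in db["prereq"] if c1 == c)
def phi5(db, reg):  # one cno child per prerequisite held in Regp
    return tuple_regs(reg)
def phi6(db, reg):  # prerequisites of the prerequisites held in Regp
    return relation_reg((c2,) for (c,) in reg
                        for (c1, c2) in db["prereq"] if c1 == c)

# delta: (state, tag) -> list of (child state, child tag, query)
RULES = {
    ("q0", "db"): [("q", "course", phi1)],
    ("q", "course"): [("q", "cno", phi2), ("q", "type", phi3), ("q", "prereq", phi4)],
    ("q", "prereq"): [("q", "cno", phi5), ("q", "prereq", phi6)],
    ("q", "cno"): [("q", "text", phi5)],
    ("q", "type"): [("q", "text", phi5)],
    ("q", "text"): [],  # empty RHS: text nodes have no children
}

def expand(db, state, tag, reg, ancestors=()):
    key = (state, tag, reg)
    if key in ancestors:  # stop: an ancestor has the same state, tag and register
        return (tag, reg, [])
    children = [expand(db, s, t, child_reg, ancestors + (key,))
                for (s, t, query) in RULES[(state, tag)]
                for child_reg in query(db, reg)]
    return (tag, reg, children)

def strike_out(node):
    """Drop states and registers; a text node becomes its string value."""
    tag, reg, children = node
    if tag == "text":
        (value,) = next(iter(reg))
        return value
    return (tag, [strike_out(c) for c in children])

def publish(db):
    return strike_out(expand(db, "q0", "db", frozenset()))

DB = {
    "course": {("CS650", "Web DB", "regular"), ("CS450", "Databases", "regular"),
               ("CS320", "Algorithms", "project")},
    "prereq": {("CS650", "CS450"), ("CS450", "CS320")},
}
view = publish(DB)
```

On a cyclic prereq relation the expansion still terminates, because the third stop condition fires as soon as a prereq node repeats the register of one of its ancestors.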
17 Publishing transducers with virtual nodes
- τ = (Q, Σ, Σa, q0, δ)
- Σa: a subset of Σ, the virtual tags
- Recall the view: under each course, list all its prerequisites
- τ = (Q, Σ, Σa = {prereq}, q0, δ)
- Virtual nodes are removed from the output
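A minimal sketch of how virtual nodes might be removed, assuming the output tree is represented as nested (tag, children) pairs with strings as text nodes (a representation chosen here for illustration): a virtual node is spliced out and its children are attached to its parent, in order.

```python
def remove_virtual(node, virtual):
    """Return the list of nodes that replaces `node` once virtual tags are spliced out."""
    if isinstance(node, str):  # text node
        return [node]
    tag, children = node
    kept = [n for child in children for n in remove_virtual(child, virtual)]
    return kept if tag in virtual else [(tag, kept)]

course = ("course", [
    ("cno", ["CS650"]),
    ("prereq", [("cno", ["CS450"]),
                ("prereq", [("cno", ["CS320"])])]),
])
flat = remove_virtual(course, {"prereq"})
```

With prereq declared virtual, the nested prerequisite hierarchy collapses into a flat list of cno children under the course, which is the "tree of depth three" view.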
18 Various classes of publishing transducers
- PT(L, S, O)
- L: the relational query language (CQ, FO, FP, with = and ≠)
- S: register, relation vs. tuple (a special case of relation Reg)
- O: output nodes, normal vs. virtual
- PTnr(L, S, O): the non-recursive subset of PT(L, S, O)
- Example
- View 1: PT(CQ, relation, normal)
- View 2: PT(CQ, relation, virtual) and PTnr(FP, tuple, normal)
- As opposed to query automata
- take a relational database as input, rather than an existing tree
- output a new tree, rather than accepting a tree or selecting nodes
- In contrast to recent work on schema mapping
- relations to XML, not relation-to-relation or XML-to-XML
- via embedded relational queries, not source-to-target constraints
20 Existing XML publishing languages
- Extensions of SQL by incorporating XML publishing functions
- Microsoft SQL Server 2005 (FOR-XML)
- IBM DB2 XML Extender (SQL/XML)
- Oracle 10g XML DB (SQL/XML, DBMS_XMLGEN)
- XPERANTO
- Annotating schema or fixed tree template with relational queries
- Microsoft SQL Server 2005 (XSD)
- IBM DB2 XML Extender (DAD SQL, RDB)
- TreeQL (SilkRoute)
- ATG (PRATA)
- . . .
21 Extensions of SQL for XML publishing
- SQL/XML: XMLElement, XMLForest, XMLAgg, XMLConcat, . . .
- SELECT XMLELEMENT (NAME "course",
- XMLFOREST (c.cno AS "cno", c.title AS "title"))
- FROM course c
[Figure: db with course children, each with cno and title]
- PTnr(FO, tuple, normal): neither recursion nor virtual nodes
- XPERANTO: PTnr(FO, tuple, normal)
- Microsoft SQL Server 2005 (FOR-XML): PTnr(FO, tuple, normal)
- Oracle 10g XML DB
- DBMS_XMLGEN: PT(FP, tuple, normal) (connect-by of SQL99)
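For readers without an SQL/XML engine at hand, the effect of the XMLELEMENT/XMLFOREST query above can be approximated with Python's standard sqlite3 and xml.etree.ElementTree modules. This is a sketch of the same non-recursive, tuple-at-a-time publishing step, not SQL/XML itself; the sample rows and the function name publish_courses are made up for illustration.

```python
import sqlite3
import xml.etree.ElementTree as ET

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE course (cno TEXT, title TEXT, type TEXT)")
conn.executemany(
    "INSERT INTO course VALUES (?, ?, ?)",
    [("CS650", "Web DB", "regular"), ("CS450", "Databases", "project")],
)

def publish_courses(conn):
    """Build <db><course><cno/><title/></course>...</db> from the course table."""
    db = ET.Element("db")
    for cno, title in conn.execute("SELECT cno, title FROM course ORDER BY cno"):
        course = ET.SubElement(db, "course")         # plays the role of XMLELEMENT
        ET.SubElement(course, "cno").text = cno      # plays the role of XMLFOREST
        ET.SubElement(course, "title").text = title
    return db

xml_text = ET.tostring(publish_courses(conn), encoding="unicode")
```

Each row yields exactly one course element of fixed shape, which is why such views fall in a non-recursive class with tuple registers.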
22 Annotating schema or tree template
- ATG of PRATA: DTD-directed view definition, inherited attributes
- prereq → cno, prereq
- cno ← Q(prereq_p), prereq_c ← Q(prereq_p) /* semantic rules */
- Q: SELECT p.cno2 FROM prereq p, prereq_p
- WHERE p.cno1 = prereq_p.cno
- prereq_p: parent attribute (relation register)
[Figure: a prereq node with cno children and a nested prereq child]
- PT(FO, relation, virtual): recursive views, virtual nodes, DTD-conformance
- Microsoft SQL Server 2005 (XSD): PTnr(CQ, tuple, normal)
- IBM DB2 XML Extender DAD-SQL: PTnr(FO, tuple, normal)
- DAD-RDB: PTnr(CQ, tuple, normal)
- TreeQL (SilkRoute): PTnr(CQ, tuple, virtual)
23 Putting these together
System                       Language        Class
Microsoft SQL Server 2005    FOR XML         PTnr(FO, tuple, normal)
                             annotated XSD   PTnr(CQ, tuple, normal)
IBM DB2 XML Extender         SQL/XML         PTnr(FO, tuple, normal)
                             DAD-SQL         PTnr(FO, tuple, normal)
                             DAD-RDB         PTnr(CQ, tuple, normal)
Oracle 10g XML DB            SQL/XML         PTnr(FO, tuple, normal)
                             DBMS_XMLGEN     PT(FP, tuple, normal)
XPERANTO                                     PTnr(FO, tuple, normal)
SilkRoute                    TreeQL          PTnr(CQ, tuple, virtual)
PRATA                        ATG             PT(FO, relation, virtual)
25 Termination and evaluation cost
- Given a publishing transducer τ defined for a relational schema R,
- does the transformation of τ terminate on all DB of R?
- how expensive is it to compute τ(DB)?
- τ(DB) is always defined on any instance DB of R.
- Worst-case data complexity
- EXPTIME if τ is in PT(L, tuple, O)
- 2EXPTIME if τ is in PT(L, relation, O)
- PTIME if τ is in PTnr(L, S, O)
- Tight bounds: DAG → tree, n-digit binary counter
- L and O have no impact on the worst-case data complexity
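One way to read the "DAG → tree" remark is that a relation of linear size can encode a DAG whose unfolding into a tree is exponentially larger. The sketch below illustrates that blow-up on a hypothetical "diamond chain" DAG (each node has two edges to the next one); it is an illustration of the phenomenon, not the construction used in the lower-bound proof.

```python
from functools import lru_cache

def diamond_chain(n):
    """DAG with nodes 0..n where node i has two edges to node i + 1."""
    dag = {i: [i + 1, i + 1] for i in range(n)}
    dag[n] = []
    return dag

def tree_size(dag, root):
    """Number of nodes in the tree obtained by unfolding the DAG from `root`."""
    @lru_cache(maxsize=None)
    def size(v):
        return 1 + sum(size(child) for child in dag[v])
    return size(root)

dag = diamond_chain(10)
unfolded = tree_size(dag, 0)  # 2**11 - 1 tree nodes from 11 DAG nodes
```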
26 Static analyses
- For a class PT(L, tuple, O) of publishing transducers,
- The emptiness problem: given τ in PT(L, tuple, O), can τ generate a nontrivial XML tree?
- Does the publishing transducer make sense?
- The membership problem: given an XML tree T and a transducer τ in PT(L, tuple, O), can τ generate T with some DB?
- Can τ generate XML views that the user wants?
- The equivalence problem: given τ1, τ2 in PT(L, tuple, O) on the same relational schema R, do τ1 and τ2 generate the same XML views over all instances of R?
- Optimization: can τ1 be replaced by a more efficient τ2?
27 Matching complexity bounds for static analyses
- PT(L, S, O) when L is either FO or FP: beyond reach
- emptiness, membership and equivalence: undecidable
- PT(CQ, S, O): slightly better
- Emptiness
- PTIME if O is normal
- NP-complete if O is virtual
- Membership
- Σ2p-complete for PT(CQ, tuple, normal)
- undecidable if S is relation or O is virtual
- Reduction from (a) the satisfiability problem for FO queries, and (b) the emptiness problem for 2-head DFA
- Equivalence: undecidable
- Reduction from the halting problem for 2RMs
28 Complexity bounds for non-recursive transducers
- PTnr(FO, S, O): all three problems remain undecidable
- PTnr(CQ, S, O): make our lives easier
- Emptiness: the same as PT(CQ, S, O)
- Membership (S is tuple)
- PTnr(CQ, tuple, normal): Σ2p-complete, no better
- PTnr(CQ, tuple, virtual): drops from undecidable to Σ2p-complete
- Establish the small model property
- Equivalence
- PTnr(CQ, tuple, O): drops from undecidable to Π3p-complete
- Lower bound: reduction from ???3SAT
- Upper bound: a constructive proof
29 Summary: complexity bounds
Fragment                    Equivalence     Emptiness     Membership
PT(FP, S, O)                undecidable     undecidable   undecidable
PT(FO, S, O)                undecidable     undecidable   undecidable
PT(CQ, tuple, normal)       undecidable     PTIME         Σ2p-complete
PT(CQ, relation, normal)    undecidable     PTIME         undecidable
PT(CQ, S, virtual)          undecidable     NP-complete   undecidable
PTnr(FO, S, O)              undecidable     undecidable   undecidable
PTnr(CQ, tuple, normal)     Π3p-complete    PTIME         Σ2p-complete
PTnr(CQ, tuple, virtual)    Π3p-complete    NP-complete   Σ2p-complete
31 Containment relation
- PT(FP, relation, virtual) = PT(FO, relation, virtual)
[Figure: containment diagram over the classes PT(FP, tup, virt), PT(CQ, rel, virt), PT(FP, rel, nm), PT(FP, tup, nm), PT(FO, rel, nm), PT(FO, tup, virt), PT(FO, tup, nm), PT(CQ, rel, nm), PTnr(FO, tup, nm), PT(CQ, tup, virt), PT(CQ, tup, nm), PTnr(CQ, tup, virt), PTnr(CQ, tup, nm)]
- XML view "under each course, list all its prerequisites, direct or not": no need to upgrade the DBMS and support SQL99
32 Compared to logical transduction
- (φdom(x), φroot(x), φedge(x, y), φ<(x, y), φfc(x, y), φns(x, y), φa(x))
- domain, root, edge, order, first-child, next-sibling, label
- define DAGs, unfold into a tree
- FO-transductions, SO-transductions (fixed k-arity), PTIME-FO-transductions, PSPACE-SO-transductions
- Publishing transducers vs. logical transductions
- L-transductions ⊆ PT(L, tuple, virtual)
- strict for FO
- PSPACE-SO-transductions ? PT(FP, relation, virtual) (ordered)
- PTIME-FO-transductions ? PT(FO, relation, virtual) (ordered)
- fixed-depth L-transductions = PTnr(L, tuple, O) (unordered tree)
- PTnr(L, tuple, O) ? fixed-depth L-transductions (L = FP, FO)
- No need to code stop conditions
33 DTD and specialized DTD
- DTD D = (Σ, r, P), P: a → α for each a ∈ Σ
- normalized: α ::= a1, . . ., ak | a1 + . . . + ak | a*
- Specialized DTD D′ = (Σ, D, g): D a DTD over Σ′, g: Σ′ → Σ
- T conforms to D′: there is T′ s.t. T = g(T′) and T′ conforms to D
- Captures MSO definable trees and regular trees
- Capturing (specialized) DTD
- specialized DTDs are definable in PT(FO, tuple, virtual)
- normalized DTDs are definable in PT(FO, tuple, normal)
- there are normalized DTDs not definable in PT(CQ, S, O)
- Check each a → α in FO, return a default in the presence of violation
- DTD-directed publishing: all members of a community (or industry) agree on a DTD and then exchange data w.r.t. the predefined DTD
34 Publishing transducer as a relational query
- Input: τ = (Q, Σ, q0, δ) for R, an output tag o ∈ Σ, a DB of R
- Output: the union of Rego(v) for all v in the tree generated
[Figure: the view tree generated from the RDB by relational queries Q1, Q2 (db, course, cno, type, title, prereq); the registers Reg of the nodes with the output tag are collected as the output]
35 Containment hierarchy as relational queries
- Flattened: PT(L, S, virtual) = PT(L, S, normal)
- PT(FP, relation, O) = PT(FO, relation, O)
[Figure: containment diagram over PT(FO, rel, O), PT(FP, tup, O), PT(CQ, rel, O), PT(FO, tup, O), PT(CQ, tup, O), PTnr(FO, tuple, O), PTnr(CQ, tuple, O); one containment is annotated "not strict if NLOGSPACE = PTIME"]
36 Complexity classes and relational query languages
- PT(FO, relation, O) captures PSPACE (ordered or unordered)
- Recognition problem can be determined using a PSPACE TM
- Simulate partial fixpoint queries and define a total order
- PT(FP, tuple, O) captures FP and thus PTIME (ordered)
- PT(FO, tuple, O)
- captures TC0FO and thus NLOGSPACE (ordered)
- ? TC0FO (unordered)
- Simulate transitive closure logic and vice versa
- PT(CQ, relation, O) contains deterministic datalog
- PT(CQ, tuple, O) captures linear datalog
- datalog: p(x) ← p1(x1), . . ., pk(xk)
- deterministic: each p(x) has only one rule
- linear: at most one pj is an IDB
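To make "linear datalog" concrete, consider the program closure(x, y) ← prereq(x, y); closure(x, y) ← closure(x, z), prereq(z, y), whose rule bodies contain at most one IDB predicate. The sketch below evaluates it bottom-up in Python; the predicate name closure and the function name all_prereqs are illustrative choices, not taken from the slides.

```python
def all_prereqs(prereq):
    """Least fixpoint of the linear program above, by semi-naive iteration."""
    closure = set(prereq)   # closure(x, y) <- prereq(x, y)
    frontier = set(prereq)  # facts derived in the previous round
    while frontier:
        # closure(x, y) <- closure(x, z), prereq(z, y): a single IDB atom in the body
        derived = {(x, y) for (x, z) in frontier
                   for (z2, y) in prereq if z == z2}
        frontier = derived - closure
        closure |= frontier
    return closure

pairs = all_prereqs({("CS650", "CS450"), ("CS450", "CS320")})
```

The result lists, for every course, all its prerequisites, direct or not, which is the query behind the depth-three view discussed earlier in the talk.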
37 Non-recursive classes as relational query languages
- PTnr(FO, tuple, O) captures FO (ordered or unordered)
- PTnr(CQ, tuple, O) captures UCQ (ordered or unordered)
- Simulate union of conjunctive queries and vice versa
- Those corresponding to existing XML publishing languages
- PTnr(FO, tuple, O): SQL/XML, FOR-XML (Microsoft), IBM DAD (SQL), . . .
- PTnr(CQ, tuple, O): XSD (Microsoft), TreeQL
38 Expressiveness as relational queries
Fragment                Complexity / language
PT(FP, relation, O)     PSPACE
PT(FO, relation, O)     PSPACE
PT(FP, tuple, O)        FP, PTIME (ordered databases)
PT(FO, tuple, O)        TC0FO, NLOGSPACE (ordered databases)
PT(CQ, relation, O)     ⊇ deterministic datalog
PT(CQ, tuple, O)        TC0CQ, linear datalog
PTnr(FO, tuple, O)      FO
PTnr(CQ, tuple, O)      UCQ
PT(L, S, virtual) = PT(L, S, normal)
40 Incremental publishing
- Input
- a publishing transducer τ for relational schema R
- an instance DB of R
- XML view T = τ(DB)
- relational updates ΔDB
- Output: XML updates ΔT such that T ⊕ ΔT = τ(DB ⊕ ΔDB)
- Commercial products: limited support
[Figure: updates ΔDB on the RDB are turned into incremental updates ΔT on the XML view by the publishing middleware on top of the DBMS]
41 Why incremental update?
[Figure: the source database DB, XML publishing, and the cached view T]
- Batch computation: recompute the entire XML tree from scratch
- large XML views may take several hours to produce!
- Incremental computation: compute the XML change ΔT
- Idea: the new view T′ = the old view T ⊕ ΔT
- Typically more efficient to compute ΔT (small) and update the old view T with ΔT
- Why? The new view T′ often differs only slightly from the old view T, so partial results computed earlier can be reused
42 Reduction approach
- Most XML middleware takes a reduction approach
- treat relational database systems (DBMS) as a black box,
- re-use as much functionality of the DBMS as possible
- Why not the reduction approach for incremental updates?
- XML views are recursive
- Few systems support WITH RECURSIVE (linear recursion)
- Fewer support its use in views
- None supports incremental update of recursive views (many algorithms are known for incremental updates of recursive views, but unfortunately not in practice)
- The lowest common denominator of functionality of DBMS -- no need for (recursive) view-update support
43 Sub-tree property
[Figure: a report tree with patient children; a patient has SSN, name, treatment and policy subelements (e.g., Cheney, 234)]
44 Storing and updating XML: a DAG representation
- Storing each XML sub-tree only once, at any level of granularity
- Associate an ID with each node in the tree (Skolem function)
- Small, unique value derived from the node's register
- A hash table H to map from (q, type, ID) to a node in the graph
- Sub-tree pool: each node has a reference count and a children list (q1, type1, ID1), (q2, type2, ID2), . . .
- XML update ΔT = (E+, E-) of edges ((q1, type1, ID1), (q2, type2, ID2))
- E-: remove (q2, type2, ID2) from the child list of (q1, type1, ID1) and decrement the reference count on (q2, type2, ID2)
- E+: insert (q2, type2, ID2) in the child list of (q1, type1, ID1) and increment the reference count on (q2, type2, ID2)
- Nodes with 0 reference counts move to the sub-tree pool to be reused
[Figure: hash table H with entries such as (tname, chemo), (inTreatment, iT23), (treatment, t123), (inTreatment, iT234), (treatment, t345), (treatment, t567), . . .]
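The DAG store described above can be sketched as a small Python class. This is a simplified illustration under assumptions of my own: node keys are (q, type, ID) triples as on the slide, the class name DagStore and its method names are hypothetical, and nodes in the pool are simply remembered in a set rather than garbage-collected or bounded.

```python
class DagStore:
    """Shared sub-trees with reference counts; updates are edge sets (E-, E+)."""

    def __init__(self):
        self.children = {}  # hash table H: node key -> ordered list of child keys
        self.refcount = {}  # node key -> number of incoming edges
        self.pool = set()   # unreferenced sub-trees kept for reuse

    def _ensure(self, key):
        self.children.setdefault(key, [])
        self.refcount.setdefault(key, 0)

    def apply(self, e_minus, e_plus):
        for parent, child in e_minus:
            self.children[parent].remove(child)
            self.refcount[child] -= 1
            if self.refcount[child] == 0:
                self.pool.add(child)    # kept for reuse, not destroyed
        for parent, child in e_plus:
            self._ensure(parent)
            self._ensure(child)
            self.children[parent].append(child)
            self.refcount[child] += 1
            self.pool.discard(child)    # the sub-tree is live again

root = ("q0", "report", "r")
p1, p2 = ("q", "patient", "p1"), ("q", "patient", "p2")
t = ("q", "treatment", "t123")

store = DagStore()
store.apply([], [(root, p1), (root, p2), (p1, t), (p2, t)])  # t is stored once, shared twice
shared = store.refcount[t]
store.apply([(p1, t), (p2, t)], [])                          # t becomes unreferenced
pooled = t in store.pool
store.apply([], [(p1, t)])                                   # reuse t from the pool
```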
45 Computing XML changes
- Computing XML changes ΔT from database changes ΔDB by incrementalizing SQL queries in a transducer
- select IP, P.tname2
- from ΔProcedure P, inTreatment IP
- where P.tname1 = IP
- Cuts (deletions): given ΔDB, deletions of the existing edges of T are determined by executing a fixed number of non-recursive Δ SQL queries; no recursion is involved (sub-tree property)
- Buds (new sub-tree generation): top-down iteration, evaluating non-recursive Δ SQL queries at each step
- Each new sub-tree is computed at most once, by sub-tree reuse (sub-tree pool), minimizing recomputations
- Partial results are complete up to a certain level at each step, allowing lazy evaluation and parallel processing
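The idea of incrementalizing a query can be illustrated on the join above. The sketch assumes a Procedure(tname1, tname2) relation and an inTreatment register holding treatment names, as the query on the slide suggests, and it covers insertions only; the sample tuples are invented. For insertions ΔProcedure, the new join result is the old result plus the join of ΔProcedure alone with the register, so the old Procedure tuples need not be re-read.

```python
def join(procedure, in_treatment):
    """select IP, P.tname2 from Procedure P, inTreatment IP where P.tname1 = IP"""
    return {(ip, t2) for (t1, t2) in procedure for ip in in_treatment if t1 == ip}

procedure = {("chemo", "bloodtest")}
delta_procedure = {("chemo", "scan"), ("xray", "film")}  # inserted tuples
in_treatment = {"chemo"}

old_result = join(procedure, in_treatment)
delta_result = join(delta_procedure, in_treatment)       # the delta query
new_result = old_result | delta_result                   # no recomputation from scratch
```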
46 Steps to Bud-Cut
1. For a set of database changes ΔDB, execute a fixed number of non-recursive queries which determine the direct edge changes E-, E+
[Figure: the report tree with patient children (SSN, name, treatment, policy; e.g., Cheney 234, Bush 123), where treatment subtrees contain tname and inTreatment nodes with nested treatment nodes]
47 The XML view update problem
- Input
- a publishing transducer τ for relational schema R
- an instance DB of R
- XML view T = τ(DB)
- XML updates ΔT
- Output: relational updates ΔDB such that T ⊕ ΔT = τ(DB ⊕ ΔDB)
- Commercial systems: limited support, already hard for relational views
[Figure: view updates ΔT on the XML view are translated into updates ΔDB on the RDB by the publishing middleware on top of the DBMS]
48 New challenges introduced by XML view updates
- Revising the semantics of side effects
- ΔT: delete course[cno=CS650]//course[cno=CS450]/prereq/
- Subtree property: remove the prerequisites of all CS450 occurrences?
- DTD validation (if any)
- recursively defined
- XML views
- XML updates
[Figure: the view tree db with course children (cno CS650, prereq, . . .); a nested course subtree is marked X for deletion]
49 Processing XML view updates
- Deriving relational views V from XML views (edge relations of the DAG, external storage)
1. DTD validation: reject ΔT if violation
2. Computing view updates ΔV from ΔT
3. Computing updates ΔDB from ΔV: may not exist; reject ΔT if not
4. Update the underlying DB and view V with ΔDB from ΔV
[Figure: ΔT on the XML view is mapped to ΔV on the relational views V, and then to ΔDB on the DB]
- Main challenges: relational view updates
- Hard: deciding view updatability is intractable/undecidable
- Open: complexity, algorithm, commercial system support
51 XML integration: complexity and expressiveness
[Figure: multiple, distributed sources (DB, DB, DB) are integrated into one XML document under a DTD and constraints]
- XML integration transducers
- Two-way vs. top-down: context-dependent generation
- Integrity constraints: conformance to XML schema
- Information preservation: data migration
- XML integration language: Attribute Integration Grammar (AIG)
52 XML shredding
[Figure: XML documents are shredded into the RDB; queries and answers pass through query-translation middleware on top of the DBMS]
- Storing XML data in relations: storage, query processing, RDBMS transaction control, . . .
- Primary goal
- store part or entire XML documents: content based
- increment existing relations, rather than build a new one
- directed by recursive XML schema
53 XML shredding automata
[Figure: a prereq node annotated with a query Q and a register Reg]
- Shredding automata vs. publishing transducers
- take an existing tree as input, rather than relations
- embedded XML queries, not relational, to compute Reg
- output: union of relation registers, the tuples to insert
- combining XML SAX parsing and shredding, e.g., XML2DB
- Primary goal: expressive power and complexity
54 Summary
- XML publishing: a synergy between theory and practice
- characterization of XML publishing languages in practice
- expressive power and matching complexity bounds
- helpful guidance for both the users and database vendors
- Dynamic aspects: incremental publishing and view updates
- important, yet by and large overlooked
- Open research issues
- XML integration transducers
- XML shredding automata
- . . .
- An attempt to bridge theory and practice