Title: XML Publishing
1XML Publishing
- XML publishing overview
- SilkRoute (ATT)
- XPERANTO (IBM Research)
- Clio (IBM Research and U. Toronto)
- Schema-directed XML Publishing
- ATG (Bell Labs and U. Edinburgh)
- PRATA implementation of ATG
- Extensions of ATG
2XML -- its an exchange format
Data base
Data base
HTML
Data base!
- Most data is stored in pre-existing databases
- Need to provide XML wrappers to export data
3XML publishing
- An XML view definition language specifying
desired mapping - Efficient implementation of the view language
Q XML view
4From relations to XML Views
ltActorgt ltLnamegtViterellilt/Lnamegt
ltFnamegtJoelt/Fnamegt ltMovie year1999gt Analyze
This lt/Moviegt ltMovie year2001gt See Spot
Run lt/Moviegt lt/Actorgt ltActorgt
ltLnamegtWinterlt/Lnamegt ltFnamegtAlexlt/Fnamegt
ltMovie year1988gt Bill and Teds Excellent
Adventure lt/Moviegt
Actor(aid, lname, fname) lt001, Viterelli,
Joegt, Movie(mid, title, year) lt011,
Analyze This, 1999gt, lt032, See Spot Run,
2001gt Appearance(mid, aid) lt001, 011gt,
lt001, 032gt
5Commercial systems -- canonical publishing
- Canonical publishing the universal-relation
approach - Embedding single SQL query in XSL stylesheet
- Result canonical XML representation of
relations - Systems
- Oracle 10g XML SQL facilities SQL/XML, XMLGen
- IBM DB2 XML Extender SQL/XML, DAD
- Microsoft SQL Server 2005 FOR-XML, XSD
- incapable of expressing practical XML publishing
default fixed XML document template
6Canonical publishing
A.lname A.fname M.year M.title Viterelli
Joe 1999 See Spot Run Viterelli Joe
2001 Analyze This
IBM DB2 SQL Statement giving big relation
ltSQL_stmtgt SELECT A.lname,A.fname, M.year,
M.title FROM Movie M, Actor A, Appearance
Ap WHERE M.midAp.mid AND M.aidA.Aid ORDER BY
aid lt/SQL_stmtgt
ltActorgt ltLnamegtViterellilt/Lnamegt
ltFnamegtJoelt/Fnamegt ltMovie year1999gt Analyze
This lt/Moviegt ltMovie year2001gt See Spot
Run lt/Moviegt lt/Actorgt ltActorgt
ltLnamegtWinterlt/Lnamegt ltFnamegtAlexlt/Fnamegt
Formatting Template annotated with columns of
the universal relation
ltelement_node Actorgt ltelement_node Lnamegt
lttext_nodegt ltColumn nameA.lname/gt
lt/text_nodegt
7Middleware Approach
View Definition
Query-cost Estimates Source Capabilities
Query Generator
Request
SQL Queries
Tagger
8XPERANTO
- Commercial system IBM DB2 XML extender,
SQL/XML - Middleware (vendor-independent) XPERANTO
- Extending SQL with XML constructors
- select XML-aggregation
- from R1, . . ., Rn
- where conditions
- XML constructors (XML-aggregation) functions
- Input tables and XML trees (forest)
- Output XML tree
9XML publishing with XPERANTO (SQL/XML)
- Relational schema
- Extended SQL
- select XMLAGG( ACT(lname, fname,
- select XMLAGG ( MOV(title, year)
- from Appearance Ap, Movie M
- where Ap.aid A.aid and Ap.mid
M.mid - group order by A.lname, A. fname
)) - from Actor A
ltActorgt ltLnamegtViterellilt/Lnamegt
ltFnamegtJoelt/Fnamegt ltMovie year1999gt . . .
lt/Moviegt ltMovie year2001gt . . .
lt/Moviegt lt/Actorgt ltActorgt ltLnamegtWinterlt/Lnamegt
ltFnamegtAlexlt/Fnamegt
10XML constructors
- Actor constructor
- create function ACT(lname str, fname str,
mlist XML) - ltActorgt
- ltLnamegt lname lt/Lnamegt
- ltFnamegt fname lt/Fnamegt
- mlist
- lt/Actorgt
- Movie constructor (mlist)
- create function Mov(title str, year int)
- ltMovie yearyeargt title lt/Moviegt
- Verbose and cumbersome
- small document tedious
- large documents unthinkable
11SilkRoute
- Annotated template embedding SQL in a fixed XML
tree - Middleware SilkRoute
- Commercial SQL Server 2005 XSD, IBM DB2 DAD
- Advantages
- More modular comparing to the universal
relation approach - Limited schema-driven conforming to a fixed doc
template
12Clio
- Schema mapping based on inter-schema constraints
- Given two schemas S1 and S2, we want to translate
instances of S1 to instances of S2.
Constraint-driven approach - Start with a set of referential integrity
constraints (foreign keys) from S1 to S2 (schema
matching) - Derive schema mapping from S1 to S2, by reasoning
about the given integrity constraints (chasing
technique) - Pros and cons
- Generic system, and can be used for XML
publishing - Semi-automated
- Chasing may not terminates for XML DTD/Schema
- Recursive target schema (S2)? unclear
13Getting real data exchange on the Web
- All members of a community (industry) agree on a
DTD and then exchange data w.r.t. it
e-commerce, health-care, ... - XML Publishing
- mapping relational data to XML
- conforming to the predefined DTD
Web
DTD
XML
XML
Q XML view
DB1
DB2
14Data exchange insurance company and hospital
- Daily report
-
- Relational database R at the hospital
- Patient (SSN, name, tname, policy, date)
- inTreatment (tname, cost)
- outTreatment (tname, referral)
- Procedure (tname1, tname2)
- treatment
- in hospital composition hierarchy in Procedure
- outside of the hospital referral
hospital
insurance company
XML view
R
XML
15Example insurance company and hospital
- DTD D predefined by the insurance company
- report ? patient
- patient ? SSN, pname,
treatment, policy - treatment ? tname, (inTreatment
outTreatment) - inTreatment ? treatment
- outTreatment ? referral
- How to define a mapping ? such that for any
instance DB of R, - ? (DB) is an XML document containing all the
patients and their treatments (hierarchy,
referral) from DB, and - ? (DB) conforms to D?
16Challenge recursive type
- XML data unbounded depth -- cannot be decided
statically - treatment ? tname, (inTreatment
outTreatment) - inTreatment ? treatment ---
recursive
SSN
123
17Challenge non-determinism
- The choice of a production (element type
definition) - treatment ? tname, (inTreatment outTreatment)
- -- depends on the underlining relational data
report
18Existing systems
- fixed XML tree template or ignoring
DTD-conformance - middleware SilkRoute (ATT), XPERANTO (IBM),
- systems SQL Server 2005, IBM DB2 XML extender,
- incapable of coping with a predefined DTD (e.g.
recursion) - type checking define a view and then check its
conformance - undecidable in general, co-NEXPTIME for extremely
restricted view definitions - no guidance on how to define XML views that
typecheck - one gets an XML view that typechecks only after
repeated failures and with luck
19XML Publishing
- XML publishing overview
- SilkRoute (ATT)
- XPERANTO (IBM Research)
- Clio (IBM Research and U. Toronto)
- Schema-directed XML Publishing
- ATG (Bell Labs and U. Edinburgh)
- PRATA implementation of ATG
- Extensions of ATG
20Attribute Translation Grammar (ATG)
- DTD normalized element type definitions e ? ?
- ? PCDATA ? e1, , en e1
en e - Attributes e associated with each element type
e - e tuple-valued, to pass data value as well as
control - Rules associated with each e ? ? for e in ?,
e Q(e) - SQL query Q extracts data from DB
- parent attribute e as a constant parameter in Q
21Semantics conceptual evaluation
- Top-down
- report ? patient
- patient ? select SSN, name, tname,
policy - from Patient --- SQL
query - recall Patient (SSN, name, tname, policy)
- Data-driven a patient element for each tuple in
Patient relation
patient
patient
patient
patient
22Inherited attributes
- Inherited child is computed using parent
- patient ? SSN, name, treatment, policy
- SSN patient.SSN, name
patient.name - treatment patient.tname policy
patient.policy - recall patient (SSN, name, tname, policy)
- SSN ? PCDATA
- PCDATA SSN
report
...
patient
patient
patient
patient
patient
treatment
name
policy
SSN
SSN
123
Joe
LU23
PCDATA
23Coping with non-determinism
- treatment ? tname, (inTreatment
outTreatment) - tname treatment
- (inTreatment, outTreatment)
- case Qc(treatment).tag
--- conditional query - 1 (treatment, null)
- else (null, treatment)
- Qc select 1 as tag from inTreatment where
tname treatment - conditional query the choice of production
- parent as constant parameter in SQL query
...
treatment
treatment
tname
tname
inTreatment
outTreatment
24Coping with recursion
- inTreatment ? treatment
- treatment ? select tname2
- from Procedure
- where inTreatment tname1
- recall Procedure (tname1, tname2)
- parent as constant parameter in SQL query Q
- inTreatment is further expanded as long as Q(DB)
is nonempty
treatment
treatment
25DTD-directed publishing with ATGs
- DTD-directed the XML tree is constructed
strictly following the productions of a DTD ---
DTD conformance - Data-driven the choice of productions and
expansion of the XML tree (recursion) depends on
relational data
report
...
patient
patient
patient
patient
treatment
name
SSN
inTreatment
...
tname
Joe
123
treatment
treatment
...
...
26ATGs vs. existing systems
- DTD-conformance
- ATGs provide guidance for how to define
DTD-directed publishing - Other systems based on a fixed tree template
- Expressive power strictly more expressive than
others - ATGs capable of expressing XML views supported
by other systems - Other systems cannot handle recursion/nondetermin
ism
27ATGs vs. Attribute Grammars (AGs)
- AGs
- Definition w.r.t. a CFG
- Evaluation parse a string with the CFG, then
evaluate attributes given the parse tree - ATGs combining DTD and database operations
- Definition w.r.t. an ECFG (DTD) and SQL queries
- Evaluation given DB, extract relevant data from
DB with SQL queries to build an XML tree of the
DTD - It does not make sense to parse a database
w.r.t. a DTD - ATGs are not a mild variation of AGs
28XML Publishing
- XML publishing overview
- SilkRoute (ATT)
- XPERANTO (IBM Research)
- Clio (IBM Research and U. Toronto)
- Schema-directed XML Publishing
- ATG (Bell Labs and U. Edinburgh)
- PRATA implementation of ATG
- Possible extensions of ATG
29PRATA middleware based on ATGs
query plan generation, evaluation
relations
ATG graph
XML
ATG
tagging
parsing
cost estimate query
statistics query results
- ATG graph representing the ATG
- relations representing root-leaf paths ---gt
- tagging one pass
30Evaluation of ATGs
- Conceptual evaluation defining semantics, but
not efficient - Techniques proved useful
- partitioning reduce db visits, make use of DBMS
optimizer (handling annotated templates
complete tree) - Relations representing XML trees root-leaf paths
- Separate tagging
31ATG graph
- edge labels SQL queries
- cyclic introduced by recursion -- not in
other systems - iterative unfolding partial ATG tree to a
depth d
report
Q1
patient
Q5
Q2
policy
Q3
Q4
Q8
name
SSN
treatment
Q7
Qc
tname
outTreatment
inTreatment
Q9
Q6
referral
procedure
32Partitioning partial ATG trees
- query composition (outer join)
- challenge how to partition into clusters?
- finer ? fewer nulls, smaller queries sent to
DBMS - coarser ? leverage DBMS optimizer, less DB visits
- Finding an optimal strategy NP-hard
report
composed Q
Q1
patient
Q2
Q5
Q3
SSN
policy
Q4
name
composed Q
treatment
Qc
Q7
outTreatment
tname
inTreatment
Q9
Q6
procedure
referral
...
Q8
treatment
33Materialization of intermediate results
- Partitioning root-leaf relation R ?
intermediate R1, R2, R3, R4 - repeated computation each Ri carrying a key from
the root - materialization storing keys in temporary
tables - cons overhead with temporary tables
- pros reducing recomputation in descendant
queries - What keys to materialize? Not supported by other
systems
output relation R
R1
sort merge
key
R2
R3
sort merge
key
R4
34Query plan generation
- New challenge find best partitioning and
materialization - mutually dependent partitioning and
materialization must be taken together - expensive exponentially many choices NP-hard
- Efficient cost-based heuristic
- DBMS statistics arity/cardinality of a table,
cost to evaluate a query/create temporary table - cost estimate simple formulae for computing the
benefit of materialization/partitioning --
effective in practice - optimal strategy -- dynamic programming
decides partitioning w.r.t. the benefit of
materialization
35Putting these together
- ATGs
- ensure DTD-conformance automatically
- support recursion/non-determinism naturally
- simple no need for a new language knowing DTD
SQL is all that one needs for writing an ATG - PRATA middleware based on ATGs
- novel technique of combining partitioning and
materialization - a practical solution for DTD-directed publishing
- ATG the first systematic way for DTD-directed
publishing
36XML Publishing
- XML publishing overview
- SilkRoute (ATT)
- XPERANTO (IBM Research)
- Clio (IBM Research and U. Toronto)
- Schema-directed XML Publishing
- ATG (Bell Labs and U. Edinburgh)
- PRATA implementation of ATG
- Extensions of ATG
37Extension Capturing integrity constraints
- XML schema
- Type (DTD)
- Integrity constraints keys, foreign keys
- Schema-directed XML publishing automatically
guarantee that the target document both conforms
to the type and satisfies the constraints -- in
a single framework - Challenge consistency analyses undecidable
DTD
XML ATG
DB
38Data integration in XML
- multiple, heterogeneous data sources
multi-source queries, query decomposition, object
fusion, - distributed sources scheduling of query
execution - schema-conformance,
query
answer
schema
(D, ?)
XML
query translation
integration
Integration middleware
updates
DB4
DB1
DB2
DB3
39Example hospital and insurance company
- patient (SSN, name, policy)
- visitInfo (SSN, trId, date)
billing (trId, price)
cover (policy, trId)
daily XML report
treatment (trId, name) procedure (trId1, trId2)
- Given a date, for each patient of that day,
report - SSN, name (DB1)
- treatments (hierarchy) covered by insurance
(DB1, DB3, DB4) - cost of all and only those treatments received
(DB2)
40Predefined schema (D, ?) DTD
- report ? patient
- patient ? SSN, name,
treatments, bill - treatments ? treatment
- treatment ? trId, tname,
treatments - bill ? item
- item ? trId, price
report
date
patient
patient
patient
treatments
bill
name
SSN
item
treatment
item
treatment
. . .
price
trId
trId
tname
treatments
41Predefined schema (D, ?) XML constraints ?
- constraints relative to each patient,
- key each treatment is charged only once
- patient ( item.trId ? item )
- foreign key every treatment has a billing record
- patient ( treatment.trId ? item.trId )
report
date
patient
patient
patient
name
treatments
SSN
bill
treatment
item
item
treatment
. . .
price
trId
treatments
42More on XML integrity constraints
- absolute on the entire document
- key country.name ? country
- relative on a subdocument rooted at a country
- key country (province.name ?
province) - foreign key country (capital.inProvince ?
province.name) - (value inclusion)
43Challenge context dependency
- bill subtree all and only the trIds in the
treatments subtree - controlled derivation the bill subtree cannot be
started before the treatments subtree is
completed. - information passing downward, upward, sideways
date
report
patient
date
SSN
treatments
name
SSN
bill
trIdS
trId
trId
unbounded
44Challenges
- DTD-conformance recursive, nondeterministic
- integrity constraints validation during
document generation - multi-source queries a single one involves
several databases - context-dependency not strictly top-down or
bottom-up - Previous work
- XML publishing single data source, no
constraints, top-down. - XML integration little support for XML
schema-conformance. - XML query languages type checking provide no
guidance for how to ensure schema-conformance
optimization hard.
Schema-directed XML integration is nontrivial!
45Middleware for schema-directed integration
view definition language
optimization techniques
- A lightweight language Attribute Integration
Grammar (AIG) - Cost-based optimization in light of
context-dependency
DTD constraints
semantic attributes
semantic rules
46Attribute Integration Grammar (AIG)
- DTD element type definitions e ? ?
- ? PCDATA ? e1, , en
e1 en e - Attributes associated with each element type e
- Inh(e) inherited from parent/siblings
(top-down/sideways) - Syn(e) synthesized from children (bottom-up)
- Syn(e), Inh(e) tuple or set/bag-valued
- Rules associated with each production e ? ?, for
e in ? - Inh(child) Inh(e) Q(Inh(parent),
Syn(sibling) ) - Syn(parent) Syn(e) U (Syn(children))
-- union - Q multi-source SQL query with parameters
- Dependency e2 must be evaluated before e1 if
- Inh(e1) Q( Syn(e2) ) -- acyclic graph
(DAG)
47AIG semantics conceptual evaluation
- following the dependency ordering starting from
the root - report ? patient
- Inh(patient) ? Q1 (Inh(report))
- select Inh(report) as date, p.SSN,
p.name, p.policy - from DB1 patient p, DB1
visitInfo v - where p.SSN v.SSN and v.date
Inh(report) - Recall DB1 patient (SSN, name, policy),
visitInfo (SSN, trId, date) - Parameter in a query Inh(report) as a constant
- Data driven the number of patients depends on Q1
report
Inh
patient
patient
patient
48Multi-source query
- patient ? SSN, name, treatments, bill
- Inh(SSN) Inh(patient).SSN, . . . ,
- Inh(treatments) Q2(v) -- v
Inh(patient) - select t.trId, t.tname
- from DB1 visitInfo i, DB3 cover
c, DB4 treatment t - where i.SSN v.SSN and i.date
v.date and t.trId i.trId - and c.trId i.trId and c.policy
v.policy - a single query uses DB1, DB3 and DB4
- tuple- and set-valued attributes (Inh(SSN),
Inh(treatments))
- Recall DB1 patient(SSN, name, policy),
visitInfo(SSN, trId, date) - DB3 cover (policy, trId)
- DB4 treatment (trId, name),
procedure (trId1, trId2)
49Initial top-down pass context-dependent
- patient ? SSN, name, treatments, bill
- Inh(SSN) Inh(patient).SSN, Inh(name)
Inh(patient).name, - Inh(treatments) Q2(Inh(patient))
- Inh(bill) Syn(treatments) lt--
halt - DTD-directed generate children following the
production - Inh(bill) defined with sibling --
Syn(treatments), dependency ordering evaluate
bill after treatments
lt- - halt
50Initial top-down pass recursion
- treatments ? treatment
- Inh(treatment) ? Inh(treatments) -- set of
(trId, tname) - Data driven treatments expansion depends on
Inh(treatments) - empty expansion terminates, Syn(treatments) is
empty - nonempty expands.
51Leaf step
- treatment ? trId, tname, treatments
- Inh(trId) Inh(treatment).trId, . . .,
-
- trId ? PCDATA
Syn(trId) Inh(trId)
treatments
Inh (trId, tname)
treatment
treatment
treatments
tname
trId
Inh
52Bottom-up step synthesize attributes
- treatment ? trId, tname, treatments
- treatments ? treatment
- Syn(treatments) U Syn(treatment)
- Processing of an element e Inh(e) ? subtree(e) ?
Syn(e)
Syn(treatment) Syn(trId) ? Syn(treatments)
53Sideways step controlled derivation
- patient ? SSN, name, treatments, bill
- Inh(bill) Syn(treatments)
- bill ? item
- Inh(item) ? Q(Inh(bill) )
- select trId, price
- from DB2 billing
- where trId in Inh(bill) -- set
membership test - DTD-directed each step of construction follows a
production
Recall DB2 billing (trId, price)
54Constraint compilation
- Captured with rules on synthesized attributes of
patient - trIdB bag-valued, collecting trIds under item
- trIdS1 set-valued, collecting trIds under
treatment - trIdS2 set-valued, collecting trIds under item
- key patient ( item.trId ? item )
- unique (Syn(patient).trIdB) -- no
duplicates in the bag - foreign key patient ( treatment.trId ?
item.trId ) - subset ( Syn(patient).trIdS1,
Syn(patient).trIdS2) - compilation semantic rules and attributes for
constraints are automatically generated and
evaluated
55Advantages of AIG
- DTD-directed view definition automatically
ensures conformance to DTD -- recursive,
nondeterministic - Constraint compilation automatically captures
integrity constraints in a uniform framework - performance avoid post-materialization checking
- optimization jointly with query evaluation
- exception handling actions when constraints are
violated - Controlled derivation supports context-dependent
generation. - Information passing top-down, bottom-up,
sideways - Multi-source queries optimizer-based
decomposition - One sweep each node is visited at most twice
evaluates its inherited attribute, subtree, then
its synthesized attribute
56Middleware evaluation of AIGs
optimizer
AIG
XML
merging
query plan execution
pre-processing
tagging
scheduling
data
query
DB3
cost statistics
DB2
- pre-processing
- constraint compilation
- multi-source query decomposition ? single-source
queries - optimizer query plan generation using cost
statistics - execution SQL queries ? data sources, results
? mediator - tagging relational tables (paths from root) ?
XML view via merge-sorting
57Optimization
- Goal reduce response time
- costs query execution, data transfer, storage
(caching) - constraints query dependency graph (DAG)
- nodes queries computing inherited/synthesized
attributes - edges dependency relation (producer-consumer
relation) - recursion iterative unfolding by a certain depth
- Optimization techniques
- query merging
- query scheduling
58Query scheduling
- Goal reduce the total response time by
increasing parallelism - Ordering execution of queries on the same site
- e.g., ltQ4, Q5, Q3gt vs. ltQ3, Q5, Q4gt on DB2
- Constraints
- costs execution, communication, caching
overheads - dependency relation
- Finding an optimal schedule NP-hard
59Query merging
- Goal reduce DB visits, leverage DB optimizer
- Composition of queries on the same site
outer-join/union - Tradeoffs
- large result tables with null
communication/caching cost - impact on scheduling -- changing query dependency
graph
Finding an optimal strategy for merging and
scheduling NP-hard
60Cost-based heuristic scheduling
- cost estimate given a fixed schedule, estimate
the completion time of a query Q, comp_time(Q) - statistics eval_cost(Q), size(Q), trans_cost(S,
S, size) - dependency Q cant start before comp_time(Q) if
Q -gt Q - scheduling given a fixed query dependency graph,
a heuristic based on dynamic programming to - find the most costly trailing path of each
query - sort queries to favor critical paths
61Cost-based heuristic merging
- greedy algorithm repeat until no further
improvement - merge each pair of queries on the same source
- modify the query dependency graph accordingly
- invoke scheduling w.r.t. the modified dependency
- estimate the cost
- pick the pair with the biggest improvement to
merge
Interaction between query scheduling and merging
62AIG Summary
- AIG a novel specification language
- ensures DTD-conformance (recursion/nondeterminism)
- captures integrity constraints in a uniform
framework - supports complex transformations controlled
derivation (context-dependency), multi-source
queries, . . . - Optimization techniques nontrivial optimization
problems - constraint compilation, multi-source query
decomposition - query scheduling w.r.t. query dependency graph
- query merging and its interaction with scheduling
63More on Commercial System MS SQL Server 2005
- Annotated schema (XSD) fixed tree templates
- nonrecursive schema
- associate elements and attributes with table and
column names - Given a relational database, XSD populates an XML
elements/attributes with corresponding
tuples/columns - FOR-XML
- An extension of SQL with an FOR-XML construct
- Nested FOR-XML to construct XML documents
- Summary
- incapable of supporting schema-directed
publishing - cant define recursive XML views (bounded
recursion depth)
64Commercial System IBM DB2 XML Extender
- User-defined mapping through DAD (Document Access
Definition) a fixed XML tree template
(nonrecusive) - SQL mapping a single SQL query, constructing XML
trees of depth bounded by the arity of the tuples
returned and group-by - RDB node mapping a fixed tree template with
nodes annotated with conjunctive queries - SQL/XML an extension of SQL with XML
constructors (XMLAGG, XMLELEMENT, etc) as
discussed earlier - Summary
- incapable of supporting schema-directed
publishing - cant define recursive XML views
65Commercial System Oracle 10g XML DB
- SQL/XML
- DBMS_XMLGEN, a PL/SQL package
- Supports recursive XML view definition (via
linear recursion of SQL99) - does not support schema-directed XML publishing
66Summary and review
- Why publish XML data? What is the major
difficulty? - One can publish relational data via an XQuery
view. How to do it? What are the pros and cons of
this approach? - What is schema-directed publishing? Why do we
need it? - Why does ATG automatically ensure
DTD-conformance? - How to ensure both DTD conformance and constraint
satisfaction? - Homework
- Consider projects related to ATG
- How to answer queries over XML views? Read papers
on query composition (SilkRoute, XPERANTO).