XML Publishing

About This Presentation

Title:

XML Publishing

Description:

Given two schemas S1 and S2, we want to translate instances of ... one gets an XML view that typechecks only after repeated failures and with luck. QSX (LN 5) ... – PowerPoint PPT presentation

Number of Views:80

Avg rating:3.0/5.0

Slides: 67

Provided by: SchoolofI9

Category:

more less

Transcript and Presenter's Notes

Title: XML Publishing

1
XML Publishing

XML publishing overview
SilkRoute (ATT)
XPERANTO (IBM Research)
Clio (IBM Research and U. Toronto)
Schema-directed XML Publishing
ATG (Bell Labs and U. Edinburgh)
PRATA implementation of ATG
Extensions of ATG

2
XML -- its an exchange format
Data base
Data base
HTML
Data base!

Most data is stored in pre-existing databases
Need to provide XML wrappers to export data

3
XML publishing

An XML view definition language specifying
desired mapping
Efficient implementation of the view language

Q XML view
4
From relations to XML Views
ltActorgt ltLnamegtViterellilt/Lnamegt
ltFnamegtJoelt/Fnamegt ltMovie year1999gt Analyze
This lt/Moviegt ltMovie year2001gt See Spot
Run lt/Moviegt lt/Actorgt ltActorgt
ltLnamegtWinterlt/Lnamegt ltFnamegtAlexlt/Fnamegt
ltMovie year1988gt Bill and Teds Excellent
Adventure lt/Moviegt
Actor(aid, lname, fname) lt001, Viterelli,
Joegt, Movie(mid, title, year) lt011,
Analyze This, 1999gt, lt032, See Spot Run,
2001gt Appearance(mid, aid) lt001, 011gt,
lt001, 032gt
5
Commercial systems -- canonical publishing

Canonical publishing the universal-relation
approach
Embedding single SQL query in XSL stylesheet
Result canonical XML representation of
relations
Systems
Oracle 10g XML SQL facilities SQL/XML, XMLGen
IBM DB2 XML Extender SQL/XML, DAD
Microsoft SQL Server 2005 FOR-XML, XSD
incapable of expressing practical XML publishing
default fixed XML document template

6
Canonical publishing
A.lname A.fname M.year M.title Viterelli
Joe 1999 See Spot Run Viterelli Joe
2001 Analyze This
IBM DB2 SQL Statement giving big relation
ltSQL_stmtgt SELECT A.lname,A.fname, M.year,
M.title FROM Movie M, Actor A, Appearance
Ap WHERE M.midAp.mid AND M.aidA.Aid ORDER BY
aid lt/SQL_stmtgt
ltActorgt ltLnamegtViterellilt/Lnamegt
ltFnamegtJoelt/Fnamegt ltMovie year1999gt Analyze
This lt/Moviegt ltMovie year2001gt See Spot
Run lt/Moviegt lt/Actorgt ltActorgt
ltLnamegtWinterlt/Lnamegt ltFnamegtAlexlt/Fnamegt
Formatting Template annotated with columns of
the universal relation
ltelement_node Actorgt ltelement_node Lnamegt
lttext_nodegt ltColumn nameA.lname/gt
lt/text_nodegt
7
Middleware Approach
View Definition
Query-cost Estimates Source Capabilities
Query Generator
Request
SQL Queries
Tagger
8
XPERANTO

Commercial system IBM DB2 XML extender,
SQL/XML
Middleware (vendor-independent) XPERANTO
Extending SQL with XML constructors
select XML-aggregation
from R1, . . ., Rn
where conditions
XML constructors (XML-aggregation) functions
Input tables and XML trees (forest)
Output XML tree

9
XML publishing with XPERANTO (SQL/XML)

Relational schema
Extended SQL
select XMLAGG( ACT(lname, fname,
select XMLAGG ( MOV(title, year)
from Appearance Ap, Movie M
where Ap.aid A.aid and Ap.mid
M.mid
group order by A.lname, A. fname
))
from Actor A

ltActorgt ltLnamegtViterellilt/Lnamegt
ltFnamegtJoelt/Fnamegt ltMovie year1999gt . . .
lt/Moviegt ltMovie year2001gt . . .
lt/Moviegt lt/Actorgt ltActorgt ltLnamegtWinterlt/Lnamegt
ltFnamegtAlexlt/Fnamegt
10
XML constructors

Actor constructor
create function ACT(lname str, fname str,
mlist XML)
ltActorgt
ltLnamegt lname lt/Lnamegt
ltFnamegt fname lt/Fnamegt
mlist
lt/Actorgt
Movie constructor (mlist)
create function Mov(title str, year int)
ltMovie yearyeargt title lt/Moviegt
Verbose and cumbersome
small document tedious
large documents unthinkable

11
SilkRoute

Annotated template embedding SQL in a fixed XML
tree
Middleware SilkRoute
Commercial SQL Server 2005 XSD, IBM DB2 DAD
Advantages
More modular comparing to the universal
relation approach
Limited schema-driven conforming to a fixed doc
template

12
Clio

Schema mapping based on inter-schema constraints
Given two schemas S1 and S2, we want to translate
instances of S1 to instances of S2.
Constraint-driven approach
Start with a set of referential integrity
constraints (foreign keys) from S1 to S2 (schema
matching)
Derive schema mapping from S1 to S2, by reasoning
about the given integrity constraints (chasing
technique)
Pros and cons
Generic system, and can be used for XML
publishing
Semi-automated
Chasing may not terminates for XML DTD/Schema
Recursive target schema (S2)? unclear

13
Getting real data exchange on the Web

All members of a community (industry) agree on a
DTD and then exchange data w.r.t. it
e-commerce, health-care, ...
XML Publishing
mapping relational data to XML
conforming to the predefined DTD

Web
DTD
XML
XML
Q XML view
DB1
DB2
14
Data exchange insurance company and hospital

Daily report
Relational database R at the hospital
Patient (SSN, name, tname, policy, date)
inTreatment (tname, cost)
outTreatment (tname, referral)
Procedure (tname1, tname2)
treatment
in hospital composition hierarchy in Procedure
outside of the hospital referral

hospital
insurance company
XML view
R
XML
15
Example insurance company and hospital

DTD D predefined by the insurance company
report ? patient
patient ? SSN, pname,
treatment, policy
treatment ? tname, (inTreatment
outTreatment)
inTreatment ? treatment
outTreatment ? referral
How to define a mapping ? such that for any
instance DB of R,
? (DB) is an XML document containing all the
patients and their treatments (hierarchy,
referral) from DB, and
? (DB) conforms to D?

16
Challenge recursive type

XML data unbounded depth -- cannot be decided
statically
treatment ? tname, (inTreatment
outTreatment)
inTreatment ? treatment ---
recursive

SSN
123
17
Challenge non-determinism

The choice of a production (element type
definition)
treatment ? tname, (inTreatment outTreatment)
-- depends on the underlining relational data

report
18
Existing systems

fixed XML tree template or ignoring
DTD-conformance
middleware SilkRoute (ATT), XPERANTO (IBM),
systems SQL Server 2005, IBM DB2 XML extender,
incapable of coping with a predefined DTD (e.g.
recursion)
type checking define a view and then check its
conformance
undecidable in general, co-NEXPTIME for extremely
restricted view definitions
no guidance on how to define XML views that
typecheck
one gets an XML view that typechecks only after
repeated failures and with luck

19
XML Publishing

XML publishing overview
SilkRoute (ATT)
XPERANTO (IBM Research)
Clio (IBM Research and U. Toronto)
Schema-directed XML Publishing
ATG (Bell Labs and U. Edinburgh)
PRATA implementation of ATG
Extensions of ATG

20
Attribute Translation Grammar (ATG)

DTD normalized element type definitions e ? ?
? PCDATA ? e1, , en e1
en e
Attributes e associated with each element type
e
e tuple-valued, to pass data value as well as
control
Rules associated with each e ? ? for e in ?,
e Q(e)
SQL query Q extracts data from DB
parent attribute e as a constant parameter in Q

21
Semantics conceptual evaluation

Top-down
report ? patient
patient ? select SSN, name, tname,
policy
from Patient --- SQL
query
recall Patient (SSN, name, tname, policy)
Data-driven a patient element for each tuple in
Patient relation

patient
patient
patient
patient
22
Inherited attributes

Inherited child is computed using parent
patient ? SSN, name, treatment, policy
SSN patient.SSN, name
patient.name
treatment patient.tname policy
patient.policy
recall patient (SSN, name, tname, policy)
SSN ? PCDATA
PCDATA SSN

report
...
patient
patient
patient
patient
patient
treatment
name
policy
SSN
SSN
123
Joe
LU23
PCDATA
23
Coping with non-determinism

treatment ? tname, (inTreatment
outTreatment)
tname treatment
(inTreatment, outTreatment)
case Qc(treatment).tag
--- conditional query
1 (treatment, null)
else (null, treatment)
Qc select 1 as tag from inTreatment where
tname treatment
conditional query the choice of production
parent as constant parameter in SQL query

...
treatment
treatment
tname
tname
inTreatment
outTreatment
24
Coping with recursion

inTreatment ? treatment
treatment ? select tname2
from Procedure
where inTreatment tname1
recall Procedure (tname1, tname2)
parent as constant parameter in SQL query Q
inTreatment is further expanded as long as Q(DB)
is nonempty

treatment
treatment
25
DTD-directed publishing with ATGs

DTD-directed the XML tree is constructed
strictly following the productions of a DTD ---
DTD conformance
Data-driven the choice of productions and
expansion of the XML tree (recursion) depends on
relational data

report
...
patient
patient
patient
patient
treatment
name
SSN
inTreatment
...
tname
Joe
123
treatment
treatment
...
...
26
ATGs vs. existing systems

DTD-conformance
ATGs provide guidance for how to define
DTD-directed publishing
Other systems based on a fixed tree template
Expressive power strictly more expressive than
others
ATGs capable of expressing XML views supported
by other systems
Other systems cannot handle recursion/nondetermin
ism

27
ATGs vs. Attribute Grammars (AGs)

AGs
Definition w.r.t. a CFG
Evaluation parse a string with the CFG, then
evaluate attributes given the parse tree
ATGs combining DTD and database operations
Definition w.r.t. an ECFG (DTD) and SQL queries
Evaluation given DB, extract relevant data from
DB with SQL queries to build an XML tree of the
DTD
It does not make sense to parse a database
w.r.t. a DTD
ATGs are not a mild variation of AGs

28
XML Publishing

XML publishing overview
SilkRoute (ATT)
XPERANTO (IBM Research)
Clio (IBM Research and U. Toronto)
Schema-directed XML Publishing
ATG (Bell Labs and U. Edinburgh)
PRATA implementation of ATG
Possible extensions of ATG

29
PRATA middleware based on ATGs
query plan generation, evaluation
relations
ATG graph
XML
ATG
tagging
parsing
cost estimate query
statistics query results

ATG graph representing the ATG
relations representing root-leaf paths ---gt
tagging one pass

30
Evaluation of ATGs

Conceptual evaluation defining semantics, but
not efficient
Techniques proved useful
partitioning reduce db visits, make use of DBMS
optimizer (handling annotated templates
complete tree)
Relations representing XML trees root-leaf paths
Separate tagging

31
ATG graph

edge labels SQL queries
cyclic introduced by recursion -- not in
other systems
iterative unfolding partial ATG tree to a
depth d

report
Q1

patient
Q5
Q2
policy
Q3
Q4
Q8
name
SSN
treatment
Q7
Qc
tname
outTreatment
inTreatment
Q9
Q6
referral
procedure

32
Partitioning partial ATG trees

query composition (outer join)
challenge how to partition into clusters?
finer ? fewer nulls, smaller queries sent to
DBMS
coarser ? leverage DBMS optimizer, less DB visits
Finding an optimal strategy NP-hard

report
composed Q
Q1

patient
Q2
Q5
Q3
SSN
policy
Q4
name
composed Q
treatment
Qc
Q7
outTreatment
tname
inTreatment
Q9
Q6
procedure
referral
...

Q8
treatment
33
Materialization of intermediate results

Partitioning root-leaf relation R ?
intermediate R1, R2, R3, R4
repeated computation each Ri carrying a key from
the root
materialization storing keys in temporary
tables
cons overhead with temporary tables
pros reducing recomputation in descendant
queries
What keys to materialize? Not supported by other
systems

output relation R
R1
sort merge
key
R2
R3
sort merge
key
R4
34
Query plan generation

New challenge find best partitioning and
materialization
mutually dependent partitioning and
materialization must be taken together
expensive exponentially many choices NP-hard
Efficient cost-based heuristic
DBMS statistics arity/cardinality of a table,
cost to evaluate a query/create temporary table
cost estimate simple formulae for computing the
benefit of materialization/partitioning --
effective in practice
optimal strategy -- dynamic programming
decides partitioning w.r.t. the benefit of
materialization

35
Putting these together

ATGs
ensure DTD-conformance automatically
support recursion/non-determinism naturally
simple no need for a new language knowing DTD
SQL is all that one needs for writing an ATG
PRATA middleware based on ATGs
novel technique of combining partitioning and
materialization
a practical solution for DTD-directed publishing
ATG the first systematic way for DTD-directed
publishing

36
XML Publishing

XML publishing overview
SilkRoute (ATT)
XPERANTO (IBM Research)
Clio (IBM Research and U. Toronto)
Schema-directed XML Publishing
ATG (Bell Labs and U. Edinburgh)
PRATA implementation of ATG
Extensions of ATG

37
Extension Capturing integrity constraints

XML schema
Type (DTD)
Integrity constraints keys, foreign keys
Schema-directed XML publishing automatically
guarantee that the target document both conforms
to the type and satisfies the constraints -- in
a single framework
Challenge consistency analyses undecidable

DTD
XML ATG
DB
38
Data integration in XML

multiple, heterogeneous data sources
multi-source queries, query decomposition, object
fusion,
distributed sources scheduling of query
execution
schema-conformance,

query
answer
schema
(D, ?)
XML
query translation
integration
Integration middleware
updates
DB4
DB1
DB2
DB3
39
Example hospital and insurance company

patient (SSN, name, policy)
visitInfo (SSN, trId, date)

billing (trId, price)
cover (policy, trId)
daily XML report
treatment (trId, name) procedure (trId1, trId2)

Given a date, for each patient of that day,
report
SSN, name (DB1)
treatments (hierarchy) covered by insurance
(DB1, DB3, DB4)
cost of all and only those treatments received
(DB2)

40
Predefined schema (D, ?) DTD

report ? patient
patient ? SSN, name,
treatments, bill
treatments ? treatment
treatment ? trId, tname,
treatments
bill ? item
item ? trId, price

report
date
patient
patient
patient
treatments
bill
name
SSN
item
treatment
item
treatment
. . .
price
trId
trId
tname
treatments
41
Predefined schema (D, ?) XML constraints ?

constraints relative to each patient,
key each treatment is charged only once
patient ( item.trId ? item )
foreign key every treatment has a billing record
patient ( treatment.trId ? item.trId )

report
date
patient
patient
patient
name
treatments
SSN
bill
treatment
item
item
treatment
. . .
price
trId
treatments
42
More on XML integrity constraints

absolute on the entire document
key country.name ? country
relative on a subdocument rooted at a country
key country (province.name ?
province)
foreign key country (capital.inProvince ?
province.name)
(value inclusion)

43
Challenge context dependency

bill subtree all and only the trIds in the
treatments subtree
controlled derivation the bill subtree cannot be
started before the treatments subtree is
completed.
information passing downward, upward, sideways

date
report
patient
date
SSN
treatments
name
SSN
bill
trIdS
trId
trId
unbounded
44
Challenges

DTD-conformance recursive, nondeterministic
integrity constraints validation during
document generation
multi-source queries a single one involves
several databases
context-dependency not strictly top-down or
bottom-up
Previous work
XML publishing single data source, no
constraints, top-down.
XML integration little support for XML
schema-conformance.
XML query languages type checking provide no
guidance for how to ensure schema-conformance
optimization hard.

Schema-directed XML integration is nontrivial!
45
Middleware for schema-directed integration

view definition language
optimization techniques

A lightweight language Attribute Integration
Grammar (AIG)
Cost-based optimization in light of
context-dependency

DTD constraints
semantic attributes

semantic rules
46
Attribute Integration Grammar (AIG)

DTD element type definitions e ? ?
? PCDATA ? e1, , en
e1 en e
Attributes associated with each element type e
Inh(e) inherited from parent/siblings
(top-down/sideways)
Syn(e) synthesized from children (bottom-up)
Syn(e), Inh(e) tuple or set/bag-valued
Rules associated with each production e ? ?, for
e in ?
Inh(child) Inh(e) Q(Inh(parent),
Syn(sibling) )
Syn(parent) Syn(e) U (Syn(children))
-- union
Q multi-source SQL query with parameters
Dependency e2 must be evaluated before e1 if
Inh(e1) Q( Syn(e2) ) -- acyclic graph
(DAG)

47
AIG semantics conceptual evaluation

following the dependency ordering starting from
the root
report ? patient
Inh(patient) ? Q1 (Inh(report))
select Inh(report) as date, p.SSN,
p.name, p.policy
from DB1 patient p, DB1
visitInfo v
where p.SSN v.SSN and v.date
Inh(report)
Recall DB1 patient (SSN, name, policy),
visitInfo (SSN, trId, date)
Parameter in a query Inh(report) as a constant
Data driven the number of patients depends on Q1

report
Inh
patient
patient
patient
48
Multi-source query

patient ? SSN, name, treatments, bill
Inh(SSN) Inh(patient).SSN, . . . ,
Inh(treatments) Q2(v) -- v
Inh(patient)
select t.trId, t.tname
from DB1 visitInfo i, DB3 cover
c, DB4 treatment t
where i.SSN v.SSN and i.date
v.date and t.trId i.trId
and c.trId i.trId and c.policy
v.policy
a single query uses DB1, DB3 and DB4
tuple- and set-valued attributes (Inh(SSN),
Inh(treatments))

Recall DB1 patient(SSN, name, policy),
visitInfo(SSN, trId, date)
DB3 cover (policy, trId)
DB4 treatment (trId, name),
procedure (trId1, trId2)

49
Initial top-down pass context-dependent

patient ? SSN, name, treatments, bill
Inh(SSN) Inh(patient).SSN, Inh(name)
Inh(patient).name,
Inh(treatments) Q2(Inh(patient))
Inh(bill) Syn(treatments) lt--
halt
DTD-directed generate children following the
production
Inh(bill) defined with sibling --
Syn(treatments), dependency ordering evaluate
bill after treatments

lt- - halt
50
Initial top-down pass recursion

treatments ? treatment
Inh(treatment) ? Inh(treatments) -- set of
(trId, tname)
Data driven treatments expansion depends on
Inh(treatments)
empty expansion terminates, Syn(treatments) is
empty
nonempty expands.

51
Leaf step

treatment ? trId, tname, treatments
Inh(trId) Inh(treatment).trId, . . .,
trId ? PCDATA

Syn(trId) Inh(trId)
treatments
Inh (trId, tname)
treatment
treatment
treatments
tname
trId
Inh
52
Bottom-up step synthesize attributes

treatment ? trId, tname, treatments
treatments ? treatment
Syn(treatments) U Syn(treatment)
Processing of an element e Inh(e) ? subtree(e) ?
Syn(e)

Syn(treatment) Syn(trId) ? Syn(treatments)
53
Sideways step controlled derivation

patient ? SSN, name, treatments, bill
Inh(bill) Syn(treatments)
bill ? item
Inh(item) ? Q(Inh(bill) )
select trId, price
from DB2 billing
where trId in Inh(bill) -- set
membership test
DTD-directed each step of construction follows a
production

Recall DB2 billing (trId, price)
54
Constraint compilation

Captured with rules on synthesized attributes of
patient
trIdB bag-valued, collecting trIds under item
trIdS1 set-valued, collecting trIds under
treatment
trIdS2 set-valued, collecting trIds under item
key patient ( item.trId ? item )
unique (Syn(patient).trIdB) -- no
duplicates in the bag
foreign key patient ( treatment.trId ?
item.trId )
subset ( Syn(patient).trIdS1,
Syn(patient).trIdS2)
compilation semantic rules and attributes for
constraints are automatically generated and
evaluated

55
Advantages of AIG

DTD-directed view definition automatically
ensures conformance to DTD -- recursive,
nondeterministic
Constraint compilation automatically captures
integrity constraints in a uniform framework
performance avoid post-materialization checking
optimization jointly with query evaluation
exception handling actions when constraints are
violated
Controlled derivation supports context-dependent
generation.
Information passing top-down, bottom-up,
sideways
Multi-source queries optimizer-based
decomposition
One sweep each node is visited at most twice
evaluates its inherited attribute, subtree, then
its synthesized attribute

56
Middleware evaluation of AIGs
optimizer
AIG
XML
merging
query plan execution
pre-processing
tagging
scheduling
data
query
DB3
cost statistics
DB2

pre-processing
constraint compilation
multi-source query decomposition ? single-source
queries
optimizer query plan generation using cost
statistics
execution SQL queries ? data sources, results
? mediator
tagging relational tables (paths from root) ?
XML view via merge-sorting

57
Optimization

Goal reduce response time
costs query execution, data transfer, storage
(caching)
constraints query dependency graph (DAG)
nodes queries computing inherited/synthesized
attributes
edges dependency relation (producer-consumer
relation)
recursion iterative unfolding by a certain depth
Optimization techniques
query merging
query scheduling

58
Query scheduling

Goal reduce the total response time by
increasing parallelism
Ordering execution of queries on the same site
e.g., ltQ4, Q5, Q3gt vs. ltQ3, Q5, Q4gt on DB2
Constraints
costs execution, communication, caching
overheads
dependency relation
Finding an optimal schedule NP-hard

59
Query merging

Goal reduce DB visits, leverage DB optimizer
Composition of queries on the same site
outer-join/union
Tradeoffs
large result tables with null
communication/caching cost
impact on scheduling -- changing query dependency
graph

Finding an optimal strategy for merging and
scheduling NP-hard
60
Cost-based heuristic scheduling

cost estimate given a fixed schedule, estimate
the completion time of a query Q, comp_time(Q)
statistics eval_cost(Q), size(Q), trans_cost(S,
S, size)
dependency Q cant start before comp_time(Q) if
Q -gt Q
scheduling given a fixed query dependency graph,
a heuristic based on dynamic programming to
find the most costly trailing path of each
query
sort queries to favor critical paths

61
Cost-based heuristic merging

greedy algorithm repeat until no further
improvement
merge each pair of queries on the same source
modify the query dependency graph accordingly
invoke scheduling w.r.t. the modified dependency
estimate the cost
pick the pair with the biggest improvement to
merge

Interaction between query scheduling and merging
62
AIG Summary

AIG a novel specification language
ensures DTD-conformance (recursion/nondeterminism)
captures integrity constraints in a uniform
framework
supports complex transformations controlled
derivation (context-dependency), multi-source
queries, . . .
Optimization techniques nontrivial optimization
problems
constraint compilation, multi-source query
decomposition
query scheduling w.r.t. query dependency graph
query merging and its interaction with scheduling

63
More on Commercial System MS SQL Server 2005

Annotated schema (XSD) fixed tree templates
nonrecursive schema
associate elements and attributes with table and
column names
Given a relational database, XSD populates an XML
elements/attributes with corresponding
tuples/columns
FOR-XML
An extension of SQL with an FOR-XML construct
Nested FOR-XML to construct XML documents
Summary
incapable of supporting schema-directed
publishing
cant define recursive XML views (bounded
recursion depth)

64
Commercial System IBM DB2 XML Extender

User-defined mapping through DAD (Document Access
Definition) a fixed XML tree template
(nonrecusive)
SQL mapping a single SQL query, constructing XML
trees of depth bounded by the arity of the tuples
returned and group-by
RDB node mapping a fixed tree template with
nodes annotated with conjunctive queries
SQL/XML an extension of SQL with XML
constructors (XMLAGG, XMLELEMENT, etc) as
discussed earlier
Summary
incapable of supporting schema-directed
publishing
cant define recursive XML views

65
Commercial System Oracle 10g XML DB

SQL/XML
DBMS_XMLGEN, a PL/SQL package
Supports recursive XML view definition (via
linear recursion of SQL99)
does not support schema-directed XML publishing

66
Summary and review

Why publish XML data? What is the major
difficulty?
One can publish relational data via an XQuery
view. How to do it? What are the pros and cons of
this approach?
What is schema-directed publishing? Why do we
need it?
Why does ATG automatically ensure
DTD-conformance?
How to ensure both DTD conformance and constraint
satisfaction?
Homework
Consider projects related to ATG
How to answer queries over XML views? Read papers
on query composition (SilkRoute, XPERANTO).