Title: DTDDirected Publishing and Integration
1DTD-Directed Publishing and Integration
- Wenfei Fan
- Internet Research Dept. Bell Laboratories
- Dept. of CIS, Temple University
2Outline
- XML publishing
- DTD-directed publishing VLDB02
- Attribute Transformation Grammars (ATGs)
- PRATA middleware based on ATGs
- Schema-directed integration ongoing work
- Coping with both type and integrity constraints
- Attribute Integration Grammar (ATG)
- Research problems
- Joint work with
- Michael Benedikt, Chen Yong Chan, Rajeev
Rastogi, Bell Labs - Aoying Zhou, Shihui Zheng, Fudan University
3Outline
- XML publishing
- DTD-directed publishing VLDB02
- Attribute Transformation Grammars
- PRATA middleware based on ATGs
- Schema-directed integration ongoing work
- Coping with both type and integrity constraints
- Attribute Integration Grammar (ATG)
- Research problems
4The need for XML publishing
- Most data is stored in existing databases
- Web data is often managed by DBMS facilities
- Data needs to be exported to XML -- the prime
standard for data exchange
XML
XML
DB
DB
5Simple XML publishing
- Export relational data in a canonical format
- Patient
- ltpatientgt
- lttuplegt ltSSNgt 123 lt/gt ltnamegt Joe lt/gt
ltpolicygt LU23 lt/gt - lt/tuplegt
- lttuplegt ltSSNgt 345 lt/gt ltnamegt Mary lt/gt
ltpolicygt AT45 lt/gt - lt/tuplegt
- lt/patientgt
6A little harder insurance company and hospital
- Daily report
-
- Relational database R at the hospital
- Patient (SSN, name, tname, policy)
- treatment (SSN, tname)
- ltreportgt
- ltpatientgt ltSSNgt 123 lt/gt ltnamegt Joe lt/gt
ltpolicygt LU23 lt/gt - lttnamegt MRI lt/gt . . .
lttnamegt EEG lt/gt - lt/patientgt
- . . .
- lt/reportgt
hospital
insurance company
XML view
R
XML
grouping treatments under each patient
7XML publishing
- An XML view definition language specifying
desired mapping - Efficient implementation of the view language
Q XML view
8Existing XML publishing tools (1)
- Embedding single SQL query in XSL stylesheet
- Result canonical, flat XML
- Commercial system
- Oracle 9i XML SQL facilities
- DB2
- Incapable of expressing practical XML publishing
default, flat XML
9Existing XML publishing tools (2)
- Extending SQL with XML constructors
- select XML-aggregation
- from R1, . . ., Rn
- where conditions
- Tools
- Commercial system IBM DB2 XML extender
- Middleware (vendor-independent) XPERANTO
- Verbose and cumbersome
- small document tedious
- large documents unthinkable
10Existing XML publishing tools (3)
- Annotated template embedding SQL in a fixed XML
tree - Middleware SilkRoute
- Commercial SQL Server 2000, IBM DB2 DAD
- Advantages
- More modular comparing to the universal
relation approach - Limited schema-driven conforming to a fixed doc
template
11Getting real data exchange on the Web
- All members of a community (industry) agree on a
DTD and then exchange data w.r.t. it
e-commerce, health-care, ... - XML Publishing
- mapping relational data to XML
- conforming to the predefined DTD
Web
DTD
XML
XML
Q XML view
DB1
DB2
12Real-life example insurance company and hospital
- Daily report
-
- Relational database R at the hospital
- Patient (SSN, name, tname, policy, date)
- inTreatment (tname, cost)
- outTreatment (tname, referral)
- Procedure (tname1, tname2)
- treatment
- in hospital composition hierarchy in Procedure
- outside of the hospital referral
hospital
insurance company
XML view
R
XML
13Example insurance company and hospital
- DTD D predefined by the insurance company
- report ? patient
- patient ? SSN, pname,
treatment, policy - treatment ? tname, (inTreatment
outTreatment) - inTreatment ? treatment
- outTreatment ? referral
- How to define a mapping ? such that for any
instance DB of R, - ? (DB) is an XML document containing all the
patients and their treatments (hierarchy,
referral) from DB, - ? (DB) conforms to D?
14Challenge recursive type
- XML data unbounded depth -- cannot be decided
statically - treatment ? tname, (inTreatment
outTreatment) - inTreatment ? treatment ---
recursive
SSN
123
15Challenge non-determinism
- The choice of a production (element type
definition) - treatment ? tname, (inTreatment outTreatment)
- -- depends on the underlining relational data
report
16Existing systems
- fixed XML tree template or ignoring
DTD-conformance - middleware SilkRoute (ATT), XPERANTO (IBM),
- systems SQL Server 2000, IBM DB2 XML extender,
- incapable of coping with a predefined DTD (e.g.
recursion) - type checking define a view and then check its
conformance - undecidable in general, co-NEXPTIME for extremely
restricted view definitions - no guidance on how to define XML views that
typecheck - one gets an XML view that typechecks only after
repeated failures and with luck
17Outline
- XML publishing
- DTD-directed publishing VLDB02
- Attribute Transformation Grammars
- PRATA middleware based on ATGs
- Schema-directed integration ongoing work
- Coping with both type and integrity constraints
- Attribute Integration Grammar (ATG)
- Research problems
18Attribute Translation Grammar (ATG)
- DTD normalized element type definitions e ? ?
- ? PCDATA ? e1, , en e1
en e - Attributes e associated with each element type
e - e tuple-valued, to pass data value as well as
control - Rules associated with each e ? ? for e in ?,
e Q(e) - SQL query Q extracts data from DB
- parent attribute e as a constant parameter in Q
19Semantics conceptual evaluation
- Top-down
- report ? patient
- patient ? select SSN, name, tname,
policy - from Patient --- SQL
query - recall Patient (SSN, name, tname, policy)
- Data-driven a patient element for each tuple in
Patient relation
patient
patient
patient
patient
20Inherited attributes
- Inherited child is computed using parent
- patient ? SSN, name, treatment, policy
- SSN patient.SSN, name
patient.name - treatment patient.tname policy
patient.policy - recall patient (SSN, name, tname, policy)
- SSN ? PCDATA
- PCDATA SSN
report
...
patient
patient
patient
patient
patient
treatment
name
policy
SSN
SSN
123
Joe
LU23
PCDATA
21Coping with non-determinism
- treatment ? tname, (inTreatment
outTreatment) - tname treatment
- (inTreatment, outTreatment)
- case Qc(treatment).tag
--- conditional query - 1 (treatment, null)
- else (null, treatment)
- Qc select 1 as tag from inTreatment where
tname treatment - conditional query the choice of production
- parent as constant parameter in SQL query
...
treatment
treatment
tname
tname
inTreatment
outTreatment
22Coping with recursion
- inTreatment ? treatment
- treatment ? select tname2
- from Procedure
- where inTreatment tname1
- recall Procedure (tname1, tname2)
- parent as constant parameter in SQL query Q
- inTreatment is further expanded as long as Q(DB)
is nonempty
treatment
treatment
23DTD-directed publishing with ATGs
- DTD-directed the XML tree is constructed
strictly following the productions of a DTD ---
DTD conformance - Data-driven the choice of productions and
expansion of the XML tree (recursion) depends on
relational data
report
...
patient
patient
patient
patient
treatment
name
SSN
inTreatment
...
tname
Joe
123
treatment
treatment
...
...
24ATGs vs. existing systems
- DTD-conformance
- ATGs provide guidance for how to define
DTD-directed publishing - Other systems based on a fixed tree template
- Expressive power strictly more expressive than
others - ATGs capable of expressing XML views supported
by other systems - Other systems cannot handle recursion/nondetermin
ism - ATG the first systematic method to ensure
DTD-conformance
25ATGs vs. Attribute Grammars (AGs)
- AGs
- Definition w.r.t. a CFG
- Evaluation parse a string with the CFG, then
evaluate attributes given the parse tree - ATGs combining DTD and database operations
- Definition w.r.t. an ECFG (DTD) and SQL queries
- Evaluation given DB, extract relevant data from
DB with SQL queries to build an XML tree of the
DTD - It does not make sense to parse a database
w.r.t. a DTD - ATGs are not a mild variation of AGs
26Outline
- XML publishing
- DTD-directed publishing VLDB02
- Attribute Transformation Grammars
- PRATA middleware based on ATGs
- Schema-directed integration ongoing work
- Coping with both type and integrity constraints
- Attribute Integration Grammar (ATG)
- Research problems
27PRATA middleware based on ATGs
query plan generation, evaluation
relations
ATG graph
XML
ATG
tagging
parsing
cost estimate query
statistics query results
- ATG graph representing the ATG
- relations representing root-leaf paths ---gt
- tagging one pass
28Evaluation of ATGs
- Conceptual evaluation defining semantics, but
not efficient - Techniques proved useful
- partitioning reduce db visits, make use of DBMS
optimizer (handling annotated templates
complete tree) - Relations representing XML trees root-leaf paths
- Separate tagging
29ATG graph
- edge labels SQL queries
- cyclic introduced by recursion -- not in
other systems - iterative unfolding partial ATG tree to a
depth d
report
Q1
patient
Q5
Q2
policy
Q3
Q4
Q8
name
SSN
treatment
Q7
Qc
tname
outTreatment
inTreatment
Q9
Q6
referral
procedure
30Partitioning partial ATG trees
- query composition (outer join)
- challenge how to partition into clusters?
- finer ? fewer nulls, smaller queries sent to
DBMS - coarser ? leverage DBMS optimizer, less DB visits
- Finding an optimal strategy NP-hard
report
composed Q
Q1
patient
Q2
Q5
Q3
SSN
policy
Q4
name
composed Q
treatment
Qc
Q7
outTreatment
tname
inTreatment
Q9
Q6
procedure
referral
...
Q8
treatment
31Materialization of intermediate results
- Partitioning root-leaf relation R ?
intermediate R1, R2, R3, R4 - repeated computation each Ri carrying a key from
the root - materialization storing keys in temporary
tables - cons overhead with temporary tables
- pros reducing recomputation in descendant
queries - What keys to materialize? Not supported by other
systems
output relation R
R1
sort merge
key
R2
R3
sort merge
key
R4
32Query plan generation
- New challenge find best partitioning and
materialization - mutually dependent partitioning and
materialization must be taken together - expensive exponentially many choices NP-hard
- Efficient cost-based heuristic
- DBMS statistics arity/cardinality of a table,
cost to evaluate a query/create temporary table - cost estimate simple formulae for computing the
benefit of materialization/partitioning --
effective in practice - optimal strategy -- dynamic programming
decides partitioning w.r.t. the benefit of
materialization
33Experimental evaluation
- benchmark
- enhancement of TPC-H part/supplier relational
data - scale factor 0.1
- system
- 1.4Gz Pentium IV, 256M memory, Windows 2000
- DBMS Oracle 9i with a JDBC interface
- Middleware implemented in Java
34Experimental result
Execution Time (sec)
35Summary ATG
- ATGs
- ensure DTD-conformance automatically
- support recursion/non-determinism naturally
- simple no need for a new language knowing DTD
SQL is all that one needs for writing an ATG - PRATA middleware based on ATGs
- novel technique of combining partitioning and
materialization - a practical solution for DTD-directed publishing
- ATG the first systematic way for DTD-directed
publishing
36Outline
- XML publishing
- DTD-directed publishing VLDB02
- Attribute Transformation Grammars
- PRATA middleware based on ATGs
- Schema-directed integration ongoing work
- Coping with both type and integrity constraints
- Attribute Integration Grammar (ATG)
- Research problems
37Data integration in XML
- multiple, heterogeneous data sources
- integrated data conforming to a schema DTD
integrity constraints - existing systems the lack of schema-conformance
38More about hospital and insurance company
- 3 databases in the hospital
- Patient (SSN, name, trId, policy, date)
- Treatment (SSN, trId)
- Treatment (trId, name)
- Procedure (trId1, trId2)
- Billing (trId, cost)
- Account (SSN, date, amount)
-
39Daily report in XML to insurance company
- Given a particular date, report
- information of all the patients on that date
- SSN, name, policy
- treatments grouped under each patient
- description of the treatments of these patients
hierarchy - cost of all these treatments
- Remark
- Multiple data sources
- Information passing
40XML report
date
report
patientInfo
trIdS1
. . .
treatment
patient
patient
. . .
trId
medication
SSN
. . .
trId
trId
DB1
unbounded
41DTD predefined by the insurance company
- report ? patientInfo,
treatmentInfo, costInfo - patientInfo ? patient
- patient ? SSN, pname,
policy, medication - medication ? trId
- treatmentInfo ? treatment
- treatment ? trId, tname,
procedure - procedure ? treatment
- costInfo ? trCost
- trCost ? trId, cost
42Integrity constraints
- Unary keys, foreign keys
- Key trCost.trId ? trCost
- each treatment is charged only once
- Foreign key treatment.trId ? trCost.trId
- every treatment has cost information reported
- Why not XML-Schema constraints?
- Consistency undecidable one cant determine
whether a schema (DTD constraints) has a
document instance PODS01 - Other technical problems DEXA02
43challenges
- Multiple data sources
- Integrity constraints
- Data dependency controlled derivation
- synthesized attributes bottom-up
- inherited attributes top-down
- set/bag-valued attributes
44Outline
- XML publishing
- DTD-directed publishing VLDB02
- Attribute Transformation Grammars
- PRATA middleware based on ATGs
- Schema-directed integration ongoing work
- Coping with both type and integrity constraints
- Attribute Integration Grammar (ATG)
- Research problems
45Attribute Integration Grammar (ATG)
- Attributes associated with each element type e
- Syn(e) synthesized from children
- Inh(e) inherited from parent and siblings
- Syn(e), Inh(e) tuple or set/bag-valued
- Rules associated with each e ? ?, for e in ?
- Syn(Parent) Syn(e) f(Syn(children),
Inh(e)) - Inh(child) Inh(e) Q(Inh(e),
Syn(sibling) )
ATG
DTD
semantic rules
attributes
46attributes
- Root
- Inh(report) (date) -- the parameter of the
ATG - Syn(report) (tKey bag, trIdS set) for
constraints - patientInfo
- Inh(patientInfo) (date) --- from parent
- Syn(patientInfo) (trIdS1 set) collection of
trIds - treatmentInfo
- Inh(treatmentInfo) (trIdS1 set) --- from
sibling - Syn(treatmentInfo) (trIdS2 set) collection
of trIds - costInfo
- Inh(costInfo) (trIdS2 set) --- from sibling
- Syn(costInfo) (tKey bag) collection of trIds
47Dependency relation
- element type e depends on e
- iff Inh(e) Q ( Syn(e) )
- e must be evaluated before e
- requirement the dependency relation is acyclic
- report ? patientInfo, treatmentInfo,
costInfo - Inh(treatmentInfo).trIdS1 Syn(patientInfo).tr
IdS1 - Inh(costInfo).trIdS2 Syn(treatmentInfo).t
rIdS2 - Inh(patientInfo).date Syn(report).date
report
date
patientInfo
treatmentInfo
costInfo
trIdS1
trIdS2
trId
trId
trId
48Synthesized attributes
- report ? patientInfo, treatmentInfo,
costInfo - Syn(report).trIdS Syn(treatmentInfo).trId
S2 - Syn(report).tKey Syn(costInfo).tKey
report
tKey
trIdS2
patientInfo
treatmentInfo
costInfo
trId
trId
trId
49Inherited attributes of the first subtree
- patientInfo ? patient
- Inh(patient) ? Q1(DB1)
- select SSN, name, policy
- from Patient
- where date Inh(patientInfo).date
- patient ? SSN, pname, policy, medication
- Inh(medication) Inh(patient).SSN
- . . .
- medication ? trId
- Inh(trId) ? Q2(DB1)
- select trId
- from Treatment
- where SSN Inh(medication).SSN
patientInfo
. . .
patient
patient
. . .
medication
. . .
trId
trId
DB1
50Synthesized attributes of the first subtree
- patientInfo ? patient
- Syn(patientInfo).trIdS1 ? Syn(patient).trIdS
- patient ? SSN, pname, policy, medication
- Syn(patient).trIdS Syn(medication).trIdS
-
- medication ? trId
- Syn(medication).trIdS ? Syn(trId).trId
51Inherited attributes of the second subtree
- treatmentInfo ? treatment
- Inh(treatment) ? Inh(treatmentInfo).trIdS1
- treatment ? trId, tname, procedure
- Inh(tname) ? Q3(DB2)
- select name
- from Treatment
- where trId Inh(treatment).trId
- . . .
- procedure ? treatment
- Inh(treatment) ? Q4(DB2)
- select trId2
- from Procedure
- where trId1
Inh(procedure).trId
treatmentInfo
treatment
treatment
. . .
. . .
procedure
treatment
treatment
. . .
trId
trId
DB2
52Synthesized attributes of the second subtree
- treatmentInfo ? treatment
- Syn(treatmentInfo).trIdS2
- Inh(treatmentInfo).trIdS1 ? ?Syn(treatment).trI
dS - treatment ? trId, tname, procedure
- Syn(treatment).trIdS Syn(procedure).trIdS
- procedure ? treatment
- Syn(procedure).trIdS ? Syn(treatment).trId
treatment
53Inherited attributes of the third subtree
- costInfo ? trCost
- Inh(trCost) ? Inh(costInfo).trIdS2
- trCost ? trId, cost
- Inh(cost) ? Q3(DB2)
- select cost
- from Billing
- where trId Inh(trCost).trId
- . . .
costInfo
. . .
trCost
trCost
cost
trId
DB3
54Inherited attributes of the third subtree
- costInfo ? trCost
- Syn(costInfo).tKey ? Syn(trCost).tKey
-- bag union - trCost ? trId, cost
- Syn(trCost).tKey Inh(trCost).trId
-- cast a bag to a set
55Controlled derivation
- data dependency construction of one part of a
document depends on other parts - production refinement n trIdS2
- costInfo ? trCost ? costInfo ?
trCostn - Data driven ATGregular tree grammar
ATGDTD
56Constraint compilation
- Key trCost.trId ? trCost
- unique (report.tKey) -- no
duplicates in the bag - Foreign key treatment.trId ? trCost.trId
- report.trIdS ? set (report.tKey)
- recall report ? patientInfo,
treatmentInfo, costInfo - Syn(report).trIdS Syn(treatmentInfo).t
rIdS2 - Syn(report).tKey Syn(costInfo).tKey
- Semantic rules and attributes related to
constraints are automatically generated --
compilation
57Summary ATG
- ATG the first system-directed XML integration
mechanism - integrating multiple data sources
- capturing both types and constraints in a uniform
way - controlled derivation
- constraint compilation
- relationally complete on multiple data sources
- lightweight no need for explicit wrappers
58Outline
- XML publishing
- DTD-directed publishing VLDB02
- Attribute Transformation Grammars
- PRATA middleware based on ATGs
- Schema-directed integration ongoing work
- Coping with both type and integrity constraints
- Attribute Integration Grammar (ATG)
- Research problems
59Schema-directed XML to XML transformation
- Mapping between XML documents
- XTG XML Transformation Grammar
- DTD augmented with XQuery
- Constraint-compilation
XTG
60Capturing more expressive integrity constraints
- Expressive constraints multi-attribute keys,
foreign keys, inverse constraints, and
non-classical ones PODS02,JACM, JCSS,TOCL - Difficulty consistency analyses undecidable
PODS01
DTD
XML ATG
XML
DB
DB
Web
61Combining XML query and view
- Implementing XQeury with SQL and ATG
- plausible ATG supports recursion
- motivation XML data is often stored in DBMS
- Open question
- what fragment of XQuery ATG SQL?
- incremental evaluation?
XQuery
XML ATG
SQL
data
DB
62Updating XML via relational DBMS
update
- mapping updates to the underlining DBMS via ATG
- motivation making use of DBMS transaction
control - incremental maintenance
- constraint propagation XML constraints --gt
relations - normalizing the relational storage of XML data
- consistency maintenance
DTD
XML ATG
SQL
data
functional dependencies
DB
63Ontology integration
- Ontology a combination of
- definition hierarchy
- inheritance hierarchy
- Integration
- schema mapping information preservation
- data mapping
- Applications
- semantic Web
- scientific databases
64Summary
- ATG the first DTD-directed publishing mechanism
- ATG the first system-directed XML integration
mechanism - Research topics
- many related questions remain open
- these problems are hard yet important
- the idea of ATG/ATG shed light on solving these
problems - ? research projects, papers, systems, . . .