Title: TDD: Research Topics in Distributed Databases
1TDD Research Topics in Distributed Databases
- XML Integration
- Warehouse vs. mediator introduction
- AIG (warehouse)
- Attribute Integration Grammar schema-directed
integration - Optimization techniques
- Enosys (mediator)
- Active XML (hybrid)
2XML Integration
- Warehouse vs. mediator introduction
- distributed systems with middleware architecture
- AIG (warehouse)
- Attribute Integration Grammar schema-directed
integration - Optimization techniques
- Enosys (mediator)
- Active XML (hybrid)
3Middleware data warehouse
- data warehouse a repository of integrated
information, available for querying and analysis - data warehousing architectures, algorithms and
tools for integrating data from multiple
databases or other information sources into a
single repository - heterogeneous sources
- structured data
- object-oriented databases, relational databases,
... - semistructured data
- XML documents, Web data, ...
- unstructured data
- video, audio, ...
4Warehouse architecture
client applications
data warehouse
integrator
monitor/wrapper
monitor/wrapper
monitor/wrapper
XML
RDB
OODB
5Monitor/wrapper
- A monitor/wrapper for each data source
incrementally added - translation translate an information source into
a common integrating model - change detection detect changes to the
underlying data source and propagate the changes
to the integrator - active databases (triggers condition, event,
action) - logged sources inspecting logs
- periodic polling, periodic dumps/snapshots
- Data cleaning
- detect erroneous/incomplete information to ensure
validity - back flushing return cleaned data to the source
6integrator
- receive change notifications from the
wrapper/monitors and reflect the changes in the
data warehouse. - Typically a rule-based engine
- merging information (e.g., skolemiation)
- handling references
- Data cleaning
- removing redundancies and inconsistencies
- inserting default values
- blocking sources
7warehouse
- Data from data sources are imported into the
warehouse - the underlying data sources are still operational
- the data is replicated in the warehouse
- The warehouse data is not typically in the same
form and volume as in the underlying sources - metadata and subject-oriented for analytical
purpose (e.g., sales, marketing, finance,
distribution, ) - historical data (timespan) cover a long time
frame - multi-dimensional data cubes or hypercubes
- highly integrated and summarized derived data
- granularity roll up and drill down
- large volume of data VLDB (very large database)
8Applicability
- Problem potential inconsistency with the
sources. - Commonly used for relatively static data
- when clients require specific, predicable portion
of the available information - when clients require high query performance but
not necessarily the most recent state of the
information - when clients want summarized/aggregated
information such as historical information - examples
- scientific data
- historical enterprise data
- caching frequently requested information
9Data warehouse vs. materialized views
- materialized view is over an individual
structured database, while a warehouse is over a
collection of heterogeneous, distributed data
sources - materialized view typically has the same form as
in the underlying database, while a warehouse
stores highly integrated and summarized data - materialized view modifications occur within the
same transaction updating its underlying
database, while a warehouse may have to deal with
independent sources - sources simply report changes
- sources may not have locking capability
- integrator is loosely coupled with the sources
10Mediated system architecture
- Virtual approach data is not stored in the
middle tier
client applications
Mediator
wrapper
wrapper
wrapper
XML
RDB
OODB
11Lazy vs. eager approaches
- Lazy approach (mediated systems)
- accept a query, determine the appropriate set of
data sources, generate sub-queries for each data
source - obtain results from the data sources, perform
translation, filtering and composing, and return
the final answer -
- Eager approach (warehouses)
- information from each source that may be of
interest is extracted in advance, translated,
filtered, merged with relevant sources, and
stored in a repository - query is evaluated directly against the
repository, without accessing the original
information sources
12Data warehouse vs. mediated systems
- Efficiency
- response time at the warehouse, queries can be
answered efficiently without accessing original
data sources. Advantageous when data sources are
slow, expensive or periodically unavailable, or
when translation, filtering and merging require
significant processing - space warehousing consumes extra storage space
- Extensibility warehouse
- consistency with the sources warehouse data may
become out of date - applicability
- warehouses for high query performance and static
data - mediated systems for information that changes
rapidly
13XML Integration
- Warehouse vs. mediator introduction
- AIG (warehouse)
- Attribute Integration Grammar schema-directed
integration - Optimization techniques
- Enosys (mediator)
- Active XML (hybrid)
14Data integration in XML for data exchange
- multiple, heterogeneous data sources
multi-source queries, query decomposition, object
fusion, - distributed sources scheduling of query
execution - schema-conformance,
XML
15Schema-directed integration
data exchange
- Integration
- extract relevant data from distributed, multiple
databases - construct an XML view
- Schema-directed conformance to a predefined
schema (D, ?) - D a DTD, type constraints
- ? a set of XML integrity constraints (keys,
foreign keys)
16Example hospital and insurance company
- patient (SSN, name, policy)
- visitInfo (SSN, trId, date)
billing (trId, price)
cover (policy, trId)
daily XML report
treatment (trId, name) procedure (trId1, trId2)
- Given a date, for each patient of that day,
report - SSN, name (DB1)
- treatments (hierarchy) covered by insurance
(DB1, DB3, DB4) - cost of all and only those treatments received
(DB2)
17Predefined schema (D, ?) DTD
- report ? patient
- patient ? SSN, name,
treatments, bill - treatments ? treatment
- treatment ? trId, tname,
treatments - bill ? item
- item ? trId, price
report
date
patient
patient
patient
treatments
bill
name
SSN
item
treatment
item
treatment
. . .
price
trId
trId
tname
treatments
18Predefined schema (D, ?) XML constraints ?
- constraints relative to each patient,
- key each treatment is charged only once
- patient ( item.trId ? item )
- foreign key every treatment has a billing record
- patient ( treatment.trId ? item.trId )
report
date
patient
patient
patient
name
treatments
SSN
bill
treatment
item
item
treatment
. . .
price
trId
treatments
19More on XML integrity constraints
- absolute on the entire document
- key country.name ? country
- relative on a subdocument rooted at a country
- key country (province.name ?
province) - foreign key country (capital.inProvince ?
province.name) - (value inclusion)
20Challenge nondeterministic structure
- certain structure cannot be decided at compile
time - recursion expansion - the depth of XML tree
- treatments ? treatment
- treatment ? trId, tname, treatments
--- recursive - choice of children in a disjunction, e.g. A ? B
C
report
date
patient
patient
patient
treatments
bill
name
SSN
item
treatment
item
treatment
price
trId
trId
tname
treatments
. . .
unbounded --gt
21Challenge context dependency
- bill subtree all and only the trIds in the
treatments subtree - controlled derivation the bill subtree cannot be
started before the treatments subtree is
completed. - information passing downward, upward, sideways
date
report
patient
date
SSN
treatments
name
SSN
bill
trIdS
trId
trId
unbounded
22Challenges
- DTD-conformance recursive, nondeterministic
- integrity constraints validation during
document generation - multi-source queries a single one involves
several databases - context-dependency not strictly top-down or
bottom-up - Previous work
- XML publishing single data source, no
constraints, top-down. - XML integration little support for XML
schema-conformance. - XML query languages type checking provide no
guidance for how to ensure schema-conformance
optimization hard.
Schema-directed XML integration is nontrivial!
23Middleware for schema-directed integration
view definition language
optimization techniques
- A lightweight language Attribute Integration
Grammar (AIG) - Cost-based optimization in light of
context-dependency
DTD constraints
semantic attributes
semantic rules
24Attribute Integration Grammar (AIG)
- DTD element type definitions e ? ?
- ? PCDATA ? e1, , en
e1 en e - Attributes associated with each element type e
- Inh(e) inherited from parent/siblings
(top-down/sideways) - Syn(e) synthesized from children (bottom-up)
- Syn(e), Inh(e) tuple or set/bag-valued
- Rules associated with each production e ? ?, for
e in ? - Inh(child) Inh(e) Q(Inh(parent),
Syn(sibling) ) - Syn(parent) Syn(e) U (Syn(children))
-- union - Q multi-source SQL query with parameters
- Dependency e2 must be evaluated before e1 if
- Inh(e1) Q( Syn(e2) ) -- acyclic graph
(DAG)
25AIG semantics conceptual evaluation
- following the dependency ordering starting from
the root - report ? patient
- Inh(patient) ? Q1 (Inh(report))
- select Inh(report) as date, p.SSN,
p.name, p.policy - from DB1 patient p, DB1
visitInfo v - where p.SSN v.SSN and v.date
Inh(report) - Recall DB1 patient (SSN, name, policy),
visitInfo (SSN, trId, date) - Parameter in a query Inh(report) as a constant
- Data driven the number of patients depends on Q1
report
Inh
patient
patient
patient
26Multi-source query
- patient ? SSN, name, treatments, bill
- Inh(SSN) Inh(patient).SSN, . . . ,
- Inh(treatments) Q2(v) -- v
Inh(patient) - select t.trId, t.tname
- from DB1 visitInfo i, DB3 cover
c, DB4 treatment t - where i.SSN v.SSN and i.date
v.date and t.trId i.trId - and c.trId i.trId and c.policy
v.policy - a single query uses DB1, DB3 and DB4
- tuple- and set-valued attributes (Inh(SSN),
Inh(treatments))
- Recall DB1 patient(SSN, name, policy),
visitInfo(SSN, trId, date) - DB3 cover (policy, trId)
- DB4 treatment (trId, name),
procedure (trId1, trId2)
27Initial top-down pass context-dependent
- patient ? SSN, name, treatments, bill
- Inh(SSN) Inh(patient).SSN, Inh(name)
Inh(patient).name, - Inh(treatments) Q2(Inh(patient))
- Inh(bill) Syn(treatments) lt--
halt - DTD-directed generate children following the
production - Inh(bill) defined with sibling --
Syn(treatments), dependency ordering evaluate
bill after treatments
lt- - halt
28Initial top-down pass recursion
- treatments ? treatment
- Inh(treatment) ? Inh(treatments) -- set of
(trId, tname) - Data driven treatments expansion depends on
Inh(treatments) - empty expansion terminates, Syn(treatments) is
empty - nonempty expands.
29Leaf step
- treatment ? trId, tname, treatments
- Inh(trId) Inh(treatment).trId, . . .,
-
- trId ? PCDATA
Syn(trId) Inh(trId)
treatments
Inh (trId, tname)
treatment
treatment
treatments
tname
trId
Inh
30Bottom-up step synthesize attributes
- treatment ? trId, tname, treatments
- treatments ? treatment
- Syn(treatments) U Syn(treatment)
- Processing of an element e Inh(e) ? subtree(e) ?
Syn(e)
Syn(treatment) Syn(trId) ? Syn(treatments)
31Sideways step controlled derivation
- patient ? SSN, name, treatments, bill
- Inh(bill) Syn(treatments)
- bill ? item
- Inh(item) ? Q(Inh(bill) )
- select trId, price
- from DB2 billing
- where trId in Inh(bill) -- set
membership test - DTD-directed each step of construction follows a
production
Recall DB2 billing (trId, price)
32Constraint compilation
- Captured with rules on synthesized attributes of
patient - trIdB bag-valued, collecting trIds under item
- trIdS1 set-valued, collecting trIds under
treatment - trIdS2 set-valued, collecting trIds under item
- key patient ( item.trId ? item )
- unique (Syn(patient).trIdB) -- no
duplicates in the bag - foreign key patient ( treatment.trId ?
item.trId ) - subset ( Syn(patient).trIdS1,
Syn(patient).trIdS2) - compilation semantic rules and attributes for
constraints are automatically generated and
evaluated
33Advantages of AIG
- DTD-directed view definition automatically
ensures conformance to DTD -- recursive,
nondeterministic - Constraint compilation automatically captures
integrity constraints in a uniform framework - performance avoid post-materialization checking
- optimization jointly with query evaluation
- exception handling actions when constraints are
violated - Controlled derivation supports context-dependent
generation. - Information passing top-down, bottom-up,
sideways - Multi-source queries optimizer-based
decomposition - One sweep each node is visited at most twice
evaluates its inherited attribute, subtree, then
its synthesized attribute
34Middleware evaluation of AIGs
optimizer
AIG
XML
merging
query plan execution
pre-processing
tagging
scheduling
data
query
DB3
cost statistics
DB2
- pre-processing
- constraint compilation
- multi-source query decomposition ? single-source
queries - optimizer query plan generation using cost
statistics - execution SQL queries ? data sources, results
? mediator - tagging relational tables (paths from root) ?
XML view via merge-sorting
35Optimization
- Goal reduce response time
- costs query execution, data transfer, storage
(caching) - constraints query dependency graph (DAG)
- nodes queries computing inherited/synthesized
attributes - edges dependency relation (producer-consumer
relation) - recursion iterative unfolding by a certain depth
- Optimization techniques
- query merging
- query scheduling
36Query scheduling
- Goal reduce the total response time by
increasing parallelism - Ordering execution of queries on the same site
- e.g., ltQ4, Q5, Q3gt vs. ltQ3, Q5, Q4gt on DB2
- Constraints
- costs execution, communication, caching
overheads - dependency relation
- Finding an optimal schedule NP-hard
37Query merging
- Goal reduce DB visits, leverage DB optimizer
- Composition of queries on the same site
outer-join/union - Tradeoffs
- large result tables with null
communication/caching cost - impact on scheduling -- changing query dependency
graph
Finding an optimal strategy for merging and
scheduling NP-hard
38Cost-based heuristic scheduling
- cost estimate given a fixed schedule, estimate
the completion time of a query Q, comp_time(Q) - statistics eval_cost(Q), size(Q), trans_cost(S,
S, size) - dependency Q cant start before comp_time(Q) if
Q -gt Q - scheduling given a fixed query dependency graph,
a heuristic based on dynamic programming to - find the most costly trailing path of each
query - sort queries to favor critical paths
39Cost-based heuristic merging
- greedy algorithm repeat until no further
improvement - merge each pair of queries on the same source
- modify the query dependency graph accordingly
- invoke scheduling w.r.t. the modified dependency
- estimate the cost
- pick the pair with the biggest improvement to
merge
Interaction between query scheduling and merging
40Preliminary experimental study
Card(3-way self join) 4055 Card(4-way self
join) 6837
- Data set running example insurance/hospital
- recursion unfolding (depth) 3, 4, 5 --
treatments - data size small, medium, large generated via
ToXgene - System
- DB2 v8.1 for Linux
- simulation of data transfer bandwidth1Mbps
41Experimental results ratio of improvement
438/193
61/34
42AIG Summary
- AIG a novel specification language
- ensures DTD-conformance (recursion/nondeterminism)
- captures integrity constraints in a uniform
framework - supports complex transformations controlled
derivation (context-dependency), multi-source
queries, . . . - Optimization techniques nontrivial optimization
problems - constraint compilation, multi-source query
decomposition - query scheduling w.r.t. query dependency graph
- query merging and its interaction with scheduling
A systematic framework for schema-directed XML
integration
43XML Integration
- Warehouse vs. mediator introduction
- AIG (warehouse)
- Attribute Integration Grammar schema-directed
integration - Optimization techniques
- Enosys (mediator)
- Active XML (hybrid)
44Enosys a mediated system
- XMLizer wrapper, converting source to virtual
XML view - Mediator export virtual integrated XML (VIX)
database - Translator rewrites XML queries to intermediate
algebraic exps - Rewriter decompose
- multi-source queries
- Optimizer query plan
- generation
- Execution sends
- requests to wrappers,
- compose results,
- tagging,
query
Mediator
Translator
Rewriter/optimizer
Execution engine
XMLizer
XMLizer
XMLizer
source
source
source
45Active XML a hybrid of warehouse and mediator
- XML doc template a mix of
- data nodes -- materialization
- function nodes embedded calls to
- Web services, queries -- virtual
- Integration
- data nodes concrete data (static)
- function nodes extract up-to-date data (dynamic)
- Data exchange materialization before/after
sending the doc - After smaller doc (less transmission cost) --
Web services can be accessed from the receiver
site - Before for security (access control) and
capability reasons - Optimization very hard
- Target-schema conformance? Fixed doc template
46Summary and review
- What are main approaches to integrating XML data?
What are the major difficulties? - For what applications warehousing is preferable
to a mediator based approach? - Understand AIG, Enosys and Active XML
- Given an AIG and relational sources, you should
be able to - understand how AIG works by providing the
integrated XML data - understand the optimization and evaluation
process - Can you combine Active XML and ATG to ensure
target schema-conformance? - Find and read papers about the GUI of Enosys
- Find and read papers on data integration based on
Active XML