TDD: Research Topics in Distributed Databases - PowerPoint PPT Presentation

About This Presentation

Title:

TDD: Research Topics in Distributed Databases

Description:

data warehouse: a repository of integrated information, ... multi-dimensional: data cubes or hypercubes. highly integrated and summarized: derived data ... – PowerPoint PPT presentation

Number of Views:59

Avg rating:3.0/5.0

Slides: 47

Provided by: infor178

Category:

more less

Transcript and Presenter's Notes

Title: TDD: Research Topics in Distributed Databases

1
TDD Research Topics in Distributed Databases

XML Integration
Warehouse vs. mediator introduction
AIG (warehouse)
Attribute Integration Grammar schema-directed
integration
Optimization techniques
Enosys (mediator)
Active XML (hybrid)

2
XML Integration

Warehouse vs. mediator introduction
distributed systems with middleware architecture
AIG (warehouse)
Attribute Integration Grammar schema-directed
integration
Optimization techniques
Enosys (mediator)
Active XML (hybrid)

3
Middleware data warehouse

data warehouse a repository of integrated
information, available for querying and analysis
data warehousing architectures, algorithms and
tools for integrating data from multiple
databases or other information sources into a
single repository
heterogeneous sources
structured data
object-oriented databases, relational databases,
...
semistructured data
XML documents, Web data, ...
unstructured data
video, audio, ...

4
Warehouse architecture
client applications
data warehouse
integrator
monitor/wrapper
monitor/wrapper
monitor/wrapper
XML
RDB
OODB
5
Monitor/wrapper

A monitor/wrapper for each data source
incrementally added
translation translate an information source into
a common integrating model
change detection detect changes to the
underlying data source and propagate the changes
to the integrator
active databases (triggers condition, event,
action)
logged sources inspecting logs
periodic polling, periodic dumps/snapshots
Data cleaning
detect erroneous/incomplete information to ensure
validity
back flushing return cleaned data to the source

6
integrator

receive change notifications from the
wrapper/monitors and reflect the changes in the
data warehouse.
Typically a rule-based engine
merging information (e.g., skolemiation)
handling references
Data cleaning
removing redundancies and inconsistencies
inserting default values
blocking sources

7
warehouse

Data from data sources are imported into the
warehouse
the underlying data sources are still operational
the data is replicated in the warehouse
The warehouse data is not typically in the same
form and volume as in the underlying sources
metadata and subject-oriented for analytical
purpose (e.g., sales, marketing, finance,
distribution, )
historical data (timespan) cover a long time
frame
multi-dimensional data cubes or hypercubes
highly integrated and summarized derived data
granularity roll up and drill down
large volume of data VLDB (very large database)

8
Applicability

Problem potential inconsistency with the
sources.
Commonly used for relatively static data
when clients require specific, predicable portion
of the available information
when clients require high query performance but
not necessarily the most recent state of the
information
when clients want summarized/aggregated
information such as historical information
examples
scientific data
historical enterprise data
caching frequently requested information

9
Data warehouse vs. materialized views

materialized view is over an individual
structured database, while a warehouse is over a
collection of heterogeneous, distributed data
sources
materialized view typically has the same form as
in the underlying database, while a warehouse
stores highly integrated and summarized data
materialized view modifications occur within the
same transaction updating its underlying
database, while a warehouse may have to deal with
independent sources
sources simply report changes
sources may not have locking capability
integrator is loosely coupled with the sources

10
Mediated system architecture

Virtual approach data is not stored in the
middle tier

client applications
Mediator
wrapper
wrapper
wrapper
XML
RDB
OODB
11
Lazy vs. eager approaches

Lazy approach (mediated systems)
accept a query, determine the appropriate set of
data sources, generate sub-queries for each data
source
obtain results from the data sources, perform
translation, filtering and composing, and return
the final answer
Eager approach (warehouses)
information from each source that may be of
interest is extracted in advance, translated,
filtered, merged with relevant sources, and
stored in a repository
query is evaluated directly against the
repository, without accessing the original
information sources

12
Data warehouse vs. mediated systems

Efficiency
response time at the warehouse, queries can be
answered efficiently without accessing original
data sources. Advantageous when data sources are
slow, expensive or periodically unavailable, or
when translation, filtering and merging require
significant processing
space warehousing consumes extra storage space
Extensibility warehouse
consistency with the sources warehouse data may
become out of date
applicability
warehouses for high query performance and static
data
mediated systems for information that changes
rapidly

13
XML Integration

Warehouse vs. mediator introduction
AIG (warehouse)
Attribute Integration Grammar schema-directed
integration
Optimization techniques
Enosys (mediator)
Active XML (hybrid)

14
Data integration in XML for data exchange

multiple, heterogeneous data sources
multi-source queries, query decomposition, object
fusion,
distributed sources scheduling of query
execution
schema-conformance,

XML
15
Schema-directed integration

data exchange

Integration
extract relevant data from distributed, multiple
databases
construct an XML view
Schema-directed conformance to a predefined
schema (D, ?)
D a DTD, type constraints
? a set of XML integrity constraints (keys,
foreign keys)

16
Example hospital and insurance company

patient (SSN, name, policy)
visitInfo (SSN, trId, date)

billing (trId, price)
cover (policy, trId)
daily XML report
treatment (trId, name) procedure (trId1, trId2)

Given a date, for each patient of that day,
report
SSN, name (DB1)
treatments (hierarchy) covered by insurance
(DB1, DB3, DB4)
cost of all and only those treatments received
(DB2)

17
Predefined schema (D, ?) DTD

report ? patient
patient ? SSN, name,
treatments, bill
treatments ? treatment
treatment ? trId, tname,
treatments
bill ? item
item ? trId, price

report
date
patient
patient
patient
treatments
bill
name
SSN
item
treatment
item
treatment
. . .
price
trId
trId
tname
treatments
18
Predefined schema (D, ?) XML constraints ?

constraints relative to each patient,
key each treatment is charged only once
patient ( item.trId ? item )
foreign key every treatment has a billing record
patient ( treatment.trId ? item.trId )

report
date
patient
patient
patient
name
treatments
SSN
bill
treatment
item
item
treatment
. . .
price
trId
treatments
19
More on XML integrity constraints

absolute on the entire document
key country.name ? country
relative on a subdocument rooted at a country
key country (province.name ?
province)
foreign key country (capital.inProvince ?
province.name)
(value inclusion)

20
Challenge nondeterministic structure

certain structure cannot be decided at compile
time
recursion expansion - the depth of XML tree
treatments ? treatment
treatment ? trId, tname, treatments
--- recursive
choice of children in a disjunction, e.g. A ? B
C

report
date
patient
patient
patient
treatments
bill
name
SSN
item
treatment
item
treatment
price
trId
trId
tname
treatments
. . .
unbounded --gt
21
Challenge context dependency

bill subtree all and only the trIds in the
treatments subtree
controlled derivation the bill subtree cannot be
started before the treatments subtree is
completed.
information passing downward, upward, sideways

date
report
patient
date
SSN
treatments
name
SSN
bill
trIdS
trId
trId
unbounded
22
Challenges

DTD-conformance recursive, nondeterministic
integrity constraints validation during
document generation
multi-source queries a single one involves
several databases
context-dependency not strictly top-down or
bottom-up
Previous work
XML publishing single data source, no
constraints, top-down.
XML integration little support for XML
schema-conformance.
XML query languages type checking provide no
guidance for how to ensure schema-conformance
optimization hard.

Schema-directed XML integration is nontrivial!
23
Middleware for schema-directed integration

view definition language
optimization techniques

A lightweight language Attribute Integration
Grammar (AIG)
Cost-based optimization in light of
context-dependency

DTD constraints
semantic attributes

semantic rules
24
Attribute Integration Grammar (AIG)

DTD element type definitions e ? ?
? PCDATA ? e1, , en
e1 en e
Attributes associated with each element type e
Inh(e) inherited from parent/siblings
(top-down/sideways)
Syn(e) synthesized from children (bottom-up)
Syn(e), Inh(e) tuple or set/bag-valued
Rules associated with each production e ? ?, for
e in ?
Inh(child) Inh(e) Q(Inh(parent),
Syn(sibling) )
Syn(parent) Syn(e) U (Syn(children))
-- union
Q multi-source SQL query with parameters
Dependency e2 must be evaluated before e1 if
Inh(e1) Q( Syn(e2) ) -- acyclic graph
(DAG)

25
AIG semantics conceptual evaluation

following the dependency ordering starting from
the root
report ? patient
Inh(patient) ? Q1 (Inh(report))
select Inh(report) as date, p.SSN,
p.name, p.policy
from DB1 patient p, DB1
visitInfo v
where p.SSN v.SSN and v.date
Inh(report)
Recall DB1 patient (SSN, name, policy),
visitInfo (SSN, trId, date)
Parameter in a query Inh(report) as a constant
Data driven the number of patients depends on Q1

report
Inh
patient
patient
patient
26
Multi-source query

patient ? SSN, name, treatments, bill
Inh(SSN) Inh(patient).SSN, . . . ,
Inh(treatments) Q2(v) -- v
Inh(patient)
select t.trId, t.tname
from DB1 visitInfo i, DB3 cover
c, DB4 treatment t
where i.SSN v.SSN and i.date
v.date and t.trId i.trId
and c.trId i.trId and c.policy
v.policy
a single query uses DB1, DB3 and DB4
tuple- and set-valued attributes (Inh(SSN),
Inh(treatments))

Recall DB1 patient(SSN, name, policy),
visitInfo(SSN, trId, date)
DB3 cover (policy, trId)
DB4 treatment (trId, name),
procedure (trId1, trId2)

27
Initial top-down pass context-dependent

patient ? SSN, name, treatments, bill
Inh(SSN) Inh(patient).SSN, Inh(name)
Inh(patient).name,
Inh(treatments) Q2(Inh(patient))
Inh(bill) Syn(treatments) lt--
halt
DTD-directed generate children following the
production
Inh(bill) defined with sibling --
Syn(treatments), dependency ordering evaluate
bill after treatments

lt- - halt
28
Initial top-down pass recursion

treatments ? treatment
Inh(treatment) ? Inh(treatments) -- set of
(trId, tname)
Data driven treatments expansion depends on
Inh(treatments)
empty expansion terminates, Syn(treatments) is
empty
nonempty expands.

29
Leaf step

treatment ? trId, tname, treatments
Inh(trId) Inh(treatment).trId, . . .,
trId ? PCDATA

Syn(trId) Inh(trId)
treatments
Inh (trId, tname)
treatment
treatment
treatments
tname
trId
Inh
30
Bottom-up step synthesize attributes

treatment ? trId, tname, treatments
treatments ? treatment
Syn(treatments) U Syn(treatment)
Processing of an element e Inh(e) ? subtree(e) ?
Syn(e)

Syn(treatment) Syn(trId) ? Syn(treatments)
31
Sideways step controlled derivation

patient ? SSN, name, treatments, bill
Inh(bill) Syn(treatments)
bill ? item
Inh(item) ? Q(Inh(bill) )
select trId, price
from DB2 billing
where trId in Inh(bill) -- set
membership test
DTD-directed each step of construction follows a
production

Recall DB2 billing (trId, price)
32
Constraint compilation

Captured with rules on synthesized attributes of
patient
trIdB bag-valued, collecting trIds under item
trIdS1 set-valued, collecting trIds under
treatment
trIdS2 set-valued, collecting trIds under item
key patient ( item.trId ? item )
unique (Syn(patient).trIdB) -- no
duplicates in the bag
foreign key patient ( treatment.trId ?
item.trId )
subset ( Syn(patient).trIdS1,
Syn(patient).trIdS2)
compilation semantic rules and attributes for
constraints are automatically generated and
evaluated

33
Advantages of AIG

DTD-directed view definition automatically
ensures conformance to DTD -- recursive,
nondeterministic
Constraint compilation automatically captures
integrity constraints in a uniform framework
performance avoid post-materialization checking
optimization jointly with query evaluation
exception handling actions when constraints are
violated
Controlled derivation supports context-dependent
generation.
Information passing top-down, bottom-up,
sideways
Multi-source queries optimizer-based
decomposition
One sweep each node is visited at most twice
evaluates its inherited attribute, subtree, then
its synthesized attribute

34
Middleware evaluation of AIGs
optimizer
AIG
XML
merging
query plan execution
pre-processing
tagging
scheduling
data
query
DB3
cost statistics
DB2

pre-processing
constraint compilation
multi-source query decomposition ? single-source
queries
optimizer query plan generation using cost
statistics
execution SQL queries ? data sources, results
? mediator
tagging relational tables (paths from root) ?
XML view via merge-sorting

35
Optimization

Goal reduce response time
costs query execution, data transfer, storage
(caching)
constraints query dependency graph (DAG)
nodes queries computing inherited/synthesized
attributes
edges dependency relation (producer-consumer
relation)
recursion iterative unfolding by a certain depth
Optimization techniques
query merging
query scheduling

36
Query scheduling

Goal reduce the total response time by
increasing parallelism
Ordering execution of queries on the same site
e.g., ltQ4, Q5, Q3gt vs. ltQ3, Q5, Q4gt on DB2
Constraints
costs execution, communication, caching
overheads
dependency relation
Finding an optimal schedule NP-hard

37
Query merging

Goal reduce DB visits, leverage DB optimizer
Composition of queries on the same site
outer-join/union
Tradeoffs
large result tables with null
communication/caching cost
impact on scheduling -- changing query dependency
graph

Finding an optimal strategy for merging and
scheduling NP-hard
38
Cost-based heuristic scheduling

cost estimate given a fixed schedule, estimate
the completion time of a query Q, comp_time(Q)
statistics eval_cost(Q), size(Q), trans_cost(S,
S, size)
dependency Q cant start before comp_time(Q) if
Q -gt Q
scheduling given a fixed query dependency graph,
a heuristic based on dynamic programming to
find the most costly trailing path of each
query
sort queries to favor critical paths

39
Cost-based heuristic merging

greedy algorithm repeat until no further
improvement
merge each pair of queries on the same source
modify the query dependency graph accordingly
invoke scheduling w.r.t. the modified dependency
estimate the cost
pick the pair with the biggest improvement to
merge

Interaction between query scheduling and merging
40
Preliminary experimental study
Card(3-way self join) 4055 Card(4-way self
join) 6837

Data set running example insurance/hospital
recursion unfolding (depth) 3, 4, 5 --
treatments
data size small, medium, large generated via
ToXgene
System
DB2 v8.1 for Linux
simulation of data transfer bandwidth1Mbps

41
Experimental results ratio of improvement
438/193
61/34
42
AIG Summary

AIG a novel specification language
ensures DTD-conformance (recursion/nondeterminism)
captures integrity constraints in a uniform
framework
supports complex transformations controlled
derivation (context-dependency), multi-source
queries, . . .
Optimization techniques nontrivial optimization
problems
constraint compilation, multi-source query
decomposition
query scheduling w.r.t. query dependency graph
query merging and its interaction with scheduling

A systematic framework for schema-directed XML
integration
43
XML Integration

Warehouse vs. mediator introduction
AIG (warehouse)
Attribute Integration Grammar schema-directed
integration
Optimization techniques
Enosys (mediator)
Active XML (hybrid)

44
Enosys a mediated system

XMLizer wrapper, converting source to virtual
XML view
Mediator export virtual integrated XML (VIX)
database
Translator rewrites XML queries to intermediate
algebraic exps
Rewriter decompose
multi-source queries
Optimizer query plan
generation
Execution sends
requests to wrappers,
compose results,
tagging,

query
Mediator
Translator
Rewriter/optimizer
Execution engine
XMLizer
XMLizer
XMLizer
source
source
source
45
Active XML a hybrid of warehouse and mediator

XML doc template a mix of
data nodes -- materialization
function nodes embedded calls to
Web services, queries -- virtual
Integration
data nodes concrete data (static)
function nodes extract up-to-date data (dynamic)
Data exchange materialization before/after
sending the doc
After smaller doc (less transmission cost) --
Web services can be accessed from the receiver
site
Before for security (access control) and
capability reasons
Optimization very hard
Target-schema conformance? Fixed doc template

46
Summary and review

What are main approaches to integrating XML data?
What are the major difficulties?
For what applications warehousing is preferable
to a mediator based approach?
Understand AIG, Enosys and Active XML
Given an AIG and relational sources, you should
be able to
understand how AIG works by providing the
integrated XML data
understand the optimization and evaluation
process
Can you combine Active XML and ATG to ensure
target schema-conformance?
Find and read papers about the GUI of Enosys
Find and read papers on data integration based on
Active XML