TDD: Research Topics in Distributed Databases - PowerPoint PPT Presentation

About This Presentation
Title:

TDD: Research Topics in Distributed Databases

Description:

data warehouse: a repository of integrated information, ... multi-dimensional: data cubes or hypercubes. highly integrated and summarized: derived data ... – PowerPoint PPT presentation

Number of Views:59
Avg rating:3.0/5.0
Slides: 47
Provided by: infor178
Category:

less

Transcript and Presenter's Notes

Title: TDD: Research Topics in Distributed Databases


1
TDD Research Topics in Distributed Databases
  • XML Integration
  • Warehouse vs. mediator introduction
  • AIG (warehouse)
  • Attribute Integration Grammar schema-directed
    integration
  • Optimization techniques
  • Enosys (mediator)
  • Active XML (hybrid)

2
XML Integration
  • Warehouse vs. mediator introduction
  • distributed systems with middleware architecture
  • AIG (warehouse)
  • Attribute Integration Grammar schema-directed
    integration
  • Optimization techniques
  • Enosys (mediator)
  • Active XML (hybrid)

3
Middleware data warehouse
  • data warehouse a repository of integrated
    information, available for querying and analysis
  • data warehousing architectures, algorithms and
    tools for integrating data from multiple
    databases or other information sources into a
    single repository
  • heterogeneous sources
  • structured data
  • object-oriented databases, relational databases,
    ...
  • semistructured data
  • XML documents, Web data, ...
  • unstructured data
  • video, audio, ...

4
Warehouse architecture
client applications
data warehouse
integrator
monitor/wrapper
monitor/wrapper
monitor/wrapper
XML
RDB
OODB
5
Monitor/wrapper
  • A monitor/wrapper for each data source
    incrementally added
  • translation translate an information source into
    a common integrating model
  • change detection detect changes to the
    underlying data source and propagate the changes
    to the integrator
  • active databases (triggers condition, event,
    action)
  • logged sources inspecting logs
  • periodic polling, periodic dumps/snapshots
  • Data cleaning
  • detect erroneous/incomplete information to ensure
    validity
  • back flushing return cleaned data to the source

6
integrator
  • receive change notifications from the
    wrapper/monitors and reflect the changes in the
    data warehouse.
  • Typically a rule-based engine
  • merging information (e.g., skolemiation)
  • handling references
  • Data cleaning
  • removing redundancies and inconsistencies
  • inserting default values
  • blocking sources

7
warehouse
  • Data from data sources are imported into the
    warehouse
  • the underlying data sources are still operational
  • the data is replicated in the warehouse
  • The warehouse data is not typically in the same
    form and volume as in the underlying sources
  • metadata and subject-oriented for analytical
    purpose (e.g., sales, marketing, finance,
    distribution, )
  • historical data (timespan) cover a long time
    frame
  • multi-dimensional data cubes or hypercubes
  • highly integrated and summarized derived data
  • granularity roll up and drill down
  • large volume of data VLDB (very large database)

8
Applicability
  • Problem potential inconsistency with the
    sources.
  • Commonly used for relatively static data
  • when clients require specific, predicable portion
    of the available information
  • when clients require high query performance but
    not necessarily the most recent state of the
    information
  • when clients want summarized/aggregated
    information such as historical information
  • examples
  • scientific data
  • historical enterprise data
  • caching frequently requested information

9
Data warehouse vs. materialized views
  • materialized view is over an individual
    structured database, while a warehouse is over a
    collection of heterogeneous, distributed data
    sources
  • materialized view typically has the same form as
    in the underlying database, while a warehouse
    stores highly integrated and summarized data
  • materialized view modifications occur within the
    same transaction updating its underlying
    database, while a warehouse may have to deal with
    independent sources
  • sources simply report changes
  • sources may not have locking capability
  • integrator is loosely coupled with the sources

10
Mediated system architecture
  • Virtual approach data is not stored in the
    middle tier

client applications
Mediator
wrapper
wrapper
wrapper
XML
RDB
OODB
11
Lazy vs. eager approaches
  • Lazy approach (mediated systems)
  • accept a query, determine the appropriate set of
    data sources, generate sub-queries for each data
    source
  • obtain results from the data sources, perform
    translation, filtering and composing, and return
    the final answer
  • Eager approach (warehouses)
  • information from each source that may be of
    interest is extracted in advance, translated,
    filtered, merged with relevant sources, and
    stored in a repository
  • query is evaluated directly against the
    repository, without accessing the original
    information sources

12
Data warehouse vs. mediated systems
  • Efficiency
  • response time at the warehouse, queries can be
    answered efficiently without accessing original
    data sources. Advantageous when data sources are
    slow, expensive or periodically unavailable, or
    when translation, filtering and merging require
    significant processing
  • space warehousing consumes extra storage space
  • Extensibility warehouse
  • consistency with the sources warehouse data may
    become out of date
  • applicability
  • warehouses for high query performance and static
    data
  • mediated systems for information that changes
    rapidly

13
XML Integration
  • Warehouse vs. mediator introduction
  • AIG (warehouse)
  • Attribute Integration Grammar schema-directed
    integration
  • Optimization techniques
  • Enosys (mediator)
  • Active XML (hybrid)

14
Data integration in XML for data exchange
  • multiple, heterogeneous data sources
    multi-source queries, query decomposition, object
    fusion,
  • distributed sources scheduling of query
    execution
  • schema-conformance,

XML
15
Schema-directed integration

data exchange
  • Integration
  • extract relevant data from distributed, multiple
    databases
  • construct an XML view
  • Schema-directed conformance to a predefined
    schema (D, ?)
  • D a DTD, type constraints
  • ? a set of XML integrity constraints (keys,
    foreign keys)

16
Example hospital and insurance company
  • patient (SSN, name, policy)
  • visitInfo (SSN, trId, date)

billing (trId, price)
cover (policy, trId)
daily XML report
treatment (trId, name) procedure (trId1, trId2)
  • Given a date, for each patient of that day,
    report
  • SSN, name (DB1)
  • treatments (hierarchy) covered by insurance
    (DB1, DB3, DB4)
  • cost of all and only those treatments received
    (DB2)

17
Predefined schema (D, ?) DTD
  • report ? patient
  • patient ? SSN, name,
    treatments, bill
  • treatments ? treatment
  • treatment ? trId, tname,
    treatments
  • bill ? item
  • item ? trId, price

report
date
patient
patient
patient
treatments
bill
name
SSN
item
treatment
item
treatment
. . .
price
trId
trId
tname
treatments
18
Predefined schema (D, ?) XML constraints ?
  • constraints relative to each patient,
  • key each treatment is charged only once
  • patient ( item.trId ? item )
  • foreign key every treatment has a billing record
  • patient ( treatment.trId ? item.trId )

report
date
patient
patient
patient
name
treatments
SSN
bill
treatment
item
item
treatment
. . .
price
trId
treatments
19
More on XML integrity constraints
  • absolute on the entire document
  • key country.name ? country
  • relative on a subdocument rooted at a country
  • key country (province.name ?
    province)
  • foreign key country (capital.inProvince ?
    province.name)
  • (value inclusion)

20
Challenge nondeterministic structure
  • certain structure cannot be decided at compile
    time
  • recursion expansion - the depth of XML tree
  • treatments ? treatment
  • treatment ? trId, tname, treatments
    --- recursive
  • choice of children in a disjunction, e.g. A ? B
    C

report
date
patient
patient
patient
treatments
bill
name
SSN
item
treatment
item
treatment
price
trId
trId
tname
treatments
. . .
unbounded --gt
21
Challenge context dependency
  • bill subtree all and only the trIds in the
    treatments subtree
  • controlled derivation the bill subtree cannot be
    started before the treatments subtree is
    completed.
  • information passing downward, upward, sideways

date
report
patient
date
SSN
treatments
name
SSN
bill
trIdS
trId
trId
unbounded
22
Challenges
  • DTD-conformance recursive, nondeterministic
  • integrity constraints validation during
    document generation
  • multi-source queries a single one involves
    several databases
  • context-dependency not strictly top-down or
    bottom-up
  • Previous work
  • XML publishing single data source, no
    constraints, top-down.
  • XML integration little support for XML
    schema-conformance.
  • XML query languages type checking provide no
    guidance for how to ensure schema-conformance
    optimization hard.

Schema-directed XML integration is nontrivial!
23
Middleware for schema-directed integration

view definition language
optimization techniques
  • A lightweight language Attribute Integration
    Grammar (AIG)
  • Cost-based optimization in light of
    context-dependency

DTD constraints
semantic attributes

semantic rules
24
Attribute Integration Grammar (AIG)
  • DTD element type definitions e ? ?
  • ? PCDATA ? e1, , en
    e1 en e
  • Attributes associated with each element type e
  • Inh(e) inherited from parent/siblings
    (top-down/sideways)
  • Syn(e) synthesized from children (bottom-up)
  • Syn(e), Inh(e) tuple or set/bag-valued
  • Rules associated with each production e ? ?, for
    e in ?
  • Inh(child) Inh(e) Q(Inh(parent),
    Syn(sibling) )
  • Syn(parent) Syn(e) U (Syn(children))
    -- union
  • Q multi-source SQL query with parameters
  • Dependency e2 must be evaluated before e1 if
  • Inh(e1) Q( Syn(e2) ) -- acyclic graph
    (DAG)

25
AIG semantics conceptual evaluation
  • following the dependency ordering starting from
    the root
  • report ? patient
  • Inh(patient) ? Q1 (Inh(report))
  • select Inh(report) as date, p.SSN,
    p.name, p.policy
  • from DB1 patient p, DB1
    visitInfo v
  • where p.SSN v.SSN and v.date
    Inh(report)
  • Recall DB1 patient (SSN, name, policy),
    visitInfo (SSN, trId, date)
  • Parameter in a query Inh(report) as a constant
  • Data driven the number of patients depends on Q1

report
Inh
patient
patient
patient
26
Multi-source query
  • patient ? SSN, name, treatments, bill
  • Inh(SSN) Inh(patient).SSN, . . . ,
  • Inh(treatments) Q2(v) -- v
    Inh(patient)
  • select t.trId, t.tname
  • from DB1 visitInfo i, DB3 cover
    c, DB4 treatment t
  • where i.SSN v.SSN and i.date
    v.date and t.trId i.trId
  • and c.trId i.trId and c.policy
    v.policy
  • a single query uses DB1, DB3 and DB4
  • tuple- and set-valued attributes (Inh(SSN),
    Inh(treatments))
  • Recall DB1 patient(SSN, name, policy),
    visitInfo(SSN, trId, date)
  • DB3 cover (policy, trId)
  • DB4 treatment (trId, name),
    procedure (trId1, trId2)

27
Initial top-down pass context-dependent
  • patient ? SSN, name, treatments, bill
  • Inh(SSN) Inh(patient).SSN, Inh(name)
    Inh(patient).name,
  • Inh(treatments) Q2(Inh(patient))
  • Inh(bill) Syn(treatments) lt--
    halt
  • DTD-directed generate children following the
    production
  • Inh(bill) defined with sibling --
    Syn(treatments), dependency ordering evaluate
    bill after treatments

lt- - halt
28
Initial top-down pass recursion
  • treatments ? treatment
  • Inh(treatment) ? Inh(treatments) -- set of
    (trId, tname)
  • Data driven treatments expansion depends on
    Inh(treatments)
  • empty expansion terminates, Syn(treatments) is
    empty
  • nonempty expands.

29
Leaf step
  • treatment ? trId, tname, treatments
  • Inh(trId) Inh(treatment).trId, . . .,
  • trId ? PCDATA

Syn(trId) Inh(trId)
treatments
Inh (trId, tname)
treatment
treatment
treatments
tname
trId
Inh
30
Bottom-up step synthesize attributes
  • treatment ? trId, tname, treatments
  • treatments ? treatment
  • Syn(treatments) U Syn(treatment)
  • Processing of an element e Inh(e) ? subtree(e) ?
    Syn(e)

Syn(treatment) Syn(trId) ? Syn(treatments)
31
Sideways step controlled derivation
  • patient ? SSN, name, treatments, bill
  • Inh(bill) Syn(treatments)
  • bill ? item
  • Inh(item) ? Q(Inh(bill) )
  • select trId, price
  • from DB2 billing
  • where trId in Inh(bill) -- set
    membership test
  • DTD-directed each step of construction follows a
    production

Recall DB2 billing (trId, price)
32
Constraint compilation
  • Captured with rules on synthesized attributes of
    patient
  • trIdB bag-valued, collecting trIds under item
  • trIdS1 set-valued, collecting trIds under
    treatment
  • trIdS2 set-valued, collecting trIds under item
  • key patient ( item.trId ? item )
  • unique (Syn(patient).trIdB) -- no
    duplicates in the bag
  • foreign key patient ( treatment.trId ?
    item.trId )
  • subset ( Syn(patient).trIdS1,
    Syn(patient).trIdS2)
  • compilation semantic rules and attributes for
    constraints are automatically generated and
    evaluated

33
Advantages of AIG
  • DTD-directed view definition automatically
    ensures conformance to DTD -- recursive,
    nondeterministic
  • Constraint compilation automatically captures
    integrity constraints in a uniform framework
  • performance avoid post-materialization checking
  • optimization jointly with query evaluation
  • exception handling actions when constraints are
    violated
  • Controlled derivation supports context-dependent
    generation.
  • Information passing top-down, bottom-up,
    sideways
  • Multi-source queries optimizer-based
    decomposition
  • One sweep each node is visited at most twice
    evaluates its inherited attribute, subtree, then
    its synthesized attribute

34
Middleware evaluation of AIGs
optimizer
AIG
XML
merging
query plan execution
pre-processing
tagging
scheduling
data
query
DB3
cost statistics
DB2
  • pre-processing
  • constraint compilation
  • multi-source query decomposition ? single-source
    queries
  • optimizer query plan generation using cost
    statistics
  • execution SQL queries ? data sources, results
    ? mediator
  • tagging relational tables (paths from root) ?
    XML view via merge-sorting
  • DB1

35
Optimization
  • Goal reduce response time
  • costs query execution, data transfer, storage
    (caching)
  • constraints query dependency graph (DAG)
  • nodes queries computing inherited/synthesized
    attributes
  • edges dependency relation (producer-consumer
    relation)
  • recursion iterative unfolding by a certain depth
  • Optimization techniques
  • query merging
  • query scheduling

36
Query scheduling
  • Goal reduce the total response time by
    increasing parallelism
  • Ordering execution of queries on the same site
  • e.g., ltQ4, Q5, Q3gt vs. ltQ3, Q5, Q4gt on DB2
  • Constraints
  • costs execution, communication, caching
    overheads
  • dependency relation
  • Finding an optimal schedule NP-hard

37
Query merging
  • Goal reduce DB visits, leverage DB optimizer
  • Composition of queries on the same site
    outer-join/union
  • Tradeoffs
  • large result tables with null
    communication/caching cost
  • impact on scheduling -- changing query dependency
    graph

Finding an optimal strategy for merging and
scheduling NP-hard
38
Cost-based heuristic scheduling
  • cost estimate given a fixed schedule, estimate
    the completion time of a query Q, comp_time(Q)
  • statistics eval_cost(Q), size(Q), trans_cost(S,
    S, size)
  • dependency Q cant start before comp_time(Q) if
    Q -gt Q
  • scheduling given a fixed query dependency graph,
    a heuristic based on dynamic programming to
  • find the most costly trailing path of each
    query
  • sort queries to favor critical paths

39
Cost-based heuristic merging
  • greedy algorithm repeat until no further
    improvement
  • merge each pair of queries on the same source
  • modify the query dependency graph accordingly
  • invoke scheduling w.r.t. the modified dependency
  • estimate the cost
  • pick the pair with the biggest improvement to
    merge

Interaction between query scheduling and merging
40
Preliminary experimental study
Card(3-way self join) 4055 Card(4-way self
join) 6837
  • Data set running example insurance/hospital
  • recursion unfolding (depth) 3, 4, 5 --
    treatments
  • data size small, medium, large generated via
    ToXgene
  • System
  • DB2 v8.1 for Linux
  • simulation of data transfer bandwidth1Mbps

41
Experimental results ratio of improvement
438/193
61/34
42
AIG Summary
  • AIG a novel specification language
  • ensures DTD-conformance (recursion/nondeterminism)
  • captures integrity constraints in a uniform
    framework
  • supports complex transformations controlled
    derivation (context-dependency), multi-source
    queries, . . .
  • Optimization techniques nontrivial optimization
    problems
  • constraint compilation, multi-source query
    decomposition
  • query scheduling w.r.t. query dependency graph
  • query merging and its interaction with scheduling

A systematic framework for schema-directed XML
integration
43
XML Integration
  • Warehouse vs. mediator introduction
  • AIG (warehouse)
  • Attribute Integration Grammar schema-directed
    integration
  • Optimization techniques
  • Enosys (mediator)
  • Active XML (hybrid)

44
Enosys a mediated system
  • XMLizer wrapper, converting source to virtual
    XML view
  • Mediator export virtual integrated XML (VIX)
    database
  • Translator rewrites XML queries to intermediate
    algebraic exps
  • Rewriter decompose
  • multi-source queries
  • Optimizer query plan
  • generation
  • Execution sends
  • requests to wrappers,
  • compose results,
  • tagging,

query
Mediator
Translator
Rewriter/optimizer
Execution engine
XMLizer
XMLizer
XMLizer
source
source
source
45
Active XML a hybrid of warehouse and mediator
  • XML doc template a mix of
  • data nodes -- materialization
  • function nodes embedded calls to
  • Web services, queries -- virtual
  • Integration
  • data nodes concrete data (static)
  • function nodes extract up-to-date data (dynamic)
  • Data exchange materialization before/after
    sending the doc
  • After smaller doc (less transmission cost) --
    Web services can be accessed from the receiver
    site
  • Before for security (access control) and
    capability reasons
  • Optimization very hard
  • Target-schema conformance? Fixed doc template

46
Summary and review
  • What are main approaches to integrating XML data?
    What are the major difficulties?
  • For what applications warehousing is preferable
    to a mediator based approach?
  • Understand AIG, Enosys and Active XML
  • Given an AIG and relational sources, you should
    be able to
  • understand how AIG works by providing the
    integrated XML data
  • understand the optimization and evaluation
    process
  • Can you combine Active XML and ATG to ensure
    target schema-conformance?
  • Find and read papers about the GUI of Enosys
  • Find and read papers on data integration based on
    Active XML
Write a Comment
User Comments (0)
About PowerShow.com