Title: Henning Christiansen
1 Optimizing integrity checking in data integration systems with simplification techniques
2 Simplification
- A technique for improving the efficiency of integrity checking in traditional databases
- Specializes ICs for a specific update, using the assumption that the DB is consistent before the update
- Suggested by J.-M. Nicolas, 1982
- Never became part of standard (R)DBMS
- Elaborated recently by the present authors
- Uncovered theoretical limitations of simp.
- General and powerful methods developed
- Typically an order of magnitude gained (or more)
3 This talk
- Considering how simplification can be used for integrity checking and maintenance in data integration systems/integrated databases
4 Picture of a DI system with ICs
[Figure: autonomous sources S1, ..., Sn with local constraints IC1, ..., ICn and cross-source constraints; the sources send messages about performed updates to a virtual global DB with constraints ICglobal.]
Trusted: IC1, ..., ICn, cross-source constraints. Desired: ICglobal
5 General definitions
- Database D = ⟨F, Δ, Γ⟩
- F: DB facts
- Δ: trusted constraints, F ⊨ Δ (IC1, ..., ICn, cross-source constraints)
- Γ: constraints to be checked (unfolded version of global ICs)
- Update: a set of literals (add, delete)
- A ∈ U ⇒ A ∉ F; ¬A ∈ U ⇒ A ∈ F
- Composition, negation and application of updates
- U ∘ V, ¬U, F ∘ U, D^U = ⟨F ∘ U, Δ, Γ⟩
- Props: F ∘ (U ∘ V) = (F ∘ U) ∘ V, F ∘ U ∘ ¬U = F
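The update operations above can be illustrated with a small sketch. It assumes that a literal is a pair (sign, atom), with sign True for an addition and False for a deletion, and that composition cancels literals that the second update reverses. The names `negate`, `apply_update` and `compose` are hypothetical and are not taken from the authors' implementation.

```python
def negate(update):
    """Return the inverse update: additions become deletions and vice versa."""
    return frozenset((not sign, atom) for sign, atom in update)


def apply_update(facts, update):
    """Apply an update to a set of facts (F o U)."""
    added = {atom for sign, atom in update if sign}
    deleted = {atom for sign, atom in update if not sign}
    # Well-formedness as on the slide: added facts are new, deleted facts exist.
    assert added.isdisjoint(facts) and deleted <= set(facts)
    return frozenset((set(facts) - deleted) | added)


def compose(u, v):
    """Compose updates (U o V): do U, then V; literals reversed by the other cancel."""
    return frozenset((u - negate(v)) | (v - negate(u)))


F = frozenset({"q(a)"})
U = frozenset({(True, "p(a)")})
V = frozenset({(False, "p(a)"), (True, "r(a)")})

# The two properties listed on the slide, checked on this instance.
print(apply_update(F, compose(U, V)) == apply_update(apply_update(F, U), V))
print(apply_update(apply_update(F, U), negate(U)) == F)
```

Representing an update as a frozenset keeps negation and composition as plain set operations, which is enough to check the two properties on concrete instances.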
6 Simplification framework
- Works for denial constraints, e.g. ← p(x) ∧ ¬∃y q(x,y)
- Parameterized update patterns, i.e. do simp. at design time
- Cases of recursion
- Constraints over aggregate values
- SQL type updates
- Produces convincing results
- Implemented, available on the web
- After^U(IC): constraints which in any DB D evaluate to the same truth value as IC in D^U
- Example: After^{p(a)}(← p(x) ∧ q(x)) ≡ ← (p(x) ∨ x = a) ∧ q(x)
- Optimize_Δ(Γ): a best Γ' such that for any D with D ⊨ Δ, D ⊨ Γ iff D ⊨ Γ'
- See D. Martinenghi's PhD thesis for
- an analysis of what "best" means
- a capable implementation of Optimize
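The After example can be checked by brute force on small databases. The sketch below assumes unary predicates p and q stored as Python sets. It compares the original denial evaluated in the updated state with the After-transformed denial evaluated in the old state. All function names are illustrative only.

```python
from itertools import chain, combinations


def powerset(items):
    """All subsets of a small collection, as sets."""
    items = list(items)
    subsets = chain.from_iterable(
        combinations(items, k) for k in range(len(items) + 1)
    )
    return [set(s) for s in subsets]


def violated_after_update(p, q, a):
    """Evaluate <- p(x) AND q(x) in the updated state, where p(a) has been added."""
    return bool((p | {a}) & q)


def violated_by_after(p, q, a):
    """Evaluate <- (p(x) OR x = a) AND q(x) in the old state."""
    return any((x in p or x == a) and x in q for x in p | q | {a})


def agree_everywhere(domain, a):
    """Check that both tests give the same answer for every p, q over the domain."""
    return all(
        violated_after_update(p, q, a) == violated_by_after(p, q, a)
        for p in powerset(domain)
        for q in powerset(domain)
    )


print(agree_everywhere({"a", "b", "c"}, "a"))
```

This only tests the equivalence exhaustively over a three-element domain; it is not a proof and does not perform the Optimize step.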
7 Exercise 1: Simplification for efficient IC enforcement in mono DB
- Example
- ← p(x) ∧ q(x)
- U = {p(a)}
- Simplified check: ← q(a)
- Given DB D = ⟨F, Γ⟩ and update U
- if D ⊨ Optimize_Γ(After^U(Γ)) then
- perform U
- else
- reject U
I.e., the ICs are optimized under the assumption that the current DB is consistent; the test is made before the update, so no "bad" update is ever executed! Tested on a wide range of examples: it works.
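A minimal sketch of the check-then-update loop for this slide's example, assuming the single constraint ← p(x) ∧ q(x) and the update pattern "add p(a)". The simplified check ← q(a) is hard-coded here; in the authors' setting it would be produced by the simplification procedure at design time. The class `GuardedDB` and its methods are hypothetical.

```python
class GuardedDB:
    """Toy database with unary relations p and q and the denial <- p(x) AND q(x)."""

    def __init__(self, p=(), q=()):
        self.p = set(p)
        self.q = set(q)
        # The simplified test is only sound if the starting state is consistent.
        assert self.full_check()

    def full_check(self):
        """Unsimplified check: scan for any x that is in both p and q."""
        return not (self.p & self.q)

    def insert_p(self, a):
        """Add p(a) only if the simplified check <- q(a) holds in the current state."""
        if a in self.q:  # <- q(a) is violated, so the update would break the IC
            return False  # reject U; the database is left untouched
        self.p.add(a)  # perform U
        return True


db = GuardedDB(p={"b"}, q={"c"})
print(db.insert_p("a"))  # accepted
print(db.insert_p("c"))  # rejected, since q(c) holds
print(db.full_check())   # still consistent
```

The simplified test is a single membership lookup, whereas the full check compares the whole of p against q, which is where the efficiency gain comes from.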
8 Exercise 2: Optimal IC check of DI system at "integration time"
- Given DI database D = ⟨F, Δ, Γ⟩
- F: DB facts
- Δ: trusted constraints, F ⊨ Δ (IC1, ..., ICn, cross-source constraints)
- Γ: constraints to be checked (unfolded version of global ICs)
- D consistent iff F ⊨ Optimize_Δ(Γ)
Example: ICi = ← pi(x) ∧ qi(x), i = 1, 2, global. Global p, q = union of local ones.
With Δ = IC1 ∪ IC2, the simp. check is ← p1(x) ∧ q2(x), ← p2(x) ∧ q1(x).
With Δ = IC1 ∪ IC2 ∪ "sources disjoint", the simp. check is true, i.e., integration can't go wrong.
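The two-source example can be sketched as follows, assuming each source is a dictionary holding its local p and q as sets and that each source already satisfies its own IC. The helper names are hypothetical. The full check tests the unfolded global constraint, while the simplified check only looks at the two cross-source combinations.

```python
def locally_consistent(source):
    """Trusted local constraint ICi: <- pi(x) AND qi(x)."""
    return not (source["p"] & source["q"])


def full_global_check(s1, s2):
    """Unfolded global IC: no x in (p1 U p2) and in (q1 U q2)."""
    return not ((s1["p"] | s2["p"]) & (s1["q"] | s2["q"]))


def simplified_check(s1, s2):
    """Only the cross-source cases <- p1(x) AND q2(x) and <- p2(x) AND q1(x)."""
    return not (s1["p"] & s2["q"]) and not (s2["p"] & s1["q"])


s1 = {"p": {1, 2}, "q": {3}}
s2 = {"p": {3}, "q": {5}}

# Both sources are locally consistent, so the two checks must agree.
print(locally_consistent(s1) and locally_consistent(s2))
print(full_global_check(s1, s2), simplified_check(s1, s2))
```

The simplified check never re-tests p1 against q1 or p2 against q2, because those cases are already covered by the trusted local constraints.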
9 Maintain consistent view of DI system using correction table
(preliminary work; no practical experience)
[Figure: autonomous sources S1, ..., Sn with local constraints IC1, ..., ICn and cross-source constraints; virtual corrections are applied to the sources to give a virtual global DB with constraints ICglobal.]
Task: Maintain the correction table so that ICglobal holds ⇒ provide a consistent global view
NB: Embury et al., 2001, have made an extensive study of CTs, but without simplification
10 Correction table, CT
- ← p(x) ∧ q(x), ← r(x) ∧ ¬q(x)
- F = {p(a), q(a), r(a)}
- Repairing instance ← p(a) ∧ q(a) by
- CT = {... ¬q(a) ...}
- creates another failing instance ← r(a) ∧ ¬q(a)
Known result; easy to find examples
- Def: A CT for a database D = ⟨F, Δ, Γ⟩ is an update R such that D^R ⊨ Δ ∪ Γ.
- R is minimal if no proper subset of R is a CT.
- Informally: A CT is a virtual update which, if executed, would restore consistency
- Problem statement: How to produce a CT and how to maintain it incrementally when updates are reported from the sources
Intuitively: Simp. removes all traces of the update, so we also need to consider CTs that undo part of the update (no time for an example; more later)
- Problems
- Exponentially many (minimal) CTs
- Correcting one problem may cause another
- Generating CTs from simplified checks only does not give us all relevant CTs
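The definition of a minimal CT can be illustrated by brute force on the example at the top of this slide. The sketch assumes facts are pairs (predicate, constant) and literals are pairs (sign, fact), and it hard-codes the two denials ← p(x) ∧ q(x) and ← r(x) ∧ ¬q(x). It enumerates every candidate update, which is exponential and only meant for toy instances. It is not the incremental method the talk is after, and all names are hypothetical.

```python
from itertools import combinations


def consistent(facts):
    """Check <- p(x) AND q(x) and <- r(x) AND NOT q(x) on a set of ground facts."""
    constants = {c for _, c in facts}
    for c in constants:
        if ("p", c) in facts and ("q", c) in facts:
            return False
        if ("r", c) in facts and ("q", c) not in facts:
            return False
    return True


def apply_update(facts, update):
    """Apply a set of (sign, fact) literals: True adds the fact, False deletes it."""
    result = set(facts)
    for sign, fact in update:
        if sign:
            result.add(fact)
        else:
            result.discard(fact)
    return frozenset(result)


def minimal_cts(facts, atoms):
    """Enumerate updates by increasing size; keep those that repair the DB
    and contain no smaller CT found earlier."""
    # One well-formed literal per atom: delete it if present, add it if absent.
    candidates = [(atom not in facts, atom) for atom in sorted(atoms)]
    found = []
    for size in range(len(candidates) + 1):
        for combo in combinations(candidates, size):
            r = frozenset(combo)
            if any(m <= r for m in found):
                continue  # a subset is already a CT, so r is not minimal
            if consistent(apply_update(facts, r)):
                found.append(r)
    return found


F = frozenset({("p", "a"), ("q", "a"), ("r", "a")})
for ct in minimal_cts(F, F):
    print(sorted(ct))
```

On this instance the search finds two minimal CTs: deleting p(a) alone, or deleting both q(a) and r(a). Deleting q(a) alone is rejected because it triggers the second constraint, which is the effect described on the slide.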
11 Relating CTs to simplification
- Assume updated state D^U with R' being a CT for D.
- Let Θ be a set of constraints with D^U ⊨ Θ.
- Then R is a CT for D^U iff
- D^U ⊨ Optimize_Θ(After^R(Δ ∪ Γ))
- where Θ are all constraints we know hold in D^U
- Θ = After^{¬U∘R'}(Γ) ∪ Δ ∪ After^{¬U}(Δ) ∪ After^{¬U∘R'}(Δ)
12 Special case: consistently signed ICs
- Def. consistently signed: no predicate occurs both as ← ...p(...)... and as ← ...¬p(...)...
- Let Θ be as in previous slide
- Σ = Optimize_Θ(Γ) .... (depends on old R' and U)
- S = Collect one literal from each instance σ ∈ Σ with D^U ⊭ σ
- Expected property
- Any minimal CT is a subset of ¬U ∪ ¬S
Example -->
13 Example
Notice: Σ evaluated, not Γ
- Γ = ← p(x) ∧ q(x)
- U = {p(a), p(b)}
- Σ = {← q(a), ← q(b)}
- S = {q(a), q(b)}
- ¬U ∪ ¬S = {¬p(a), ¬p(b), ¬q(a), ¬q(b)}
- Practical version: dialogue with data verification agent, e.g., human expert, "voting", rules of thumb (e.g., AGM postulates)
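The candidate set in this example can be computed with a short sketch. It assumes the single constraint ← p(x) ∧ q(x), an update that only adds p-facts, and a state after the update in which q(a) and q(b) both hold, so that both simplified instances are violated. The simplified instances ← q(c) are written out by hand rather than derived by a simplification procedure. All names are hypothetical.

```python
def simplified_instances(added_constants):
    """For <- p(x) AND q(x) and additions p(c), the simplified instances are <- q(c).
    Each instance is represented by its single literal q(c)."""
    return [("q", c) for c in sorted(added_constants)]


def candidate_literals(facts_after_update, added_constants):
    """Build NOT-U union NOT-S: undo part of the update, or delete a literal
    collected from a violated simplified instance."""
    # S: one literal from each simplified instance that fails in the updated state.
    s = {
        atom
        for atom in simplified_instances(added_constants)
        if atom in facts_after_update
    }
    not_u = {(False, ("p", c)) for c in added_constants}
    not_s = {(False, atom) for atom in s}
    return not_u | not_s


facts = {("p", "a"), ("p", "b"), ("q", "a"), ("q", "b")}
for literal in sorted(candidate_literals(facts, {"a", "b"})):
    print(literal)
```

Only the simplified instances are evaluated against the updated state, in line with the note on the slide. Which literals from the candidate set end up in the CT is left to the verification step mentioned in the last bullet.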
14 Maintenance of CTs, general case
- I.e., dropping the consistently signed requirement
- We can suggest a similar algorithm which requires
- repeated integrity checks
- repeated runtime application of the simp. procedure
- For practical purposes: an engineering job ahead
- keep track of signs and trace changes
- partial evaluation, etc.
- to generate a sort of decision tree with preproduced simp. checks
15 Conclusion
- Simplification is a technique that cuts the cost of integrity checking by orders of magnitude
- We have demonstrated
- effective and general simp. methods are possible
- simplification is relevant for DI systems
- Future work
- practical, large-scale implementations, both mono & DI (??)
- allow value modifications in CT (à la J. Wijsen)
- Further reading
- Simplification, theory and methods: DM's PhD thesis 2005; HC & DM, Funda. Inf. 2006
- Simp & DI: HC & DM, FoIKS'04, LAAIC'06