Title: Conditional Dependencies
1Conditional Dependencies
- Wenfei Fan
- University of Edinburgh
- and
- Bell Laboratories
2 Outline of Part III
- Conditional functional dependencies (CFDs)
- Motivation for extending FDs with conditions
data cleaning - Syntax and semantics
- Static analysis satisfiability, implication,
axiomatizability - Conditional inclusion dependencies (CINDs)
- Motivation data cleaning and schema matching
- Syntax and semantics
- Static analysis satisfiability, implication,
axiomatizability - Algorithms and open research issues
- SQL techniques for inconsistency detection
- Heuristic for satisfiability and implication
checking - Repair
3 Conditional functional dependencies (CFDs)
- Conditional functional dependencies (CFDs)
- Motivation for extending FDs with conditions
data cleaning - Syntax and semantics
- Static analysis satisfiability, implication,
axiomatizability - Conditional inclusion dependencies (CINDs)
- Motivation data cleaning and schema matching
- Syntax and semantics
- Static analysis consistency, implication,
axiomatizability - Algorithms and open research issues
- SQL techniques for inconsistency detection
- Heuristic for satisfiability and implication
checking - Repair
4Data in real-life is often dirty
- Errors, conflicts and inconsistencies
- Australia 500,000 dead people retain active
Medicare cards - US Pentagon asked 275 dead/wounded officers to
re-enlist. - UK there are 81 million National Insurance
numbers but only 60 million eligible citizens. - It is estimated that in a 500,000 customer
database, 120,000 customer records become invalid
within a year, due to deaths, divorces,
marriages, moves. - typical data error rate in industry 1 - 5, up
to 30 - . . .
5Dirty data is costly
- Poor data costs US companies 600 billion
annually - Wrong price data in retail databases costs US
customers 2.5 billion each year - AAA improves data quality by 20, and saves
150,000 in postage stamps alone - 30-80 of the development time for data cleaning
in a data integration project - and dont forget
- CIA intelligence on WMD in Iraq!
The need for (semi-)automated methods to clean
data!
6Characterizing the consistency of data
- One of the central technical problems is how to
tell whether the data is dirty or clean - Specify consistency using integrity constraints
- Inconsistencies emerge as violations of
constraints - Constraints considered so far traditional
- functional dependencies
- inclusion dependencies
- denial constraints (a special case of full
dependencies) - . . .
- Question are these traditional dependencies
sufficient?
7Example customer relation
- Schema Cust(country, area-code, phone, street,
city, zip) - Instance
- functional dependencies (FDs)
- custcountry, area-code, phone ? custstreet,
city, zip - custcountry, area-code ? custcity
- The database satisfies the FDs. Is the data
consistent?
8Capturing inconsistencies in the data
- cust (country 44, zip ? street)
- In the UK, zip code uniquely determines the
street - The constraint may not hold for other countries
- It expresses a fundamental part of the semantics
of the data - It can NOT be expressed as a traditional FD
- It does not hold on the entire relation instead,
it holds on tuples representing UK customers only
9Two more constraints
- cust(country 44, area-code 131, phone ?
street, zip, city EDI) - cust(country 01, area-code 908, phone ?
street, zip, city MH) - In the UK, if the area code is 131, then the city
has to be EDI - In the US, if the area code is 908, then the city
has to be MH - t1, t2 and t3 violate these constraints
- refining cust(country, area-code, phno ?
street, city, zip) - combining constants and variables
10The need for new constraints
- cust(country 44, zip ? street)
- cust(country 44, area-code 131, phone ?
street, zip, city EDI) - cust(country 01, area-code 908, phone ?
street, zip, city MH) - They capture inconsistencies that traditional FDs
cannot detect - Traditional constraints were developed for
schema design, not for data cleaning! - Data integration in real-life source constraints
- hold on a subset of sources
- hold conditionally on the integrated data
- They are NOT expressible as traditional FDs
- do not hold on the entire relation
- contain constant data values, besides logical
variables
11 Conditional Functional Dependencies (CFDs)
- An extension of traditional FDs (R X ? Y, Tp)
- X ? Y embedded traditional FD on R
- Tp a pattern tableau
- attributes X ? Y
- each tuple in Tp consists of constants and
unnamed variable _ - Example cust(country 44, zip ? street)
- (cust (country, zip ? street), Tp)
- pattern tableau Tp
12 Example CFDs
- Represent
- cust(country 44, area-code 131, phone ?
street, zip, city EDI) - cust(country 01, area-code 908, phone ?
street, zip, city MH) - cust(country, area-code, phone ? street,
city, zip) - as a SINGLE CFD
- (cust(country, area-code, phone ? street, city,
zip), Tp) - pattern tableau Tp one tuple for each constraint
13 Traditional FDs as a special case
- Express
- custcountry, area-code ? custcity
- as a CFD
- (cust(country, area-code, ? city), Tp)
- pattern tableau Tp a single tuple consisting of
_ only - CFDs subsume traditional FDs
14 Semantics of CFDs
- a ? b (a matches b) if
- either a or b is _
- both a and b are constants and a b
- tuple t1 matches t2 t1 ? t2
- (a, b) ? (a, _), but (a, b) does not match (a,
c) - DB satisfies (R X ? Y, Tp) iff for any tuple tp
in the pattern tableau Tp and for any tuples t1,
t2 in DB, if t1X t2X ? tpX, then t1Y
t2Y ? tpY - tpX identifying the set of tuples on which the
constraint tp applies, ie, t tX ? tpX - t1Y t2Y ? tpY enforcing the embedded FD,
and the pattern of tp
15 Example violation of CFDs
- cust(country 44, zip ? street)
- Tuples t1 and t2 violate the CFD
- t1country, zip t2country, zip ? tpcountry,
zip - t1street ? t2street
- The CFD applies to t1 and t2 since they match
tpcountry, zip
16 Violation of CFDs by a single tuple
- (cust(country, area-code ? city), Tp)
- Tuple t1 does not satisfy the CFD
- t1country, area-code t1country, area-code ?
tp1country, area-code - t1city t1city however, t1city does not
match tp1city - In contrast to traditional FDs, a single tuple
may violate a CFD
17 CFDs vs. conditional tables
- Conditional tables, Codd tables and variable
tables have been studied for incomplete
information - Conditional tables representing infinitely many
relation instances, one for each instantiation of
variables - Pattern tableau in a CFD each pattern tuple is a
constraint, and all constraints applying to the
same relation instance - Relational table, traditional dependencies and
CFDs - One end of the spectrum relations consisting of
data only - The other end of the spectrum traditional
dependencies defined in terms of logic variables - CFD in the between, both data values and logic
variables - CFDs enforcing binding of semantically related
data values
18Dirty constraints?
- A set of CFDs may be inconsistent!
- Inconsistent (R(A ? B), Tp)
- In any nonempty database DB and for any tuple t
in DB, - tp1 tB must be b
- tp2 tB must be c
- Inconsistent if b and c are different
- inconsistent ? ?1, ?2 , ?1 (R(A ? B),
Tp1), ?2 (R(B ? A), Tp2) - Why?
Tp
19The satisfiability problem
- The satisfiability problem for CFDs is to
determine, given a set ? of CFDs, whether or not
there exists a nonempty database DB that
satisfies ?, i.e., for any ? in ?, DB satisfies
?. - Whether or not ? makes sense
- For traditional FDs, it is not an issue one can
specify any FDs without worrying about their
consistency - In contrast, a set of CFDs may be inconsistent!
20The complexity of the satisfiability analysis
- Theorem. The satisfiability problem for CFDs is
NP-complete. - Nontrivial contrast this with the trivial
consistency analysis of FDs! - Proof idea
- Upper bound the small model property if ? is
satisfiable, then there is DB that satisfies ?
and consists of a single tuple! - Lower bound reduction from the non-tautology
problem - Good news PTIME special cases
- Theorem. Given a set ? of CFDs on a relation
schema R, the satisfiability of ? can be
determined in O( ?2) time if either - the schema R is predefined (fixed), or
- no attributes in ? have a finite domain
- Proof idea an extension of chase for CFDs
21The implication problem
- The implication problem for CFDs is to determine,
given a set ? of CFDs and a single CFD ?, whether
? implies ?, denoted by ? ?, i.e., for any
database DB, if DB satisfies ?, then DB satisfies
?. - Example
- ? ?1, ?2 , ?1 (R(A ? B), Tp1), ?2
(R(B ? C), Tp2) - ? (R(A ? C), Tp)
- ? ?. Why?
Tp1
Tp2
22The complexity of the implication problem
- For traditional FDs, the implication problem is
in linear time - In contrast, the implication problem for CFDs is
intractable - Theorem. The implication problem for CFDs is
coNP-complete. - Tractable special cases
- Theorem. Given a set ? of CFDs and a single CFD ?
on a relation schema R, whether ? ? can be
determined in O(( ? ? )2) time if either - the schema R is predefined, or
- no attributes in ? and ? have a finite domain
- Proof idea an extension of chase for CFDs
23Finite axiomatizability Flashback
- Armstrongs axioms can be found in every database
textbook - Reflexivity If Y ? X, then X ? Y
- Augmentation If X ? Y , then XZ ? YZ
- Transitivity If X ? Y and Y ? Z, then X ? Z
- Sound and complete for FD implication, i.e, ?
? iff ? can be inferred ? from using
reflexivity, augmentation, transitivity. - Question is there a sound and complete inference
system for the implication analysis of CFDs?
24Finite axiomatizability of CFDs
- Theorem. There is a sound and complete inference
system I for implication analysis of CFDs - Sound if ? - ?, i.e., ? can be proved from ?
using I, then ? ? - Complete if ? ?, then ? - ? using I
- The inference system is more involved than its
counterpart for traditional FDs, namely,
Armstrongs axioms. - There are 5 axioms.
- A normal form of CFDs (R X ? A, tp), tp is a
single pattern tuple.
25Axioms for CFDs extension of Armstrongs axioms
- Reflexivity If A ? X, then (R X ? A, tp), where
or
- Augmentation If (X ? A, tp) and B ? attr(R),
then (BX ? A, tp)
tp
tp
26Axioms for CFDs transitivity
- Transitivity if (A1,,Ak ? B1,,Bm, tp)
and (B1,,Bm ? C1,,Cn, tp)
match
(A1,,Ak ? C1,,Cn, tp)
27Axioms for CFDs reduction
- reduction if (B, X ? A, tp), tpB _,
and tpA a
then (X ? A, tp)
28Axioms for CFDs finite domain upgrade
- upgrade if only consistent values for B are b1,
b2, . . . , bn, dom(B) b1, , bn, , bm, and
(R A1, . . . ,Ak, B ? A, tp)
then (R A1, . . . ,Ak, B ? A, tp)
29Static analyses CFD vs. FD
- in the absence of finite-domain attributes
Theorem In the absence of finite-domain
attributes, Reflexivity, Augmentation,
Transitivity and Reduction are sound and
complete for CFD implication
- complications finite-domain attributes,
interaction between - satisfiability and implication analyses
30 Conditional Inclusion dependencies (CINDs)
- Conditional functional dependencies (CFDs)
- Motivation for extending FDs with conditions
data cleaning - Syntax and semantics
- Static analysis satisfiability, implication,
axiomatizability - Conditional inclusion dependencies (CINDs)
- Motivation data cleaning and schema matching
- Syntax and semantics
- Static analysis consistency, implication,
axiomatizability - Algorithms and open research issues
- SQL techniques for inconsistency detection
- Heuristic for satisfiability and implication
checking - Repair
31Example Amazon database
- Schema
- order(asin, title, type, price, country, county)
-- source - book(asin, isbn, title, price, format)
-- target - CD(asin, title, price, genre)
- asin Amazon standard identification number
- Instances
order
book
CD
32Schema matching
- Inclusion dependencies from source to target
(e.g., Clio)
- Traditional inclusion dependencies
- orderasin, title, price ? bookasin, title,
price - orderasin, title, price ? CDasin, title,
price - These inclusion dependencies do not make sense!
33Schema matching dependencies with conditions
- Conditional inclusion dependencies
- orderasin, title, price type book ?
bookasin, title, price - orderasin, title, price type CD ? CDasin,
title, price - orderasin, title, price ? bookasin, title,
price holds only if type book - orderasin, title, price ? CDasin, title,
price holds only if type CD - The constraints do not hold on the entire order
table
34Date cleaning with conditional dependencies
- CIND1 orderasin, title, price type book
? bookasin, title, price - CIND2 orderasin, title, price type CD ?
CDasin, title, price - Tuple t1 violates CIND1
- Tuple t2 violates CIND2
order
CD
book
35More on data cleaning
CD
book
- CDasin, title, price genre a-book ?
bookasin, title, price format audio - Inclusion relation CDasin, title, price ?
bookasin, title, price holds only if genre
a-book, i.e., when the CD is an audio book - In addition, the format of the corresponding book
must be audio a pattern for the referenced tuple
36 Conditional Inclusion Dependencies (CINDs)
- (R1X Xp ? R2Y Yp, Tp)
- R1X ? R2Y embedded traditional IND from R1
to R2 - Tp a pattern tableau
- attributes X ? Xp ? Y ? Yp
- tuples in Tp consist of constants and unnamed
variable _ - Example express
- CIND1 orderasin, title, price type book
? bookasin, title, price - (orderasin, title, price type ? bookasin,
title, price nil, Tp) - nil empty list
- pattern tableau Tp
37 Examples CINDs
- CIND2 orderasin, title, price type CD ?
CDasin, title, price - CIND3 CDasin, title, price genre a-book
? bookasin, title, price format audio
- (orderasin, title, price type ? CDasin,
title, price nil, Tp) - (CDasin, title, price genre ? bookasin,
title, price format, Tp)
38 Traditional CINDs as a special case
- R1X ? R2Y
- X A1, , An
- Y B1, , Bn
- As a CIND (R1X nil ? R2Y nil, Tp)
- pattern tableau Tp a single tuple consisting of
_ only - CINDs subsume traditional INDs
39 Semantics of CINDs
- DB (DB1, DB2), where DBj is an instance of Rj,
j 1, 2. - DB satisfies (R1X Xp ? R2Y Yp, Tp) iff for
any tuples t1 in DB1, and any tuple tp in the
pattern tableau Tp, if t1X, Xp ? tpX, Xp,
then there exists t2 in DB2 such that - t1Y t2Y (traditional IND semantics)
- t2Y, Yp ? tpY, Yp (matching the pattern
tuple on Y, Yp) - Patterns
- t1X, Xp ? tpX, Xp identifying the set of R1
tuples on which tp applies t1 t1X, Xp ?
tpX, Xp - t2Y, Yp ? tpY, Yp enforcing the embedded IND
and the constraint specified by patterns Y, Yp
40 Example
- (CDasin, title, price genre ? bookasin,
title, price format, Tp) - The following DB satisfies the CIND
book
CD
41More examples
- CIND1 (orderasin, title, price type ?
bookasin, title, price nil, Tp) - The following DB violates CIND1. Why?
order
book
CD
42The satisfiability problem for CINDs
- The satisfiability problem for CINDs is to
determine, given a set ? of CINDs, whether or not
there exists a nonempty database DB that
satisfies ?, i.e., for any ? in ?, DB satisfies
?. - Recall
- Any set of traditional INDs is always
satisfiable! - For CFDs, the satisfiability problem is
intractable. - In contrast.
- Theorem. Any set of CINDs is always satisfiable!
- Despite the increased expressive power, the
complexity of the satisfiability analysis does
not go up.
43The implication problem for CINDs
- The implication problem for CINDs is to decide,
given a set ? of CINDs and a single CIND ?,
whether ? implies ? (? ?). - For traditional INDs, the implication problem is
PS PACE-complete - For CINDs, the complexity does not hike up, to an
extent - Theorem. For CINDs containing no finite-domain
attributes, the implication problem is
PSPACE-complete - In the general setting, however, we have to pay a
price - Theorem. The implication problem for CINDs is
EXPTIME-complete - Proof idea
- Lower bound reduction from two-player tiling
game - Upper bound an extension of the chase for CINDs
44Finite axiomatizability of CINDs
- Rules for inferring IND implication
- Reflexivity If RX ? RX
- Projection and Permutation If R1A1, , Ak ?
R2B1, , Bk, then R1Ai1, , Aik ? R2Bi1, ,
Bik, - Transitivity If R1X ? R2Y and R2Y ? R3Z,
then R1X ? R3Z - Sound and complete for IND implication
- CINDs retain the finite axiomatizability
- Theorem. There is a sound and complete inference
system for implication analysis of CINDs - There are 8 axioms.
45Inference rules for CINDs
- Normal form of CINDs (R1X Xp ? R2Y Yp,
tp), - tp is a single pattern tuple
- tpA is a constant iff A is in Xp or Yp (tpB
_ if B is in X or Y) - Inference rules
- Reflexivity (RX nil ? RX nil, tp), where
- Projection and permutation If (R1X Xp ?
R2Y Yp, tp), then (R1X Xp ? R2Y Yp,
tp), for any permutation of X, Xp
tp
tp
46Axioms for CINDs transitivity
- Transitivity if (R1X Xp ? R2Y Yp, tp),
and (R2Y Yp ? R3Z Zp, tp),
equal
(R1X Xp ? R3Z Zp, tp)
47Axioms for CINDs downgrading
- downgrading if (R1X, A Xp ? R2Y, B Yp,
tp),
(R1X Xp, A ? R2Y Yp, B, tp)
48Axioms for CINDs augmentation
- augmentation if (R1X Xp ? R2Y Yp, tp), A
? attr(R1),
(R1X Xp, A ? R2Y Yp, tp)
49Axioms for CINDs reduction
- reduction if (R1X Xp ? R2Y Yp, B, tp),
then (R1X Xp ? R2Y Yp, tp),
50Axioms for CFDs finite domain reduction
- F-reduction if (R1X Xp, A ? R2Y Yp, tp),
- dom(A) a1,, an
then (R1X Xp ? R2Y Yp, tp),
51Axioms for CFDs finite domain upgrade
- upgrade if (R1X Xp, A ? R2Y, B Yp, tp),
- dom(A) a1,, an
then (R1X,A Xp ? R2Y, B Yp, tp),
52Static analyses CIND vs. IND
- in the absence of finite-domain attributes
Theorem In the absence of finite-domain
attributes, Reflexivity, Projection and
Permutation, Transitivity, Augmentation,
Downgrading and Reduction are sound and complete
for CIND implication
CINDs retain most complexity bounds of their
traditional counterpart
53CFDs and CINDs taken together
- We need both CFDs and CINDs for
- data cleaning
- schema matching
- Theorem. The implication problem for CFDs and
CINDs is undecidable - Not surprising The implication problem for
traditional FDs and INDs is already undecidable - Theorem. The consistency problem for CFDs and
CINDs is undecidable - In contrast, any set of traditional FDs and INDs
is consistent! - Proof idea induction from the implication
problem for FDs and INDs
54Static analyses CFD CIND vs. FD IND
- CINDs and CFDs properly subsume FDs and INDs
- Both the satisfiability analysis and
implication analysis are - beyond reach in practice
- This calls for effective heuristic methods
55 Algorithms and open research issues
- Conditional functional dependencies (CFDs)
- Motivation for extending FDs with conditions
data cleaning - Syntax and semantics
- Static analysis satisfiability, implication,
axiomatizability - Conditional inclusion dependencies (CINDs)
- Motivation data cleaning and schema matching
- Syntax and semantics
- Static analysis consistency, implication,
axiomatizability - Algorithms and open research issues
- SQL techniques for inconsistency detection
- Heuristic for satisfiability and implication
checking - Repair
56Detecting CFD Violations
CFD (cust(country, area-code, phone ? street,
city, zip), Tp)
detection
57Detecting CFD violations
- Input a set ? of CFDs and a database DB
- Output the set of tuples in DB that violate at
least one CFD in ? - Approach automatically generate SQL queries to
find violations - Complication 1 consider (R X ? Y, Tp), the
pattern tableau may be large (recall that each
tuple in Tp is in fact a constraint) - Goal the size of the SQL queries is independent
of Tp - Trick treat Tp as a data table
- CINDs can be checked along the same lines
58Single CFD step 1
- A pair of SQL queries, treating Tp as a data
table - Single-tuple violation (pattern matching)
- Multi-tuple violations (traditional FDs)
(cust(country, area-code, phone ? street, city,
zip), Tp)
- Single-tuple violation Qc
- select
- from R t, Tp tp
- where tcountry ? tpcountry AND
tarea-code ? tparea-code - AND tphone ? tpphone
- (tstreet tpstreet OR tcity
tpcity OR tzip tpzip)) - not matching
- tA1 ? tpA1 (tA1 tpA1 OR tpA1 _)
59Single CFD step 2
(cust(country, area-code, phone ? street, city,
zip), Tp)
- Multi-tuple violations (the semantics of
traditional FDs) Qv - select distinct t.country, t.area-code,
t.phone - from R t, Tp tp
- where tcountry ? tpcountry AND
tarea-code ? tparea-code - AND tphone ? tpphone
- group by t.country, t.area-code, t.phone
- having count(distinct street, city, zip)
1 - Tp is treated as a data table
60Multiple CFDs
- Complication 2 if the set ? has n CFDs, do we
use 2n SQL queries, and thus 2n passes of the
database DB? - Goal
- 2 SQL queries no matter how many CFDs are in ?
- the size of the SQL queries is independent of Tp
- Trick merge multiple CFDs into one
- Given (R X1 ? Y1, Tp1), (R X2 ? Y2, Tp2)
- Create a single pattern table Tm X1 ? X2 ? Y1
? Y2, - Introduce _at_, a dont-care variable, to populate
attributes of pattern tuples in X1 X2, etc
(tpA _at_) - Modify the pair of SQL queries by using Tm
61Handling multiple CFDs
CFD1 (area?state, T1)
CFDM(area,zip?state, TM)
CFD2 (zip?state, T2)
CFD3 (area,zip?state, T3)
Qc select from R t, TM tp where tarea
? tparea AND tzip ? tpzip AND tstate
tpstate Qv select distinct area, zip from
Macro group by area, zip having count(distinct
state) 1 Macro select (case tparea
when _at_ then _at_ else tarea end) as area . .
. from R t, TM tp where tarea
? tparea AND tzip ? tpzip AND tpstate _
62Keeping things tidy
Updating database
CFD (area?state, T)
- Tuple deletions
- A tuple deletion might remove violations.Tuples
that were dirty, before the deletion, might
become clean!
- Tuple insertions
- A tuple insertion might introduce
violations.Tuples that were clean, before the
insertion, might become dirty!
63Incremental inconsistency detection
- Input a set ? of CFDs, a database DB, the set V
of tuples in DB that violate ?, and changes ?DB
to DB - Output the set V of tuples in DB ?DB that
violate CFDs in ? - ?DB a set of tuples to be inserted into DB or
deleted from DB - Approaches
- Batch approach
- Compute DB DB ?DB
- Apply the SQL detection queries to DB
- Incremental approach compute change ? V such
that - the new violations V the old violations V
? V - Why?
64The need for incremental inconsistency detection
- Small ?DB to DB tends to incur only small changes
? V to V - more efficient to compute ? V then computing V
starting from scratch - Minimize unnecessary recomputation and traversal
of DB - Goal generate new SQL queries that perform
incremental detection - modify the SQL detection queries by leveraging
?DB and V - design and maintain auxiliary structures
(indexing, mark)
65Logging violations
CFD (area?state, T)
- Use one pair of columns, for each CFD
- Attribute BC records whether tuple t violates
query Qc of the CFD - Attribute BV records whether tuple t violates
query Qv of the CFD - Initialization
- update R t set tBC 1 where t in (QC)
- update R t set tBV 1 where t in (QV)
66Handling deletions
CFD (area?state, T)
How about batch deletions?
Let tdel be the tuple we want to delete from
R Step 1 delete from R t where t tdel Step
2 update R t set tBV 0 where tBV 1
AND tarea tdelarea AND 1
(select count(distinct state) from R t
where tarea tdelarea)
67Handling insertions
Step 1
Step 2
Step 3
CFD (area?state, T)
How about batch insertions?
Let tins be the tuple we want to insert into
R Step 1 insert into R values tins Step
2 update R t set tBC 1 where t tins
AND exists ( select from T tp where
tarea ? tparea AND tstate tpstate) Step
3 update R t set tBV 1, tinsBV 1
where tarea tinsarea AND tstate ?
tinsstate AND exists ( select from T
tp where tinsarea ? tparea AND tpstate _)
68Handling batch insertions
R
Let Rins be the tuples we want to insert into
R Step 1 Find tuples in Rins which violate
Qc Step 2 Find clean tuples in R that
become dirty, due to some tuple(s) in Rins
Step 3 Find tuples in Rins that become
dirty due to some dirty tuple(s) in R Step
4 Find clean tuples in Rins that violate the
CFD Step 5 Insert Rins in R
Step 2
Rins
Step 4
Step 3
Step 1
69The source code
- Let Rins be the tuples we want to insert into R
- Step 1 update Rins tins set tinsBC 1
where tins in (QC) - Step 2 update R t set tBV 1 where tBC
0 AND tBV 0 AND - exists (select from T where tarea ?
tparea AND tpstate _) AND - exists (select from Rins tins
where tinsarea tarea AND -
tinsstate ? tstate) - Step 3 update Rins tins set tinsBV 1 where
- exists (select from R t where tinsarea
tarea AND tBV 1) - Step 4 update Rins tins set tinsBV 1 where
tinsBC 0 AND tinsBV 0 AND - tinsarea IN (select area from Rins tins,
Tp tp where - tinsBC 0 AND
tinsBV 0 AND tinsarea ? tparea AND - tpstate _
- group by area
having count(distinct state) 1) - Step 5 insert into R values (select from
Rins)
70Scalability in Instance Size
71Scalability in NumConsts
72Scalability in Noise
73Merging CFDs
74Incremental Deletions
75Incremental Batch Deletions
76Incremental Insertions
77Incremental Batch Insertions
78Checking the satisfiability of CFDs
- Input a set ? of CFDs
- MAXSC find a maximum subset of ? that is
consistent - Complexity the MAXSC problem for CFDs is
NP-complete - Theorem there is an ?-approximation algorithm
for MAXSC - there exist constant ? such that for the subset
?m found by the algorithm has a bound card(?m)
? card(OPT(?)) - Proof idea approximation factor preserving
reduction to MAXGSAT - Open questions effective heuristic algorithms
- for checking the satisfiability of CFDs CINDs
(undecidable) - for determining implication of CFDs, CINDs, and
CFDs CINDs - for finding minimum cover of CFDs, CINDs, and
CFDs CINDs
79Automated methods for finding a repair
- Input a relational database DB, and a set ? of
CFDs - Output a repair DB of DB such that cost(DB,
DB) is minimal - repair DB satisfies ?
- good cost(DB, DB)
- DB is close to the original data in DB
- Minimizing changes to accurate attributes
- Complexity Finding an optimal repair is
- NP-complete (data complexity) for traditional
FDs, for a fixed set of FDs (or INDs) and fixed
schema - PSPACE-complete for CFDs CINDs (combined
complexity) - Open questions effective heuristic for repairing
databases based on CFDs, CINDs, and CFDs CINDs
80Incremental repair
- Input a clean database DB, changes ?DB to DB,
and a set ? of CFDs - Output a repair DB of DB ?DB
- Complexity. The local data cleaning problem is
NP-hard, even if ?DB consists of a single tuple. - Open questions find effective heuristic
algorithms for incrementally repairing databases
based on - CFDs
- CINDs
- CFDs CINDs
81Discovering CFDs and CINDs
- Input Sample databases of a schema R
- Output CFDs and CINDs that hold on all (or most)
database instances of R - Difficulty. A naïve approach may find
non-representative CFDs and CINDs as large as the
sample data - Open questions find effective method for
discovering - CFDs
- CINDs
- CFDs CINDs
82Summary
- Conditional functional dependencies
- for data cleaning rather than schema design
- complexity bounds of satisfiability and
implication analyses - a sound and complete inference system
- Conditional inclusion dependencies
- for data cleaning and schema matching in practice
- complexity bounds of satisfiability and
implication analyses - a sound and complete inference system
- Complexity bounds for CFDs and CINDs taken
together - SQL techniques for automatic detection of CFD
violations - a pair of SQL queries for validating multiple
CFDs - incremental techniques for validating CFDs
A practical method for data cleaning and schema
matching
83References
- Conditional Functional Dependencies for Data
Cleaning The 23rd International Conference on
Database Engineering (ICDE), 2007.Philip
Bohannon, Wenfei Fan, Floris Geerts, Xibei Jia,
Anastasios Kementsietsidis - Extending Dependencies with Conditions Loreto
Bravo, Wenfei Fan, Shuai Ma - Improving Data Quality Consistency and Accuracy
Gao Cong, Wenfei Fan, Floris Geerts, Xibei Jia,
Shuai Ma - Conditional Functional Dependencies for Capturing
Data Inconsistencies Wenfei Fan, Floris Geerts,
Xibei Jia, Anastasios Kementsietsidis