Conditional Dependencies - PowerPoint PPT Presentation

About This Presentation
Title:

Conditional Dependencies

Description:

Schema: Cust(country, area-code, phone, street, city, zip) Instance: 07974. NYC. Mountain Ave ... zip. city. street. phone. Area-code. country. 10. The need ... – PowerPoint PPT presentation

Number of Views:284
Avg rating:3.0/5.0
Slides: 84
Provided by: cseBu
Learn more at: https://cse.buffalo.edu
Category:

less

Transcript and Presenter's Notes

Title: Conditional Dependencies


1
Conditional Dependencies
  • Wenfei Fan
  • University of Edinburgh
  • and
  • Bell Laboratories

2

Outline of Part III
  • Conditional functional dependencies (CFDs)
  • Motivation for extending FDs with conditions
    data cleaning
  • Syntax and semantics
  • Static analysis satisfiability, implication,
    axiomatizability
  • Conditional inclusion dependencies (CINDs)
  • Motivation data cleaning and schema matching
  • Syntax and semantics
  • Static analysis satisfiability, implication,
    axiomatizability
  • Algorithms and open research issues
  • SQL techniques for inconsistency detection
  • Heuristic for satisfiability and implication
    checking
  • Repair

3

Conditional functional dependencies (CFDs)
  • Conditional functional dependencies (CFDs)
  • Motivation for extending FDs with conditions
    data cleaning
  • Syntax and semantics
  • Static analysis satisfiability, implication,
    axiomatizability
  • Conditional inclusion dependencies (CINDs)
  • Motivation data cleaning and schema matching
  • Syntax and semantics
  • Static analysis consistency, implication,
    axiomatizability
  • Algorithms and open research issues
  • SQL techniques for inconsistency detection
  • Heuristic for satisfiability and implication
    checking
  • Repair

4
Data in real-life is often dirty
  • Errors, conflicts and inconsistencies
  • Australia 500,000 dead people retain active
    Medicare cards
  • US Pentagon asked 275 dead/wounded officers to
    re-enlist.
  • UK there are 81 million National Insurance
    numbers but only 60 million eligible citizens.
  • It is estimated that in a 500,000 customer
    database, 120,000 customer records become invalid
    within a year, due to deaths, divorces,
    marriages, moves.
  • typical data error rate in industry 1 - 5, up
    to 30
  • . . .

5
Dirty data is costly
  • Poor data costs US companies 600 billion
    annually
  • Wrong price data in retail databases costs US
    customers 2.5 billion each year
  • AAA improves data quality by 20, and saves
    150,000 in postage stamps alone
  • 30-80 of the development time for data cleaning
    in a data integration project
  • and dont forget
  • CIA intelligence on WMD in Iraq!

The need for (semi-)automated methods to clean
data!
6
Characterizing the consistency of data
  • One of the central technical problems is how to
    tell whether the data is dirty or clean
  • Specify consistency using integrity constraints
  • Inconsistencies emerge as violations of
    constraints
  • Constraints considered so far traditional
  • functional dependencies
  • inclusion dependencies
  • denial constraints (a special case of full
    dependencies)
  • . . .
  • Question are these traditional dependencies
    sufficient?

7
Example customer relation
  • Schema Cust(country, area-code, phone, street,
    city, zip)
  • Instance
  • functional dependencies (FDs)
  • custcountry, area-code, phone ? custstreet,
    city, zip
  • custcountry, area-code ? custcity
  • The database satisfies the FDs. Is the data
    consistent?

8
Capturing inconsistencies in the data
  • cust (country 44, zip ? street)
  • In the UK, zip code uniquely determines the
    street
  • The constraint may not hold for other countries
  • It expresses a fundamental part of the semantics
    of the data
  • It can NOT be expressed as a traditional FD
  • It does not hold on the entire relation instead,
    it holds on tuples representing UK customers only

9
Two more constraints
  • cust(country 44, area-code 131, phone ?
    street, zip, city EDI)
  • cust(country 01, area-code 908, phone ?
    street, zip, city MH)
  • In the UK, if the area code is 131, then the city
    has to be EDI
  • In the US, if the area code is 908, then the city
    has to be MH
  • t1, t2 and t3 violate these constraints
  • refining cust(country, area-code, phno ?
    street, city, zip)
  • combining constants and variables

10
The need for new constraints
  • cust(country 44, zip ? street)
  • cust(country 44, area-code 131, phone ?
    street, zip, city EDI)
  • cust(country 01, area-code 908, phone ?
    street, zip, city MH)
  • They capture inconsistencies that traditional FDs
    cannot detect
  • Traditional constraints were developed for
    schema design, not for data cleaning!
  • Data integration in real-life source constraints
  • hold on a subset of sources
  • hold conditionally on the integrated data
  • They are NOT expressible as traditional FDs
  • do not hold on the entire relation
  • contain constant data values, besides logical
    variables

11

Conditional Functional Dependencies (CFDs)
  • An extension of traditional FDs (R X ? Y, Tp)
  • X ? Y embedded traditional FD on R
  • Tp a pattern tableau
  • attributes X ? Y
  • each tuple in Tp consists of constants and
    unnamed variable _
  • Example cust(country 44, zip ? street)
  • (cust (country, zip ? street), Tp)
  • pattern tableau Tp

12

Example CFDs
  • Represent
  • cust(country 44, area-code 131, phone ?
    street, zip, city EDI)
  • cust(country 01, area-code 908, phone ?
    street, zip, city MH)
  • cust(country, area-code, phone ? street,
    city, zip)
  • as a SINGLE CFD
  • (cust(country, area-code, phone ? street, city,
    zip), Tp)
  • pattern tableau Tp one tuple for each constraint

13

Traditional FDs as a special case
  • Express
  • custcountry, area-code ? custcity
  • as a CFD
  • (cust(country, area-code, ? city), Tp)
  • pattern tableau Tp a single tuple consisting of
    _ only
  • CFDs subsume traditional FDs

14

Semantics of CFDs
  • a ? b (a matches b) if
  • either a or b is _
  • both a and b are constants and a b
  • tuple t1 matches t2 t1 ? t2
  • (a, b) ? (a, _), but (a, b) does not match (a,
    c)
  • DB satisfies (R X ? Y, Tp) iff for any tuple tp
    in the pattern tableau Tp and for any tuples t1,
    t2 in DB, if t1X t2X ? tpX, then t1Y
    t2Y ? tpY
  • tpX identifying the set of tuples on which the
    constraint tp applies, ie, t tX ? tpX
  • t1Y t2Y ? tpY enforcing the embedded FD,
    and the pattern of tp

15

Example violation of CFDs
  • cust(country 44, zip ? street)
  • Tuples t1 and t2 violate the CFD
  • t1country, zip t2country, zip ? tpcountry,
    zip
  • t1street ? t2street
  • The CFD applies to t1 and t2 since they match
    tpcountry, zip

16

Violation of CFDs by a single tuple
  • (cust(country, area-code ? city), Tp)
  • Tuple t1 does not satisfy the CFD
  • t1country, area-code t1country, area-code ?
    tp1country, area-code
  • t1city t1city however, t1city does not
    match tp1city
  • In contrast to traditional FDs, a single tuple
    may violate a CFD

17

CFDs vs. conditional tables
  • Conditional tables, Codd tables and variable
    tables have been studied for incomplete
    information
  • Conditional tables representing infinitely many
    relation instances, one for each instantiation of
    variables
  • Pattern tableau in a CFD each pattern tuple is a
    constraint, and all constraints applying to the
    same relation instance
  • Relational table, traditional dependencies and
    CFDs
  • One end of the spectrum relations consisting of
    data only
  • The other end of the spectrum traditional
    dependencies defined in terms of logic variables
  • CFD in the between, both data values and logic
    variables
  • CFDs enforcing binding of semantically related
    data values

18
Dirty constraints?
  • A set of CFDs may be inconsistent!
  • Inconsistent (R(A ? B), Tp)
  • In any nonempty database DB and for any tuple t
    in DB,
  • tp1 tB must be b
  • tp2 tB must be c
  • Inconsistent if b and c are different
  • inconsistent ? ?1, ?2 , ?1 (R(A ? B),
    Tp1), ?2 (R(B ? A), Tp2)
  • Why?

Tp
19
The satisfiability problem
  • The satisfiability problem for CFDs is to
    determine, given a set ? of CFDs, whether or not
    there exists a nonempty database DB that
    satisfies ?, i.e., for any ? in ?, DB satisfies
    ?.
  • Whether or not ? makes sense
  • For traditional FDs, it is not an issue one can
    specify any FDs without worrying about their
    consistency
  • In contrast, a set of CFDs may be inconsistent!

20
The complexity of the satisfiability analysis
  • Theorem. The satisfiability problem for CFDs is
    NP-complete.
  • Nontrivial contrast this with the trivial
    consistency analysis of FDs!
  • Proof idea
  • Upper bound the small model property if ? is
    satisfiable, then there is DB that satisfies ?
    and consists of a single tuple!
  • Lower bound reduction from the non-tautology
    problem
  • Good news PTIME special cases
  • Theorem. Given a set ? of CFDs on a relation
    schema R, the satisfiability of ? can be
    determined in O( ?2) time if either
  • the schema R is predefined (fixed), or
  • no attributes in ? have a finite domain
  • Proof idea an extension of chase for CFDs

21
The implication problem
  • The implication problem for CFDs is to determine,
    given a set ? of CFDs and a single CFD ?, whether
    ? implies ?, denoted by ? ?, i.e., for any
    database DB, if DB satisfies ?, then DB satisfies
    ?.
  • Example
  • ? ?1, ?2 , ?1 (R(A ? B), Tp1), ?2
    (R(B ? C), Tp2)
  • ? (R(A ? C), Tp)
  • ? ?. Why?

Tp1
Tp2
22
The complexity of the implication problem
  • For traditional FDs, the implication problem is
    in linear time
  • In contrast, the implication problem for CFDs is
    intractable
  • Theorem. The implication problem for CFDs is
    coNP-complete.
  • Tractable special cases
  • Theorem. Given a set ? of CFDs and a single CFD ?
    on a relation schema R, whether ? ? can be
    determined in O(( ? ? )2) time if either
  • the schema R is predefined, or
  • no attributes in ? and ? have a finite domain
  • Proof idea an extension of chase for CFDs

23
Finite axiomatizability Flashback
  • Armstrongs axioms can be found in every database
    textbook
  • Reflexivity If Y ? X, then X ? Y
  • Augmentation If X ? Y , then XZ ? YZ
  • Transitivity If X ? Y and Y ? Z, then X ? Z
  • Sound and complete for FD implication, i.e, ?
    ? iff ? can be inferred ? from using
    reflexivity, augmentation, transitivity.
  • Question is there a sound and complete inference
    system for the implication analysis of CFDs?

24
Finite axiomatizability of CFDs
  • Theorem. There is a sound and complete inference
    system I for implication analysis of CFDs
  • Sound if ? - ?, i.e., ? can be proved from ?
    using I, then ? ?
  • Complete if ? ?, then ? - ? using I
  • The inference system is more involved than its
    counterpart for traditional FDs, namely,
    Armstrongs axioms.
  • There are 5 axioms.
  • A normal form of CFDs (R X ? A, tp), tp is a
    single pattern tuple.

25
Axioms for CFDs extension of Armstrongs axioms
  • Reflexivity If A ? X, then (R X ? A, tp), where

or
  • Augmentation If (X ? A, tp) and B ? attr(R),
    then (BX ? A, tp)

tp
tp
26
Axioms for CFDs transitivity
  • Transitivity if (A1,,Ak ? B1,,Bm, tp)

and (B1,,Bm ? C1,,Cn, tp)
match
(A1,,Ak ? C1,,Cn, tp)
27
Axioms for CFDs reduction
  • reduction if (B, X ? A, tp), tpB _,
    and tpA a

then (X ? A, tp)
28
Axioms for CFDs finite domain upgrade
  • upgrade if only consistent values for B are b1,
    b2, . . . , bn, dom(B) b1, , bn, , bm, and
    (R A1, . . . ,Ak, B ? A, tp)

then (R A1, . . . ,Ak, B ? A, tp)
29
Static analyses CFD vs. FD
  • General setting
  • in the absence of finite-domain attributes

Theorem In the absence of finite-domain
attributes, Reflexivity, Augmentation,
Transitivity and Reduction are sound and
complete for CFD implication
  • complications finite-domain attributes,
    interaction between
  • satisfiability and implication analyses

30

Conditional Inclusion dependencies (CINDs)
  • Conditional functional dependencies (CFDs)
  • Motivation for extending FDs with conditions
    data cleaning
  • Syntax and semantics
  • Static analysis satisfiability, implication,
    axiomatizability
  • Conditional inclusion dependencies (CINDs)
  • Motivation data cleaning and schema matching
  • Syntax and semantics
  • Static analysis consistency, implication,
    axiomatizability
  • Algorithms and open research issues
  • SQL techniques for inconsistency detection
  • Heuristic for satisfiability and implication
    checking
  • Repair

31
Example Amazon database
  • Schema
  • order(asin, title, type, price, country, county)
    -- source
  • book(asin, isbn, title, price, format)
    -- target
  • CD(asin, title, price, genre)
  • asin Amazon standard identification number
  • Instances

order
book
CD
32
Schema matching
  • Inclusion dependencies from source to target
    (e.g., Clio)
  • Traditional inclusion dependencies
  • orderasin, title, price ? bookasin, title,
    price
  • orderasin, title, price ? CDasin, title,
    price
  • These inclusion dependencies do not make sense!

33
Schema matching dependencies with conditions
  • Conditional inclusion dependencies
  • orderasin, title, price type book ?
    bookasin, title, price
  • orderasin, title, price type CD ? CDasin,
    title, price
  • orderasin, title, price ? bookasin, title,
    price holds only if type book
  • orderasin, title, price ? CDasin, title,
    price holds only if type CD
  • The constraints do not hold on the entire order
    table

34
Date cleaning with conditional dependencies
  • CIND1 orderasin, title, price type book
    ? bookasin, title, price
  • CIND2 orderasin, title, price type CD ?
    CDasin, title, price
  • Tuple t1 violates CIND1
  • Tuple t2 violates CIND2

order
CD
book
35
More on data cleaning
CD
book
  • CDasin, title, price genre a-book ?
    bookasin, title, price format audio
  • Inclusion relation CDasin, title, price ?
    bookasin, title, price holds only if genre
    a-book, i.e., when the CD is an audio book
  • In addition, the format of the corresponding book
    must be audio a pattern for the referenced tuple

36

Conditional Inclusion Dependencies (CINDs)
  • (R1X Xp ? R2Y Yp, Tp)
  • R1X ? R2Y embedded traditional IND from R1
    to R2
  • Tp a pattern tableau
  • attributes X ? Xp ? Y ? Yp
  • tuples in Tp consist of constants and unnamed
    variable _
  • Example express
  • CIND1 orderasin, title, price type book
    ? bookasin, title, price
  • (orderasin, title, price type ? bookasin,
    title, price nil, Tp)
  • nil empty list
  • pattern tableau Tp

37

Examples CINDs
  • CIND2 orderasin, title, price type CD ?
    CDasin, title, price
  • CIND3 CDasin, title, price genre a-book
    ? bookasin, title, price format audio
  • (orderasin, title, price type ? CDasin,
    title, price nil, Tp)
  • (CDasin, title, price genre ? bookasin,
    title, price format, Tp)

38

Traditional CINDs as a special case
  • R1X ? R2Y
  • X A1, , An
  • Y B1, , Bn
  • As a CIND (R1X nil ? R2Y nil, Tp)
  • pattern tableau Tp a single tuple consisting of
    _ only
  • CINDs subsume traditional INDs

39

Semantics of CINDs
  • DB (DB1, DB2), where DBj is an instance of Rj,
    j 1, 2.
  • DB satisfies (R1X Xp ? R2Y Yp, Tp) iff for
    any tuples t1 in DB1, and any tuple tp in the
    pattern tableau Tp, if t1X, Xp ? tpX, Xp,
    then there exists t2 in DB2 such that
  • t1Y t2Y (traditional IND semantics)
  • t2Y, Yp ? tpY, Yp (matching the pattern
    tuple on Y, Yp)
  • Patterns
  • t1X, Xp ? tpX, Xp identifying the set of R1
    tuples on which tp applies t1 t1X, Xp ?
    tpX, Xp
  • t2Y, Yp ? tpY, Yp enforcing the embedded IND
    and the constraint specified by patterns Y, Yp

40

Example
  • (CDasin, title, price genre ? bookasin,
    title, price format, Tp)
  • The following DB satisfies the CIND

book
CD
41
More examples
  • CIND1 (orderasin, title, price type ?
    bookasin, title, price nil, Tp)
  • The following DB violates CIND1. Why?

order
book
CD
42
The satisfiability problem for CINDs
  • The satisfiability problem for CINDs is to
    determine, given a set ? of CINDs, whether or not
    there exists a nonempty database DB that
    satisfies ?, i.e., for any ? in ?, DB satisfies
    ?.
  • Recall
  • Any set of traditional INDs is always
    satisfiable!
  • For CFDs, the satisfiability problem is
    intractable.
  • In contrast.
  • Theorem. Any set of CINDs is always satisfiable!
  • Despite the increased expressive power, the
    complexity of the satisfiability analysis does
    not go up.

43
The implication problem for CINDs
  • The implication problem for CINDs is to decide,
    given a set ? of CINDs and a single CIND ?,
    whether ? implies ? (? ?).
  • For traditional INDs, the implication problem is
    PS PACE-complete
  • For CINDs, the complexity does not hike up, to an
    extent
  • Theorem. For CINDs containing no finite-domain
    attributes, the implication problem is
    PSPACE-complete
  • In the general setting, however, we have to pay a
    price
  • Theorem. The implication problem for CINDs is
    EXPTIME-complete
  • Proof idea
  • Lower bound reduction from two-player tiling
    game
  • Upper bound an extension of the chase for CINDs

44
Finite axiomatizability of CINDs
  • Rules for inferring IND implication
  • Reflexivity If RX ? RX
  • Projection and Permutation If R1A1, , Ak ?
    R2B1, , Bk, then R1Ai1, , Aik ? R2Bi1, ,
    Bik,
  • Transitivity If R1X ? R2Y and R2Y ? R3Z,
    then R1X ? R3Z
  • Sound and complete for IND implication
  • CINDs retain the finite axiomatizability
  • Theorem. There is a sound and complete inference
    system for implication analysis of CINDs
  • There are 8 axioms.

45
Inference rules for CINDs
  • Normal form of CINDs (R1X Xp ? R2Y Yp,
    tp),
  • tp is a single pattern tuple
  • tpA is a constant iff A is in Xp or Yp (tpB
    _ if B is in X or Y)
  • Inference rules
  • Reflexivity (RX nil ? RX nil, tp), where
  • Projection and permutation If (R1X Xp ?
    R2Y Yp, tp), then (R1X Xp ? R2Y Yp,
    tp), for any permutation of X, Xp

tp
tp
46
Axioms for CINDs transitivity
  • Transitivity if (R1X Xp ? R2Y Yp, tp),

and (R2Y Yp ? R3Z Zp, tp),
equal
(R1X Xp ? R3Z Zp, tp)
47
Axioms for CINDs downgrading
  • downgrading if (R1X, A Xp ? R2Y, B Yp,
    tp),

(R1X Xp, A ? R2Y Yp, B, tp)
48
Axioms for CINDs augmentation
  • augmentation if (R1X Xp ? R2Y Yp, tp), A
    ? attr(R1),

(R1X Xp, A ? R2Y Yp, tp)
49
Axioms for CINDs reduction
  • reduction if (R1X Xp ? R2Y Yp, B, tp),

then (R1X Xp ? R2Y Yp, tp),
50
Axioms for CFDs finite domain reduction
  • F-reduction if (R1X Xp, A ? R2Y Yp, tp),
  • dom(A) a1,, an

then (R1X Xp ? R2Y Yp, tp),
51
Axioms for CFDs finite domain upgrade
  • upgrade if (R1X Xp, A ? R2Y, B Yp, tp),
  • dom(A) a1,, an

then (R1X,A Xp ? R2Y, B Yp, tp),
52
Static analyses CIND vs. IND
  • General setting
  • in the absence of finite-domain attributes

Theorem In the absence of finite-domain
attributes, Reflexivity, Projection and
Permutation, Transitivity, Augmentation,
Downgrading and Reduction are sound and complete
for CIND implication
CINDs retain most complexity bounds of their
traditional counterpart
53
CFDs and CINDs taken together
  • We need both CFDs and CINDs for
  • data cleaning
  • schema matching
  • Theorem. The implication problem for CFDs and
    CINDs is undecidable
  • Not surprising The implication problem for
    traditional FDs and INDs is already undecidable
  • Theorem. The consistency problem for CFDs and
    CINDs is undecidable
  • In contrast, any set of traditional FDs and INDs
    is consistent!
  • Proof idea induction from the implication
    problem for FDs and INDs

54
Static analyses CFD CIND vs. FD IND
  • CINDs and CFDs properly subsume FDs and INDs
  • Both the satisfiability analysis and
    implication analysis are
  • beyond reach in practice
  • This calls for effective heuristic methods

55

Algorithms and open research issues
  • Conditional functional dependencies (CFDs)
  • Motivation for extending FDs with conditions
    data cleaning
  • Syntax and semantics
  • Static analysis satisfiability, implication,
    axiomatizability
  • Conditional inclusion dependencies (CINDs)
  • Motivation data cleaning and schema matching
  • Syntax and semantics
  • Static analysis consistency, implication,
    axiomatizability
  • Algorithms and open research issues
  • SQL techniques for inconsistency detection
  • Heuristic for satisfiability and implication
    checking
  • Repair

56
Detecting CFD Violations
CFD (cust(country, area-code, phone ? street,
city, zip), Tp)
detection
57
Detecting CFD violations
  • Input a set ? of CFDs and a database DB
  • Output the set of tuples in DB that violate at
    least one CFD in ?
  • Approach automatically generate SQL queries to
    find violations
  • Complication 1 consider (R X ? Y, Tp), the
    pattern tableau may be large (recall that each
    tuple in Tp is in fact a constraint)
  • Goal the size of the SQL queries is independent
    of Tp
  • Trick treat Tp as a data table
  • CINDs can be checked along the same lines

58
Single CFD step 1
  • A pair of SQL queries, treating Tp as a data
    table
  • Single-tuple violation (pattern matching)
  • Multi-tuple violations (traditional FDs)

(cust(country, area-code, phone ? street, city,
zip), Tp)
  • Single-tuple violation Qc
  • select
  • from R t, Tp tp
  • where tcountry ? tpcountry AND
    tarea-code ? tparea-code
  • AND tphone ? tpphone
  • (tstreet tpstreet OR tcity
    tpcity OR tzip tpzip))
  • not matching
  • tA1 ? tpA1 (tA1 tpA1 OR tpA1 _)

59
Single CFD step 2
(cust(country, area-code, phone ? street, city,
zip), Tp)
  • Multi-tuple violations (the semantics of
    traditional FDs) Qv
  • select distinct t.country, t.area-code,
    t.phone
  • from R t, Tp tp
  • where tcountry ? tpcountry AND
    tarea-code ? tparea-code
  • AND tphone ? tpphone
  • group by t.country, t.area-code, t.phone
  • having count(distinct street, city, zip)
    1
  • Tp is treated as a data table

60
Multiple CFDs
  • Complication 2 if the set ? has n CFDs, do we
    use 2n SQL queries, and thus 2n passes of the
    database DB?
  • Goal
  • 2 SQL queries no matter how many CFDs are in ?
  • the size of the SQL queries is independent of Tp
  • Trick merge multiple CFDs into one
  • Given (R X1 ? Y1, Tp1), (R X2 ? Y2, Tp2)
  • Create a single pattern table Tm X1 ? X2 ? Y1
    ? Y2,
  • Introduce _at_, a dont-care variable, to populate
    attributes of pattern tuples in X1 X2, etc
    (tpA _at_)
  • Modify the pair of SQL queries by using Tm

61
Handling multiple CFDs
CFD1 (area?state, T1)
CFDM(area,zip?state, TM)
CFD2 (zip?state, T2)
CFD3 (area,zip?state, T3)
Qc select from R t, TM tp where tarea
? tparea AND tzip ? tpzip AND tstate
tpstate Qv select distinct area, zip from
Macro group by area, zip having count(distinct
state) 1 Macro select (case tparea
when _at_ then _at_ else tarea end) as area . .
. from R t, TM tp where tarea
? tparea AND tzip ? tpzip AND tpstate _
62
Keeping things tidy
Updating database
CFD (area?state, T)
  • Tuple deletions
  • A tuple deletion might remove violations.Tuples
    that were dirty, before the deletion, might
    become clean!
  • Tuple insertions
  • A tuple insertion might introduce
    violations.Tuples that were clean, before the
    insertion, might become dirty!

63
Incremental inconsistency detection
  • Input a set ? of CFDs, a database DB, the set V
    of tuples in DB that violate ?, and changes ?DB
    to DB
  • Output the set V of tuples in DB ?DB that
    violate CFDs in ?
  • ?DB a set of tuples to be inserted into DB or
    deleted from DB
  • Approaches
  • Batch approach
  • Compute DB DB ?DB
  • Apply the SQL detection queries to DB
  • Incremental approach compute change ? V such
    that
  • the new violations V the old violations V
    ? V
  • Why?

64
The need for incremental inconsistency detection
  • Small ?DB to DB tends to incur only small changes
    ? V to V
  • more efficient to compute ? V then computing V
    starting from scratch
  • Minimize unnecessary recomputation and traversal
    of DB
  • Goal generate new SQL queries that perform
    incremental detection
  • modify the SQL detection queries by leveraging
    ?DB and V
  • design and maintain auxiliary structures
    (indexing, mark)

65
Logging violations
CFD (area?state, T)
  • Use one pair of columns, for each CFD
  • Attribute BC records whether tuple t violates
    query Qc of the CFD
  • Attribute BV records whether tuple t violates
    query Qv of the CFD
  • Initialization
  • update R t set tBC 1 where t in (QC)
  • update R t set tBV 1 where t in (QV)

66
Handling deletions
CFD (area?state, T)
How about batch deletions?
Let tdel be the tuple we want to delete from
R Step 1 delete from R t where t tdel Step
2 update R t set tBV 0 where tBV 1
AND tarea tdelarea AND 1
(select count(distinct state) from R t
where tarea tdelarea)
67
Handling insertions
Step 1
Step 2
Step 3
CFD (area?state, T)
How about batch insertions?
Let tins be the tuple we want to insert into
R Step 1 insert into R values tins Step
2 update R t set tBC 1 where t tins
AND exists ( select from T tp where
tarea ? tparea AND tstate tpstate) Step
3 update R t set tBV 1, tinsBV 1
where tarea tinsarea AND tstate ?
tinsstate AND exists ( select from T
tp where tinsarea ? tparea AND tpstate _)
68
Handling batch insertions
R
Let Rins be the tuples we want to insert into
R Step 1 Find tuples in Rins which violate
Qc Step 2 Find clean tuples in R that
become dirty, due to some tuple(s) in Rins
Step 3 Find tuples in Rins that become
dirty due to some dirty tuple(s) in R Step
4 Find clean tuples in Rins that violate the
CFD Step 5 Insert Rins in R
Step 2
Rins
Step 4
Step 3
Step 1
69
The source code
  • Let Rins be the tuples we want to insert into R
  • Step 1 update Rins tins set tinsBC 1
    where tins in (QC)
  • Step 2 update R t set tBV 1 where tBC
    0 AND tBV 0 AND
  • exists (select from T where tarea ?
    tparea AND tpstate _) AND
  • exists (select from Rins tins
    where tinsarea tarea AND

  • tinsstate ? tstate)
  • Step 3 update Rins tins set tinsBV 1 where
  • exists (select from R t where tinsarea
    tarea AND tBV 1)
  • Step 4 update Rins tins set tinsBV 1 where
    tinsBC 0 AND tinsBV 0 AND
  • tinsarea IN (select area from Rins tins,
    Tp tp where
  • tinsBC 0 AND
    tinsBV 0 AND tinsarea ? tparea AND
  • tpstate _
  • group by area
    having count(distinct state) 1)
  • Step 5 insert into R values (select from
    Rins)

70
Scalability in Instance Size
71
Scalability in NumConsts
72
Scalability in Noise
73
Merging CFDs
74
Incremental Deletions
75
Incremental Batch Deletions
76
Incremental Insertions
77
Incremental Batch Insertions
78
Checking the satisfiability of CFDs
  • Input a set ? of CFDs
  • MAXSC find a maximum subset of ? that is
    consistent
  • Complexity the MAXSC problem for CFDs is
    NP-complete
  • Theorem there is an ?-approximation algorithm
    for MAXSC
  • there exist constant ? such that for the subset
    ?m found by the algorithm has a bound card(?m)
    ? card(OPT(?))
  • Proof idea approximation factor preserving
    reduction to MAXGSAT
  • Open questions effective heuristic algorithms
  • for checking the satisfiability of CFDs CINDs
    (undecidable)
  • for determining implication of CFDs, CINDs, and
    CFDs CINDs
  • for finding minimum cover of CFDs, CINDs, and
    CFDs CINDs

79
Automated methods for finding a repair
  • Input a relational database DB, and a set ? of
    CFDs
  • Output a repair DB of DB such that cost(DB,
    DB) is minimal
  • repair DB satisfies ?
  • good cost(DB, DB)
  • DB is close to the original data in DB
  • Minimizing changes to accurate attributes
  • Complexity Finding an optimal repair is
  • NP-complete (data complexity) for traditional
    FDs, for a fixed set of FDs (or INDs) and fixed
    schema
  • PSPACE-complete for CFDs CINDs (combined
    complexity)
  • Open questions effective heuristic for repairing
    databases based on CFDs, CINDs, and CFDs CINDs

80
Incremental repair
  • Input a clean database DB, changes ?DB to DB,
    and a set ? of CFDs
  • Output a repair DB of DB ?DB
  • Complexity. The local data cleaning problem is
    NP-hard, even if ?DB consists of a single tuple.
  • Open questions find effective heuristic
    algorithms for incrementally repairing databases
    based on
  • CFDs
  • CINDs
  • CFDs CINDs

81
Discovering CFDs and CINDs
  • Input Sample databases of a schema R
  • Output CFDs and CINDs that hold on all (or most)
    database instances of R
  • Difficulty. A naïve approach may find
    non-representative CFDs and CINDs as large as the
    sample data
  • Open questions find effective method for
    discovering
  • CFDs
  • CINDs
  • CFDs CINDs

82
Summary
  • Conditional functional dependencies
  • for data cleaning rather than schema design
  • complexity bounds of satisfiability and
    implication analyses
  • a sound and complete inference system
  • Conditional inclusion dependencies
  • for data cleaning and schema matching in practice
  • complexity bounds of satisfiability and
    implication analyses
  • a sound and complete inference system
  • Complexity bounds for CFDs and CINDs taken
    together
  • SQL techniques for automatic detection of CFD
    violations
  • a pair of SQL queries for validating multiple
    CFDs
  • incremental techniques for validating CFDs

A practical method for data cleaning and schema
matching
83
References
  • Conditional Functional Dependencies for Data
    Cleaning The 23rd International Conference on
    Database Engineering (ICDE), 2007.Philip
    Bohannon, Wenfei Fan, Floris Geerts, Xibei Jia,
    Anastasios Kementsietsidis
  • Extending Dependencies with Conditions Loreto
    Bravo, Wenfei Fan, Shuai Ma
  • Improving Data Quality Consistency and Accuracy
    Gao Cong, Wenfei Fan, Floris Geerts, Xibei Jia,
    Shuai Ma
  • Conditional Functional Dependencies for Capturing
    Data Inconsistencies Wenfei Fan, Floris Geerts,
    Xibei Jia, Anastasios Kementsietsidis
Write a Comment
User Comments (0)
About PowerShow.com