Data Quality - PowerPoint PPT Presentation

1 / 41
About This Presentation
Title:

Data Quality

Description:

Data-driven GUIs. Data dependence is specified in a collection of rules ... Data-driven GUIS. First, look for all independent attributes this is class 0 ... – PowerPoint PPT presentation

Number of Views:37
Avg rating:3.0/5.0
Slides: 42
Provided by: davidl116
Learn more at: https://cs.nyu.edu
Category:
Tags: data | guis | quality

less

Transcript and Presenter's Notes

Title: Data Quality


1
Data Quality
  • Class 5

2
Goals
  • Project
  • Data Quality Rules (Continued)
  • Example
  • Use of Data Quality Rules

3
Data Quality Rules Classes
  • 1)      Null value rules
  • 2)      Value rules
  • 3)      Domain membership rules
  • 4)      Domain Mappings
  • 5)      Relation rules
  • 6)      Table, Cross-table, and Cross-message
    assertions
  • 7)      In-Process directives
  • 8)      Operational Directives
  • 9)      Other rules

4
Representing Data Quality Rules
  • Data is divided into 2 sets
  • conformers
  • violators
  • Sets can be represented using SQL
  • Create SQL statements representing violating set

5
Using SQL
  • Direct queries
  • Embedded queries
  • Using ODBC/JDBC, can create validation scripts in
  • C
  • C
  • Java
  • Visual Basic
  • Etc.

6
Null Value Representations
  • Maintain a table of null representation types and
    names
  • create table nullreps (
  • name varchar(30),
  • nulltype char(1),
  • description varchar(1024),
  • source varchar(512),
  • nullval varchar(100),
  • nullrepid integer
  • )

7
Null Value Rules
  • Allows nulls
  • If the rule is allows nulls without any
    additional characterization
  • Nothing necessary
  • If the rule is allows nulls, but only of a
    specific type
  • Must check for real nulls (and possibly blanks
    and spaces)
  • SELECT from lttablegt WHERE lttablegt.ltattributegt
    is NULL

8
Null Value Rules
  • Does not allow nulls
  • Must check for nulls(and possibly blanks and
    spaces)
  • SELECT from lttablegt WHERE lttablegt.ltattributegt
    is NULL

9
Value Rules
  • Value rule is specified as some set of
    constraints
  • Makes use of operators and functions
  • , -, , /, lt, lt, gt, gt, !, , AND, OR
  • User defined functions
  • Example
  • value gt 0 AND value lt 100

10
Value Rules 2
  • Validation test is opposite of constraint
  • Use DeMorgans laws
  • If constraint was value gt 0 AND value lt 100),
    use
  • SELECT from lttablegt where lttablegt.ltattributegt lt
    0 OR
  • lttablegt.ltattributegt gt 100

11
Domain Membership
  • Domains are stored in a database table
  • Test for domain membership of an attribute is a
    test to make sure that all values are represented
    in domain table

12
Domain Reference Tables
  • create table domainref (
  • name varchar(30),
  • dtype char(1),
  • description varchar(1024),
  • source varchar(512),
  • domainid integer
  • )

13
Domain Reference Tables
  • create table domainvals (
  • domainid integer,
  • value varchar(128)
  • )

14
Domain Membership
  • Test for membership of attribute foo in the
    domain named bar
  • SELECT from lttablegt where foo not in
  • (SELECT value from domainvals where domainid
  • (SELECT domainid from domainref where
    domainref.name bar))

15
Domain Assignment
  • The values in the attribute define the domain
  • Find all the values not in the domain already
  • Update domain tables with those values

16
Domain Assignment 2
  • SELECT from lttablegt where foo not in
  • (SELECT value from domainvals where domainid
  • (SELECT domainid from domainref where
    domainref.name bar))
  • For all values in this set, create a record with
    (the value, the domain id for bar), and insert
    into domainvals.

17
Mapping Membership
  • Similar to domain membership, except
  • Must include domain membership tests for both
    values
  • Also must be looked up in the mapping tables

18
Completeness
  • Defines when a record is complete
  • Ex IF (Orders.Total gt 0.0), Complete With
  • Orders.Billing_Street,
  • Orders.Billing_City,
  • Orders.Billing_State,
  • Orders.Billing_ZIP
  • Format
  • Condition
  • List of fields that must be complete

19
Completeness 2
  • Equivalent to a set of null tests using condition
  • Select from lttablegt where ltcondition is truegt
    and ltlist of not null testsgt

20
Exemption
  • Defines which fields may be missing
  • IF (Orders.Item_Class ! CLOTHING) Exempt
  • Orders.Color,
  • Orders.Size
  • Format
  • Condition
  • List of fields that must be complete

21
Exemption 2
  • If condition is true, the fields may be null
  • Therefore, if condition is false, fields may not
    be null
  • Equivalent for test of opposite of condition and
    test for nulls

22
Consistency
  • Define a relationship between attributes based on
    field content
  • IF (Employees.title Staff Member) Then
    (Employees.Salary gt 20000 AND Employees.Salary lt
    30000)
  • Format
  • Condition
  • Assertion

23
Consistency 2
  • If condition is true, the assertion must be true
  • Equivalent to test for cases where the condition
    is true and the assertion is false
  • Select from lttablegt where ltconditiongt and not
    ltassertiongt

24
Derivation
  • Prescriptive form of consistency rule
  • Details how one attributes value is determined
    based on other attributes
  • IF (Orders.NumberOrdered gt 0) Then
  • Orders.Total (Orders.NumberOrdered
    Orders.Price) 1.05
  • Format
  • Condition
  • assignment

25
Derivation 2
  • The assigned fields must be updated if condition
    is true
  • Find all records where the condition is true
  • Generate update SQL calls with updated values
  • Execute updates

26
Functional Dependence
  • Functional Dependence between columns X and Y
  • For any two records R1 and R2 in a table,
  • if field X of record R1 contains value x and
    field X of record R2 contains the same value x,
    then if field Y of record R1 contains the value
    y, then field Y of record R2 must contain the
    value y.
  • In other words, attribute Y is said to be
    determined by attribute X.

27
Functional Dependence 2
  • Rule Format
  • Attribute X determines Attribute Y
  • Validation test makes sure that the functional
    dependence criterion is met
  • This means that if we extract the X value from
    the set of all distinct value pairs, that set
    should have no duplicates

28
Functional Dependence 3
  • Create view FD as select distinct X, Y from
    lttablegt
  • Select count () from FD
  • Select count (distinct X) from lttablegt
  • These should be the same numbers.

29
Primary Key/Uniqueness
  • A set of attributes defined as a primary key must
    uniquely identify a record
  • Can also be viewed as a uniqueness constraint
  • Format
  • attribute list is PRIMARY
  • attribute list is UNIQUE

30
Primary
  • Test to make sure that the number of distinct
    records with the expected key is the same as the
    number of records
  • Select count() from lttablegt
  • Select count (distinct ltattribute listgt) from
    lttablegt
  • These numbers should be the same

31
Uniqueness
  • Test for multiple record occurrences with the
    same set of values that should have been unique,
    if there is a separate known primary key
  • SELECT lttablegt.ltattributegt, lttablegt.ltattributegt
  • FROM lttablegt AS t1, lttablegt AS t2
  • WHERE t1.ltattributegt t2.ltattributegt and
    t1.ltprimarygt ltgt t2.ltprimarygt

32
Foreign Key
  • When the values in field f in table T is chosen
    from the key values in field g in table S, field
    S.g is said to be a foreign key for field T.f
  • If f is a foreign key, the key must exist in
    table S, column g (referential integrity)

33
Foreign Key 2
  • Similar to primary key
  • Test is to make sure that all values in foreign
    key field exist in target table
  • Select from ltsource tablegt where ltattributegt
    not in (Select distinct ltattributegt from lttarget
    tablegt)

34
Use of Data Quality Rules
  • Data Validation
  • Root Cause Analysis
  • Message Transformation
  • Data-driven GUIs
  • Metadata Collection

35
Data Validation
  • Translate rule set into select statements
  • Create a program that
  • Loads select statements into an array, indexed by
    a unique integer
  • Connects to database via ODBC
  • Iterates through the array of select statements
    those results

36
Data Validation 2
  • Each type of rule has an expected result check
    against the expected result
  • Outputs the result of each statement to output
    file, tagged by rule identifier
  • Results can be tallied to yield an overall
    percentage of valid records to total records

37
Root Cause Analysis
  • Root cause analysis can be started by looking at
    the counts of violated rules
  • Use the most frequently violated rule as a
    starting place

38
Message Transformation
  • Electronic Data Interchange
  • Use DQ rules to validate incoming messages
  • Use DQ rules (derivations, mappings) to transform
    incoming messages into an internal format

39
Data-driven GUIs
  • Data dependence is specified in a collection of
    rules
  • Generate equivalence classes of data values based
    on dependence specification

40
Data-driven GUIS
  • First, look for all independent attributes this
    is class 0
  • For class i, collect all attributes that depend
    on class (i 1)
  • The GUI will be constructed to iteratively
    request data from class 0..n
  • Based on the results from collecting data at step
    j, the rules associated with the actual values
    are applied, determining which values are
    requested at step j 1

41
Metadata Collection
  • Use domain and mapping derivation rules to
    collect metadata
  • Use other rules as a documentation of business
    operations
Write a Comment
User Comments (0)
About PowerShow.com